An HDFS-Based Centralized Storage and Retrieval System for Electronic Documents
Published: 2018-07-15 09:54
[Abstract]: With the advance of government informatization in China, electronic documents have developed rapidly, and the number of electronic documents produced in government work now exceeds the number of paper documents. Compared with paper documents, the management of electronic documents is still immature, particularly with respect to storage. Because electronic documents are easy to transmit and preserve, they no longer need to be stored in a geographically dispersed manner. Centralized storage can strengthen control over electronic documents, improve office efficiency, reduce staffing costs, and address problems such as document loss and leakage. How massive numbers of electronic documents are stored centrally, however, directly affects the implementation and efficiency of the whole system. Cloud storage is a networked online storage model in which data is kept in virtualized storage pools; hardware permitting, it can offer almost unlimited low-cost storage capacity, and it can handle the centralized storage of massive numbers of electronic documents efficiently. The Hadoop Distributed File System (HDFS), an open-source cloud storage file system based on the design of the Google File System (GFS), has become a focus of cloud storage research because of its performance and reliability when handling very large files. Electronic documents in e-government, however, are predominantly small files, and HDFS performs poorly when storing and accessing massive numbers of small files.
To address this weakness, this thesis proposes a strategy that uses a storage cache and a read cache to improve the efficiency of storing and accessing massive numbers of small files. The basic idea is to design and implement HDFS middleware that meets storage and access requirements while reducing the number of HDFS accesses. The storage cache strategy sets up multiple buffers; when small files are stored, an optimized choice among the buffers raises buffer utilization and thereby reduces the number of HDFS accesses. The read cache strategy manages a fixed-size read cache as a whole using a buddy system and assigns an efficiency threshold to each cache segment. The threshold controls the cache update policy so as to maximize cache utilization, so that file accesses are served from the read cache whenever possible and fewer accesses reach HDFS. For security, the thesis uses multi-level encryption to protect the confidentiality and privacy of electronic documents during centralized storage and access. Finally, a prototype system is implemented and tested to demonstrate the feasibility and usability of these ideas.
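The storage cache idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how a buffer is chosen, so the best-fit rule used here is an assumption, and `Buffer`, `MultiBufferStore`, and the `write_blob` callback (a stand-in for a single HDFS write) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical index format: file name -> (offset, length) inside a merged blob.
Index = Dict[str, Tuple[int, int]]


@dataclass
class Buffer:
    capacity: int
    data: bytearray = field(default_factory=bytearray)
    index: Index = field(default_factory=dict)

    @property
    def free(self) -> int:
        return self.capacity - len(self.data)

    def add(self, name: str, payload: bytes) -> None:
        # Record where the small file sits inside the merged blob.
        self.index[name] = (len(self.data), len(payload))
        self.data.extend(payload)


class MultiBufferStore:
    """Packs small files into several buffers before writing them out."""

    def __init__(self, n_buffers: int, capacity: int,
                 write_blob: Callable[[bytes, Index], None]) -> None:
        self.capacity = capacity
        self.buffers: List[Buffer] = [Buffer(capacity) for _ in range(n_buffers)]
        self.write_blob = write_blob  # assumed stand-in for one HDFS write

    def put(self, name: str, payload: bytes) -> None:
        if len(payload) > self.capacity:
            # Too large to merge: write it through on its own.
            self.write_blob(payload, {name: (0, len(payload))})
            return
        fitting = [b for b in self.buffers if b.free >= len(payload)]
        if fitting:
            # Assumed selection rule: best fit (least free space that still fits).
            target = min(fitting, key=lambda b: b.free)
        else:
            # No buffer has room: flush the fullest one and reuse it.
            target = min(self.buffers, key=lambda b: b.free)
            self._flush(target)
        target.add(name, payload)

    def _flush(self, buf: Buffer) -> None:
        if buf.data:
            self.write_blob(bytes(buf.data), dict(buf.index))
        buf.data = bytearray()
        buf.index = {}

    def close(self) -> None:
        # Write out whatever is still buffered.
        for buf in self.buffers:
            self._flush(buf)
```

With two 10-byte buffers, five small files totalling 23 bytes are written out in three merged blobs rather than five separate writes, which is the kind of reduction in HDFS accesses the abstract describes. A real middleware would also need to persist the per-blob index so that individual files can be located on read.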
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP333; TP391.3
Article ID: 2123688
Article link: http://www.lk138.cn/kejilunwen/jisuanjikexuelunwen/2123688.html