Technical Analysis and Application of a Hadoop-Based Distributed File System

Published: 2018-07-17 07:47

[Abstract]: With the rapid development of the Internet (mainly the mobile Internet) and the emerging Internet of Things, we live in an era of data explosion. According to IDC estimates, 1.8 ZB of data was generated and created worldwide in 2011, and the global volume of information doubles every two years. Data on this scale poses a major challenge for storage and management. IDC's report also points out that the growth of global storage capacity lags far behind the growth of data. With current storage technology it is very difficult to keep this much data on a single device, and doing so would greatly complicate later analysis. Storing data across multiple devices is therefore the preferred approach to massive data storage today. Once data is spread over multiple storage devices, a distributed file system is needed to manage those devices so that they work together and give users better data access performance. The Hadoop Distributed File System (HDFS), which is similar to the Google File System (GFS), is a good solution to the demand for massive data storage. First, it is free and open source and has already been deployed on a large number of nodes with strong results. Second, HDFS offers high fault tolerance, high reliability, high scalability and high throughput. These properties provide a safe storage environment for massive data and make it much easier to build applications that process very large data sets. HDFS also integrates well with the MapReduce programming model and gives applications high-throughput data access. This thesis first surveys, in chronological order, the typical distributed file systems of each era and their characteristics, and then analyzes the architecture and operating principles of HDFS in detail.
Based on a study of HDFS high availability, we combine the advantages of the BackupNode and AvatarNode schemes to design a highly available distributed file system, which we call HADFS. This file system not only provides a hot standby for the NameNode, but also switches to the standby node automatically when the NameNode fails, without users noticing the switch. Finally, using HDFS as the underlying storage layer, we design a cloud disk system that supports file upload, file download, folder creation and file deletion. The system is built on the SSH framework and uses the WebDAV protocol to transfer data to and from HDFS, which cleanly separates the cloud disk front end from the underlying storage.
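The transparent switch to a standby node described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the HADFS implementation: `FakeNameNode`, `FailoverClient`, `NameNodeDown` and the path and block names are all hypothetical, and the sketch assumes the failover happens in a client-side wrapper that retries a failed request on the standby.

```python
class NameNodeDown(Exception):
    """Raised by a simulated NameNode that cannot serve requests."""


class FakeNameNode:
    """Hypothetical stand-in for a metadata server; not a real Hadoop class."""

    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace  # maps a file path to its list of block ids
        self.alive = True

    def get_block_locations(self, path):
        if not self.alive:
            raise NameNodeDown(self.name)
        return self.namespace[path]


class FailoverClient:
    """Sends each request to the active node and retries once on the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.failovers = 0

    def get_block_locations(self, path):
        try:
            return self.active.get_block_locations(path)
        except NameNodeDown:
            # Promote the hot standby; the caller never sees the error.
            self.active, self.standby = self.standby, self.active
            self.failovers += 1
            return self.active.get_block_locations(path)


# Hot standby assumption: both nodes already hold the same namespace image.
namespace = {"/user/demo/report.txt": ["blk_1", "blk_2"]}
primary = FakeNameNode("nn1", dict(namespace))
backup = FakeNameNode("nn2", dict(namespace))
client = FailoverClient(primary, backup)

before = client.get_block_locations("/user/demo/report.txt")
primary.alive = False  # simulate a NameNode crash
after = client.get_block_locations("/user/demo/report.txt")
print(before == after, client.active.name, client.failovers)  # True nn2 1
```

The caller issues the same request before and after the crash and receives the same answer, which is the sense in which the switch is invisible to the user. A real system would also need to keep the standby's metadata synchronized and to detect failures reliably, which this sketch leaves out.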
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333;TP316.4

