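Topic Discovery and Trend Analysis of Educational Informatization on Primary and Secondary School Websites
Published: 2018-04-16 14:21
Topic of this thesis: educational informatization + hot topic discovery; Reference: Nanjing Normal University, 2016 master's thesis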
[Abstract]: Educational informatization is an important indicator of how far education has developed in a country or region. With the growth of Internet technology and strong demand for educational informatization, primary and secondary schools in China have set up school websites as carriers and platforms for publicity and communication. These websites publish a large volume of frequently updated news reports, and discovering educational informatization topics in this mass of data quickly and effectively, then tracking them continuously, is a current research focus. Building on topic discovery, this thesis proposes an educational informatization topic discovery system that can process large amounts of data and mine latent knowledge from the information stream. The system has two main parts: local topic detection and topic discovery. Local topic detection uses pattern matching to filter for topics related to educational informatization. Topic discovery applies incremental hierarchical clustering to the local topics, representing latent knowledge as a hierarchy of topics, each of which contains a series of related documents. The main research work is as follows:
1. The collection and storage of large volumes of unstructured data. Web page data is updated frequently and is large in volume. The thesis addresses this by building a Hadoop distributed cluster and customizing the web crawler Nutch. Combining the distributed cluster with Nutch addresses the problem of crawl speed, and the HBase distributed database simplifies the storage of large amounts of unstructured web page data.
2. An information extraction method for primary and secondary school websites. Based on the structural characteristics of these pages, the method combines the open-source toolkit Jsoup, pattern matching, and a line-block distribution function. Jsoup extracts tag information such as title, keywords, and description; pattern matching extracts the page's publication time; the line-block distribution function extracts the body text. A Java class is then built for each web page from the extracted information.
3. An in-depth study and analysis of the MapReduce distributed programming model. To handle computation over large volumes of data, the TF-IDF formula, cosine similarity (the cosine of the angle between vectors), and the clustering algorithm are redesigned to run on the MapReduce programming model, laying the foundation for the whole topic discovery process.
Finally, experiments were run on data from primary and secondary school websites and from the China Education Informatization website, and the results were analyzed in terms of topic frequency over time and trends in topic content. The results show that, compared with the China Education Informatization website, educational informatization topics appear on school websites with a slight delay and their content is more scattered, but the overall trend is consistent, which also indicates that the proposed method is effective.
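The extraction step described in item 2 can be illustrated with a minimal sketch. It is written in Python with regular expressions only, not the thesis's Java and Jsoup implementation, and every name in it (DATE_RE, extract_publish_date, extract_body, block_size, threshold, SAMPLE) is hypothetical. The date pattern and the simplified line-block heuristic (strip tags, then keep the densest contiguous run of text lines) are assumptions, since the abstract does not give the actual patterns or parameters.

```python
import re

# Assumed date formats: 2016-03-05, 2016/3/5, 2016.3.5, 2016年3月5日.
DATE_RE = re.compile(r"(\d{4})[-/.年](\d{1,2})[-/.月](\d{1,2})")


def extract_publish_date(text):
    """Return the first date found in text as 'YYYY-MM-DD', or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def extract_body(html, block_size=3, threshold=40):
    """Simplified line-block heuristic for locating the body text.

    A block is block_size consecutive lines; its size is the number of
    visible characters it holds. The body is taken to be the run of
    consecutive blocks with size >= threshold that holds the most text.
    """
    # Remove script/style elements first, then every remaining tag.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"<[^>]*>", "", html)
    lines = [line.strip() for line in text.splitlines()]

    sizes = [sum(len(line) for line in lines[i:i + block_size])
             for i in range(len(lines) - block_size + 1)]

    best, best_chars, start = (0, 0), 0, None
    for i, size in enumerate(sizes + [0]):  # trailing 0 closes an open run
        if size >= threshold and start is None:
            start = i
        elif size < threshold and start is not None:
            end = i - 1 + block_size  # exclusive line index covered by the run
            chars = sum(len(line) for line in lines[start:end])
            if chars > best_chars:
                best, best_chars = (start, end), chars
            start = None
    return "\n".join(line for line in lines[best[0]:best[1]] if line)


SAMPLE = """<html><head><title>News</title>
<script>var tracker = "ignored, even though it is long enough to look like text";</script>
</head><body>
<div id="nav">
<a href="/">Home</a>
</div>
<div id="main">
<div class="article">
<p>Our school opened a new digital classroom this week for all grades.</p>
<p>Teachers were trained to use the interactive whiteboards and tablets.</p>
<p>Published: 2016/3/5</p>
</div>
</div>
<div id="footer">Copyright</div>
</body></html>"""

body = extract_body(SAMPLE)
published = extract_publish_date(body)
```

Because each block spans block_size lines, short lines directly next to the body (here the "Published" line) are swept into the result, so the threshold and block size would need tuning against real school pages.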
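The similarity computation named in item 3 can be sketched in the same hedged way. The code below runs in memory on a single machine and does not reproduce the thesis's MapReduce redesign. It also substitutes a flat single-pass incremental clustering for the incremental hierarchical clustering the thesis describes. The function names, the idf form log(N/df), the 0.2 threshold, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def single_pass_cluster(vectors, threshold=0.2):
    """Put each vector in the most similar existing topic, else open a new one."""
    centroids, members = [], []
    for idx, vec in enumerate(vectors):
        sims = [cosine(vec, centroid) for centroid in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            members[best].append(idx)
            # The centroid is kept as a running sum; cosine ignores its scale.
            for term, weight in vec.items():
                centroids[best][term] = centroids[best].get(term, 0.0) + weight
        else:
            centroids.append(dict(vec))
            members.append([idx])
    return members


# Toy, pre-tokenized documents (hypothetical data, not from the thesis).
DOCS = [
    ["smart", "classroom", "tablet", "lesson"],
    ["smart", "classroom", "whiteboard", "lesson"],
    ["sports", "meeting", "running", "race"],
    ["sports", "meeting", "relay", "race"],
]

vectors = tf_idf_vectors(DOCS)
topics = single_pass_cluster(vectors)
```

In this toy run the first two documents share three of their four terms and fall into one topic, while the sports documents form a second. In a true streaming setting the document frequencies would have to be updated as new pages arrive, which this batch sketch does not attempt.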
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: G434
Article ID: 1759294
Article link: http://www.lk138.cn/jiaoyulunwen/jiaoyutizhilunwen/1759294.html