Topic Discovery and Trend Analysis of Educational Informatization on Primary and Secondary School Websites

Published: 2018-04-16 14:21

Topics: educational informatization + hot topic discovery; Source: Nanjing Normal University, master's thesis, 2016

[Abstract]: Educational informatization is an important indicator of how far education has developed in a country or region. With the development of Internet technology and strong demand for educational informatization, primary and secondary schools in China have set up school websites as a vehicle and platform for publicity and communication. Given the large volume of frequently updated news reports on these sites, quickly and effectively discovering topics related to educational informatization in massive data, and tracking them continuously, is a current research focus. Building on topic discovery, this thesis proposes an educational informatization topic discovery system that can handle large amounts of data and mine latent knowledge from the information stream. The system consists of two main parts: local topic detection and topic discovery. Local topic detection filters for topics related to educational informatization by pattern matching. Topic discovery applies incremental hierarchical clustering to the local topics, representing latent knowledge as a hierarchy of topics, each of which contains a set of related documents. The main research work is as follows:
1. The collection and storage of large volumes of unstructured data are addressed. Web page data is updated frequently and is large in volume; the thesis handles this by building a Hadoop distributed cluster and customizing the Nutch web crawler. Combining the distributed cluster with Nutch addresses the problem of data collection speed, and the HBase distributed database simplifies the storage of large amounts of unstructured web page data.
2. An information extraction method for primary and secondary school websites is proposed. Based on the structural characteristics of these pages, the thesis combines the open-source toolkit Jsoup, pattern matching, and a line-block distribution function to extract web page information. Jsoup is mainly used to extract tag information such as title, keywords, and description; pattern matching is mainly used to extract the publication time; and the line-block distribution function extracts the main body text. The information extracted from each web page is then wrapped in a Java class.
3. The MapReduce distributed programming model is studied and analyzed in depth. To handle large-scale computation, the TF-IDF formula, cosine similarity, and the clustering algorithm are redesigned to run on the MapReduce programming model, laying the foundation for the whole topic discovery process.
Finally, experiments are run on data from primary and secondary school websites and from Chinese educational informatization websites, and the results are analyzed in terms of topic frequency over time and trends in topic content. The results show that, compared with the Chinese educational informatization websites, topics related to educational informatization appear on school websites with a slight delay and are more scattered in content, but the overall trend is consistent, which indicates that the proposed method is effective.
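
As a rough illustration of the TF-IDF weighting and cosine similarity named in the abstract, the following single-machine Java sketch computes both for a tiny in-memory corpus. It is not the thesis's MapReduce implementation: the class name `TfIdfCosineSketch`, the method names, the sample tokens, and the particular TF-IDF variant (term count over document length, natural-log IDF) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and the TF-IDF variant are assumptions,
// not taken from the thesis.
class TfIdfCosineSketch {

    // Build one sparse TF-IDF vector per document (a document is a token list).
    // Assumed weighting: tf = count / docLength, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents in which each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), tf * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either has zero norm.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical pre-tokenized news items.
        List<List<String>> docs = List.of(
            List.of("smart", "campus", "platform"),
            List.of("smart", "campus", "training"),
            List.of("sports", "meeting", "results"));
        List<Map<String, Double>> v = tfIdf(docs);
        System.out.println("doc0 vs doc1: " + cosine(v.get(0), v.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(v.get(0), v.get(2)));
    }
}
```

Documents that share terms score above zero and documents with no terms in common score exactly zero, which is the signal a clustering step could use to group pages into topics. In a MapReduce setting the document-frequency count and the per-document weighting would typically be split into separate jobs; how the thesis partitions them is not stated in the abstract.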
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: G434

Article No.: 1759294


Article link: http://lk138.cn/jiaoyulunwen/jiaoyutizhilunwen/1759294.html
