
Research on Semantic-Based Text Clustering Algorithms

發(fā)布時(shí)間:2018-04-09 17:44

Topic: text clustering. Entry point: continuous word vectors. Source: Beijing Jiaotong University, 2017 master's thesis.


【Abstract】: With the rapid development of information technology, online data is growing exponentially, and quickly and accurately extracting target information from massive network resources has become a pressing problem. Text clustering, an important text mining technique spanning data mining, machine learning, and natural language processing, has emerged in response. The vector space model is widely used in text clustering research because it is simple and efficient. However, because the traditional vector space model takes the words of a text directly as representation features, it ignores the semantic relations that may hold between words and therefore loses semantic information. To address this, some researchers have proposed identifying ambiguous words and synonyms by mapping each word, through word-sense disambiguation, to the WordNet concept corresponding to its sense. Analyzing these methods, we found shortcomings in their disambiguation strategies. This thesis therefore proposes a semantic disambiguation algorithm based on continuous word vectors, which applies a neural network language model to mine the semantic similarity between concepts and their contexts in depth, thereby improving disambiguation accuracy. By applying this algorithm to cluster analysis, the thesis implements a text clustering algorithm based on continuous-word-vector semantic disambiguation.

Because the WordNet ontology contains a large amount of semantic knowledge organized in structured form, a number of WordNet-based text representations aimed at enriching textual semantics have been proposed and applied to text clustering. However, the semantic information in text data is complex and diverse, and WordNet contains over one hundred thousand concepts, so these methods generally suffer from excessively high-dimensional text vectors. To address this, the thesis proposes a feature dimensionality reduction algorithm based on concept clusters, which performs coarse-grained feature extraction via concept clustering in order to reduce the dimensionality of the text representation. The hardest and most critical problem in this algorithm is obtaining semantic representations of concepts for the subsequent concept clustering. Building on the demonstrated effectiveness of neural network language models for semantic feature extraction, the thesis encodes the gloss (definition) relations among WordNet concepts into a concept corpus and uses a neural network language model to learn concept representations from concept co-occurrences in that corpus. Combining the proposed continuous-word-vector disambiguation algorithm with the concept-cluster dimensionality reduction algorithm, the thesis implements a text clustering algorithm based on continuous word vectors and concept clusters, intended to improve clustering accuracy while also improving efficiency. Experimental comparison with several classical text clustering algorithms shows that the proposed algorithm not only effectively improves clustering accuracy but also resolves the high-dimensionality problem of text representation.
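The disambiguation step described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: it assumes each candidate WordNet sense already has a vector (in the thesis these come from a neural network language model) and picks the sense closest, by cosine similarity, to the averaged embedding of the context words. All names and vectors below are invented toy values.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(context_words, sense_vectors, word_vectors):
    """Pick the sense whose vector is closest to the mean context vector."""
    ctx = np.mean([word_vectors[w] for w in context_words], axis=0)
    return max(sense_vectors, key=lambda s: cosine(sense_vectors[s], ctx))

# Made-up 3-d "embeddings", not real trained vectors.
word_vectors = {
    "money":   np.array([1.0, 0.1, 0.0]),
    "deposit": np.array([0.9, 0.2, 0.1]),
    "river":   np.array([0.0, 0.1, 1.0]),
}
# Hypothetical vectors for two WordNet senses of "bank".
sense_vectors = {
    "bank.n.01(financial)": np.array([1.0, 0.0, 0.1]),
    "bank.n.09(riverside)": np.array([0.1, 0.0, 1.0]),
}

print(disambiguate(["money", "deposit"], sense_vectors, word_vectors))
# → bank.n.01(financial)
```

With a river-related context (e.g. `["river"]`) the same function selects the riverside sense, which is the behavior the thesis's algorithm aims for at scale with learned concept vectors.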
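The concept-cluster dimensionality reduction can likewise be sketched: cluster the concept vectors, then count each document's concepts per cluster, so the document vector has one dimension per cluster rather than one per concept. The k-means below is a deliberately naive version (first-k initialization, fixed iteration count); concept names and 2-d vectors are invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=20):
    # Naive k-means: initialize centers with the first k points.
    centers = X[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy concept vectors: two "vehicle" concepts, two "fruit" concepts.
concepts = ["car", "truck", "apple", "pear"]
vectors = np.array([[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]])

labels = kmeans(vectors, k=2)
cluster_of = dict(zip(concepts, labels))

def doc_features(doc_concepts, cluster_of, k=2):
    # Coarse-grained representation: one count per concept cluster.
    feats = np.zeros(k)
    for c in doc_concepts:
        feats[cluster_of[c]] += 1
    return feats

print(doc_features(["car", "truck", "apple"], cluster_of))
# → [1. 2.]  (one fruit-cluster concept, two vehicle-cluster concepts)
```

The document vector shrinks from one dimension per WordNet concept (potentially over 100,000) to one per cluster, which is the high-dimensionality fix the abstract describes.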
【Degree-granting institution】: Beijing Jiaotong University
【Degree level】: Master
【Year awarded】: 2017
【CLC number】: TP391.1

