A Text Classification Algorithm Based on Statistical Manifold Learning

Published: 2018-03-23 17:11

Topic: text classification　Focus: manifold learning　Source: University of Science and Technology of China, 2017 master's thesis

[Abstract]: Text is a common form of data. People use text every day as an information carrier to communicate with others, and massive amounts of text data are generated on the Internet at every moment. Text classification plays a major role in tasks such as information retrieval, data mining, and sentiment analysis. According to how features are extracted, text classification algorithms fall into three broad categories: statistics-based algorithms, semantic-similarity-based algorithms, and deep-learning-based algorithms. Common statistics-based algorithms include term frequency-inverse document frequency (TF-IDF) and naive Bayes. These methods treat words as feature items and word occurrence counts as weights, represent the text as a feature vector, and finally apply a classifier. They assume that similar texts share many identical words, which ignores the semantic similarity between different words. Semantic-similarity-based methods, such as topic models, usually measure the similarity of texts from their topic information, but they cannot clearly capture the topic diversity of words and texts. In recent years deep learning methods have attracted the attention of many researchers, but these methods, such as convolutional neural networks and recurrent neural networks, also have shortcomings, for example the vanishing gradient problem and the time cost of training a large number of parameters. This thesis proposes a text classification algorithm based on statistical manifold learning, which provides a probabilistic model representation of text based on latent topic distributions. The model assumes that the words under the same topic follow a Gaussian distribution, so that a text is represented as a Gaussian mixture model, and statistical manifold learning is then used to measure the distance between texts. The main work of the thesis includes: (1) Starting from the generative process of text, a probabilistic model for text representation is proposed. Each topic is represented as a Gaussian distribution, and a text is represented as a Gaussian mixture model. This probabilistic model describes the topic diversity of texts and words well. (2) By using a probabilistic model to describe the topic distribution over a text, the computational time complexity of text modeling is reduced to O(n), where n is the number of words in the text. The training speed and corpus dependence problems of topic models are alleviated. (3) The distance between text probability models is measured with statistical manifold learning, which offers a new approach to measuring probabilistic models. (4) In the experimental part, three groups of experiments on different tasks verify the effectiveness of the proposed algorithm and the ability of the Gaussian mixture model to describe the distribution of word vectors under mixed topics.
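The pipeline described in the abstract (topics as Gaussians over word vectors, a text as a mixture of those Gaussians, and a distance between the resulting probability models) can be illustrated with a minimal sketch. The abstract does not say how the Gaussians are estimated or which manifold distance is used, so everything concrete here is an assumption made for illustration: the toy 2-D `WORD_VECTORS`, the fixed isotropic `TOPICS`, and the choice of the Fisher-Rao geodesic distance between mixture weights, which is one standard distance from information geometry and not necessarily the one used in the thesis. Under these assumptions the mixture weights of a text are computed in a single pass over its n words.

```python
import math

# Hypothetical 2-D word vectors; a real system would load pretrained embeddings.
WORD_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (1.0, 0.0),
    "vote": (0.1, 0.9),
    "party": (0.2, 0.8),
    "election": (0.0, 1.0),
}

# Hypothetical topics: each is an isotropic Gaussian (mean, variance)
# in the embedding space, shared by all documents.
TOPICS = [((0.9, 0.1), 0.05), ((0.1, 0.9), 0.05)]


def gaussian_pdf(x, mean, var):
    """Density of an isotropic Gaussian with the given mean and variance."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * var) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * var)) / norm


def mixture_weights(words):
    """Single pass over the words: average the per-word topic responsibilities."""
    totals = [0.0] * len(TOPICS)
    count = 0
    for word in words:
        vec = WORD_VECTORS.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words
        dens = [gaussian_pdf(vec, mean, var) for mean, var in TOPICS]
        z = sum(dens)
        if z == 0.0:
            continue  # word is too far from every topic to be informative
        for k, d in enumerate(dens):
            totals[k] += d / z
        count += 1
    if count == 0:
        raise ValueError("no usable words in document")
    return [t / count for t in totals]


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards against rounding above 1


# Usage: two sports-like texts and one politics-like text.
sports_a = mixture_weights(["goal", "team", "match"])
sports_b = mixture_weights(["match", "team"])
politics = mixture_weights(["vote", "election", "party"])
print(fisher_rao_distance(sports_a, sports_b))
print(fisher_rao_distance(sports_a, politics))
```

With these toy inputs the two sports-like texts should come out much closer to each other than either is to the politics-like text. A nearest-neighbor classifier over such distances would be one simple way to turn the representation into a text classifier, again as an assumption and not a description of the thesis's experiments.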
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1


Article ID: 1654398


Article link: http://www.lk138.cn/shoufeilunwen/xixikjs/1654398.html
