Research on the Recognition of Two-Character Low-Frequency Out-of-Vocabulary Words
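Keywords: low frequency; two-character words; out-of-vocabulary words; morpheme properties (素性); web retrieval. Source: Nanjing Normal University, master's thesis, 2012. Document type: degree thesis.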
[Abstract]: Out-of-vocabulary (OOV) words are the main factor limiting the accuracy of automatic Chinese word segmentation. Low-frequency words are the hardest OOV words to recognize, and two-character low-frequency OOV words make up an important part of them. This thesis therefore focuses on recognizing two-character low-frequency OOV words efficiently, combining several statistical and rule-based methods, with some success.
To make recognition more efficient and the experimental results easier to analyze statistically, we first preprocess the text in three steps: (1) segment the text and extract the segmentation fragments; (2) recognize named entities, an important class of OOV words; (3) recognize some of the multi-character OOV words. We then look for low-frequency two-character OOV words among the remaining fragments, using a combination of statistics and rules: mutual information, word/non-word formation probability, neighboring-character entropy, and morpheme-property (素性) combination. Although the experimental results are only moderate, the approach is still practically useful for assisting the recognition and extraction of new words, and it can greatly reduce the burden of manual identification.
During recognition we found that the fuzziness of the definition of a word and inconsistent segmentation in the corpus are important reasons why two-character OOV words are hard to recognize correctly. We therefore studied this problem in depth and proposed a new, reasonable definition of two-character words. We then annotated a small test corpus ourselves; with the same recognition methods, both precision and recall improved considerably.
Finally, we proposed and implemented a web-based discrimination method that quantifies the property of being "tightly bound and stably used". This method performed well in experiments on judging two-character low-frequency OOV words, with the F-measure reaching 86% at best. This suggests that web resources may be a breakthrough point for improving automatic word segmentation, and automatic OOV word recognition in particular.
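Two of the statistics named in the abstract, mutual information and neighboring-character entropy, can be illustrated with a short sketch. The abstract does not give the thesis's exact formulas, so the code below assumes the common textbook definitions (pointwise mutual information with maximum-likelihood estimates, and Shannon entropy over the characters adjacent to a candidate). All function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def bigram_statistics(sentences):
    """Count characters, adjacent character pairs, and each pair's neighbours."""
    char_counts = Counter()
    pair_counts = Counter()
    left_neighbours = defaultdict(Counter)
    right_neighbours = defaultdict(Counter)
    for sentence in sentences:
        char_counts.update(sentence)
        for i in range(len(sentence) - 1):
            pair = sentence[i:i + 2]
            pair_counts[pair] += 1
            if i > 0:
                left_neighbours[pair][sentence[i - 1]] += 1
            if i + 2 < len(sentence):
                right_neighbours[pair][sentence[i + 2]] += 1
    return char_counts, pair_counts, left_neighbours, right_neighbours


def pointwise_mutual_information(pair, char_counts, pair_counts):
    """log2(P(xy) / (P(x) * P(y))); assumed definition, not taken from the thesis."""
    if pair_counts[pair] == 0:
        return float("-inf")  # the pair never occurs in the corpus
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    p_xy = pair_counts[pair] / total_pairs
    p_x = char_counts[pair[0]] / total_chars
    p_y = char_counts[pair[1]] / total_chars
    return math.log2(p_xy / (p_x * p_y))


def neighbour_entropy(neighbour_counter):
    """Shannon entropy in bits of the distribution of adjacent characters."""
    total = sum(neighbour_counter.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in neighbour_counter.values()
    )
```

Under these definitions, a candidate pair with high mutual information and high entropy on both sides is more likely to be a free-standing word. For low-frequency candidates the counts are small and the estimates are unreliable, which fits the abstract's observation that the statistical methods alone gave only moderate results.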
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: H08