[Abstract]: With the development of the Internet, more and more netizens are accustomed to obtaining information online, and more and more enterprises have begun trying to obtain experience-related information from the network. The Internet has become the "fourth medium" after newspapers, radio, and television. Because of its convenience, it has become people's primary source of information. At the same time, the emergence of social media such as Weibo, WeChat Moments, Facebook, and Twitter lets people publish their opinions in large volume. These opinions matter to enterprises: they help an enterprise learn what users think of its products and what its competitors think of its products. This information can help cinemas predict box-office revenue. It can also help people better understand public opinion in their daily lives. Sentiment analysis is a technique used to accomplish the tasks above. It mainly addresses the question of who holds what opinion about which aspect of what, and involves a subject (the person), an object (the feature), and an opinion (the sentiment word). Sentiment analysis is also called opinion finding (opinion find): it extracts subjective information from large amounts of text, for example, a person's evaluation of something or a person's opinion on a viewpoint. Building a sentiment lexicon is an important part of sentiment analysis.
This paper studies two main questions. First, a sentiment lexicon is tied to a particular domain: lexicons for different domains differ markedly, and the same word may carry different sentiment in different lexicons. How can a financial sentiment lexicon be built automatically? Second, the words in a sentiment lexicon do not all carry the same sentiment strength. How should these sentiment words be ranked?
This paper combines natural language processing techniques with finance-related techniques to address these problems. It first studies the underlying natural language processing techniques, then builds a system on that theoretical basis, and finally examines through experiments how different parameters affect the sentiment lexicon.
This paper consists of five chapters:
Chapter 1, Introduction, reviews the current state of research on this topic by foreign scholars and describes the research methods and approach of this paper.
Chapter 2, Related Knowledge, introduces commonly used natural language processing techniques, as well as commonly used text classification techniques and their mathematical principles.
Chapter 3, System Implementation, describes the development and implementation of the system: the overall Lucene-based system, word segmentation, indexing, and automatic text generation.
Chapter 4, Algorithm and Experiments, presents the Trend-PLSA algorithm, which is based on PLSA. The algorithm fuses trend information with PLSA, combining metadata with a probabilistic graphical model to improve the accuracy of the sentiment lexicon. The chapter then describes how different experimental parameters affect the construction of the sentiment lexicon.
Chapter 5, Summary and Outlook, first summarizes the main work and main contributions of this paper, and then proposes new directions and ideas for future research.
This paper uses the following techniques in its research:
First, this paper uses natural language processing techniques. Natural language processing is an interdisciplinary field that combines computer science and linguistics and aims to make machines understand human language. The techniques involved include TF-IDF weighting, topic models, text vectorization, and index construction.
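As an illustration of the TF-IDF weighting mentioned above, the following is a minimal sketch rather than the system's actual implementation. The function name `tf_idf`, the toy documents, and the specific variant (relative term frequency times the natural log of inverse document frequency, without smoothing) are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented documents.
docs = [
    ["profit", "rises", "strongly"],
    ["profit", "falls"],
    ["profit", "warning", "issued"],
]
weights = tf_idf(docs)
```

In this toy corpus "profit" occurs in every document, so its weight is zero, while a word such as "falls" that occurs in only one document receives a positive weight.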
Second, this paper combines qualitative and quantitative techniques. The object of study is sentiment analysis. Classifying sentiment words is itself a qualitative problem: a given word is assigned to a specified class, so it suffices to find the sentiment type of each given sentiment word. At the same time, this paper assigns each sentiment word a quantitative value and ranks all sentiment words by it; the larger the absolute value, the stronger the word's sentiment. The stock price information processed in this paper is quantitative data. Through the relevant algorithms, this quantitative data is converted into qualitative information, which is then used to judge sentiment words. In short, combining qualitative and quantitative methods improves both the correctness and the practicality of the sentiment lexicon.
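The idea of turning quantitative price data into qualitative labels and then ranking words by the absolute value of a signed score can be sketched as follows. This is a simplified stand-in, not the Trend-PLSA algorithm of this paper: the names `trend_label`, `score_words`, and `rank_by_strength`, the zero threshold, the toy samples, and the score (difference of document counts in rising and falling periods divided by their sum) are all assumptions chosen for illustration.

```python
from collections import Counter


def trend_label(price_change, threshold=0.0):
    """Map a quantitative price change to a qualitative label."""
    if price_change > threshold:
        return "up"
    if price_change < -threshold:
        return "down"
    return "flat"


def score_words(samples, threshold=0.0):
    """Score each word in [-1, 1] from (tokens, price_change) samples."""
    up, down = Counter(), Counter()
    for tokens, change in samples:
        label = trend_label(change, threshold)
        if label == "up":
            up.update(set(tokens))
        elif label == "down":
            down.update(set(tokens))
    # Positive score: the word appears mostly with rising prices.
    return {
        word: (up[word] - down[word]) / (up[word] + down[word])
        for word in set(up) | set(down)
    }


def rank_by_strength(scores):
    """Sort words by descending absolute score, ties alphabetically."""
    return sorted(scores, key=lambda w: (-abs(scores[w]), w))


# Hypothetical texts paired with the price change of the matching period.
samples = [
    (["surge", "profit"], 0.03),
    (["surge", "record"], 0.02),
    (["plunge", "profit"], -0.04),
    (["plunge", "loss"], -0.01),
    (["profit", "steady"], 0.0),
]
scores = score_words(samples)
ranking = rank_by_strength(scores)
```

Here the sign of a score plays the role of the qualitative class, and its absolute value gives the ordering. A word such as "profit", which appears equally often with rising and falling prices, ends up at the bottom of the ranking.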
Through the implementation, this paper finds that the proposed sentiment word generation algorithm is highly practical. Compared with other sentiment word extraction algorithms, it achieves higher accuracy.
The innovations of this paper, described below, lie mainly in algorithm and technique.
First, this paper does not require seed words to be selected in advance. Seed words are simply preselected words. Conventional methods for generating a sentiment lexicon first select a number of seed words. Without good seed words, any sentiment lexicon is an illusion, like flowers in water or the moon in a mirror. Good seed words are the guarantee of a high-quality sentiment lexicon, and a good sentiment lexicon generalizes well. For a domain-specific sentiment lexicon, choosing seed words requires real domain expertise, and from an economic point of view hiring such experts is expensive. The seed words should also be both general and strongly sentiment-bearing, but these two requirements usually conflict, so the task is not easy even for experts. The algorithm proposed in this paper is an unsupervised learning algorithm that needs no sentiment-related words in advance, that is, no seed words. This greatly reduces the cost of building a sentiment lexicon and speeds up its generation.
Second, the sentiment of words changes over time: new sentiment words keep emerging, and old words take on new sentiment. Existing algorithms lack the ability to adapt automatically as time passes. The system designed in this paper continuously obtains stock price data from the Internet and automatically matches it with text, so it can keep generating new sentiment words over time. The resulting sentiment lexicon is therefore highly up to date.
Third, the same word carries different sentiment in different domains, and sentiment words rank differently across domains. This paper ranks all sentiment words with a ranking algorithm.
Finally, this paper proposes a trend latent semantic analysis algorithm (Trend-PLSA) based on the latent semantic analysis algorithm (PLSA). It also experiments with the naive Bayes algorithm and compares its results with those of the latent semantic analysis algorithm. The results show that, compared with other algorithms, the proposed algorithm makes better use of stock price information, classifies sentiment words more accurately, and builds a better sentiment lexicon.
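For orientation, the baseline PLSA model that Trend-PLSA builds on can be fitted with the standard EM procedure, sketched below in plain Python. This is ordinary PLSA only: how the trend information enters the model is not specified in this abstract and is not reproduced here. The function names `normalize` and `plsa`, the random initialization, the iteration count, and the toy count matrix are assumptions for illustration.

```python
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iters=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of rows).

    Returns P(z|d) as one row per document and P(w|z) as one row per topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialization of both distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iters):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    expected = counts[d][w] * post[z]
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


# Hypothetical counts over the vocabulary
# ["rally", "gain", "surge", "slump", "loss", "plunge"].
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 1, 3, 2],
]
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

Each row of `p_z_d` is a topic distribution for one document and each row of `p_w_z` is a word distribution for one topic. A trend-aware variant would additionally tie the latent topics to rising or falling price periods.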


