Research and Implementation of a Web Page Classification System Based on User Behavior Analysis
Source: Beijing University of Posts and Telecommunications, master's thesis, 2011. Document type: degree thesis
Keywords: user behavior analysis; automatic web page classification; Chinese word segmentation; CHI statistic; SVM
[Abstract]: In recent years, with the rapid growth of the Internet, text carried by web pages has proliferated and the volume of online information has grown explosively. Finding the information one needs has become like looking for a needle in a haystack, and passive-mode search engines can no longer meet users' needs. How to satisfy users' demands for personalized service in an active mode has become one of the challenging problems facing new network service systems. Taking user behavior analysis and personalized service as its premise, this thesis studies and improves the key techniques of web page classification and implements a text classification system adapted to web pages. The key techniques studied are the following.

First, Chinese word segmentation. The thesis examines existing segmentation methods and proposes a segmentation algorithm, suited to the characteristics of web page text, that combines statistics with maximum matching. The method can recognize newly coined words in web pages and merges frequently co-occurring sequences of single characters. It thus avoids missing new words that contribute strongly to classification, while merging single characters reduces the dimensionality of the feature space and lowers computational complexity.

Second, feature selection and weighting. After surveying feature selection and weighting algorithms, the thesis adapts the CHI statistic, which is widely regarded as effective, to web page classification, and proposes a CHI feature selection algorithm based on web page structure together with a TD-IDF-CHI weighting algorithm. Experimental results show that these two preprocessing algorithms improve classification accuracy to some extent.

Building on these improved algorithms, the thesis implements a web page classification module and also designs and implements a complete user behavior analysis system. The system consists of three main modules: a data collection and filtering module, a web page classification module, and a result statistics module. Their functions are as follows.

First, the data collection and filtering module. The user attribute information associated with web behavior resides in the headers of HTTP packets, so obtaining it requires parsing HTTP packets and extracting information from them. The thesis describes the HTTP packet parsing flow designed and implemented for this module.

Second, the web page classification module, which is the main object of study. Using the improved segmentation algorithm, the preprocessing algorithms, and the KNN and SVM classifiers, which perform well for classification, it maps web pages to specific categories.

Third, the result statistics module. It aggregates and updates the classification results for the pages a user visits and connects directly to the personalized service system, so that the results of user behavior analysis are applied directly to services such as personalized advertising feedback.

The system supports both online and offline web page classification. Experimental results show that the improved preprocessing algorithms yield a good improvement in classification accuracy, and that the result statistics module also performs well, fully reflecting users' current interests and providing a reference model for research on personalized service systems.
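To make the feature selection and weighting step more concrete, the following is a minimal sketch of the standard CHI (chi-square) statistic for term/category pairs. The abstract does not give the thesis's web-structure-based CHI variant or the exact TD-IDF-CHI formula, so the sketch makes two assumptions: it uses the textbook CHI statistic computed from document counts, and it treats the combined weight as a plain product of term frequency, inverse document frequency, and CHI. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Textbook CHI statistic from a 2x2 contingency table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_scores(docs, labels, category):
    """Return {term: CHI score} for every term with respect to one category."""
    n_docs = len(docs)
    n_in = sum(1 for y in labels if y == category)
    df_all, df_in = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not raw counts
            df_all[term] += 1
            if y == category:
                df_in[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_in[term]
        b = df - a
        c = n_in - a
        d = n_docs - n_in - b
        scores[term] = chi_square(a, b, c, d)
    return scores


def tfidf_chi_weights(tokens, docs, chi):
    """Assumed combination (not the thesis's formula): tf * idf * chi."""
    n_docs = len(docs)
    weights = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in docs if term in doc)
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed idf
        weights[term] = (count / len(tokens)) * idf * chi.get(term, 0.0)
    return weights


# Hypothetical toy corpus of already-segmented pages.
docs = [
    ["football", "match", "goal"],
    ["football", "league", "goal"],
    ["stock", "market", "price"],
    ["stock", "price", "match"],
]
labels = ["sports", "sports", "finance", "finance"]

chi = chi_scores(docs, labels, "sports")
weights = tfidf_chi_weights(docs[0], docs, chi)
```

In this toy corpus a term such as "match", which appears equally often inside and outside the category, receives a CHI score of zero and therefore a zero weight, while category-specific terms such as "football" score highest. A structure-aware variant like the one the abstract describes would presumably adjust the counts according to where a term appears in the page (title, headings, body), but that detail is not specified here.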
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2011
[Classification number]: TP393.092