Topics: temporal information retrieval + query temporal intent; Reference: Jiangsu University, master's thesis, 2017

[Abstract]: The spread of the Internet has brought explosive growth in information resources. This gives users more choices but also makes useful information harder to find, so using search engines to select, from massive amounts of information, the documents that satisfy a user's need has become an important challenge. In recent years the number of web pages and queries containing temporal information has kept growing, and temporal information retrieval (TIR) has become a focus of research. TIR studies how to extract temporal information from web pages, analyze the temporal intent of queries, and build time-aware ranking models in order to improve the retrieval quality of search engines. Queries with temporal intent fall into two kinds. An explicit temporal query contains a temporal expression that states the time constraint directly. An implicit temporal query gives no explicit time criterion, but its temporal intent lies in a specific time interval. According to statistics, more than 7% of web queries carry implicit temporal intent, while about 1.5% contain explicit time constraints, so implicit temporal queries make up the larger share and leave more research to be done. This thesis studies how to analyze the temporal intent of implicit temporal queries and improve retrieval performance. The main work is summarized as follows:
(1) For implicit temporal queries, a method is proposed that combines DBpedia with the top-k ranked documents to analyze query temporal intent. If the query concerns a famous person or a major historical event, the specific time interval obtained by querying DBpedia (a semantic web resource built on Wikipedia) is taken as the temporal intent of the query. For other types of query, the temporal intent is derived from the temporal expressions that occur most frequently in the content of the top-k ranked documents.
(2) Building on the language model, a document ranking model that supports implicit temporal queries is proposed. It accounts for temporal uncertainty when computing the probability that each document generates the query, uses that probability as the document's temporal relevance score, and finally re-ranks the documents by a linear combination of the temporal relevance score and the content relevance score.
(3) The document collection of the Temporal Information Access (Temporalia) task at NTCIR-11 is used as experimental data to evaluate the proposed intent-analysis method and document ranking model. First, the method is compared with several existing methods for analyzing query temporal intent. The results show that analyzing the temporal intent of a query before computing document relevance scores is worthwhile, and that the proposed combination of DBpedia and top-k documents analyzes query temporal intent well. Then, given the query temporal intent, the proposed ranking method is compared with existing time-aware ranking methods. Most metric values of the time-aware ranking models are higher than those of the initial ranking, which considers content relevance only, indicating that including temporal relevance in the retrieval model improves retrieval quality. Compared with the other ranking methods, the proposed language-model-based ranking method performs better.
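The two core steps described in the abstract, inferring temporal intent from the top-k documents and re-ranking by a linear combination of temporal and content scores, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract gives no formulas, so the temporal score here is a simple stand-in (the share of a document's year mentions that match the inferred intent) rather than the language-model probability with temporal uncertainty, and the DBpedia lookup is omitted. The function names and the parameters `k`, `lam`, and `n_years` are hypothetical.

```python
import re
from collections import Counter

# Assumption: temporal expressions are reduced to four-digit years (1000-2099).
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def estimate_intent_years(top_docs, n_years=1):
    """Return the n_years most frequent year mentions in the top-ranked documents."""
    counts = Counter()
    for text in top_docs:
        counts.update(int(y) for y in YEAR_RE.findall(text))
    return [year for year, _ in counts.most_common(n_years)]


def temporal_score(doc_text, intent_years):
    """Stand-in temporal relevance: share of the document's year mentions
    that fall inside the inferred intent (0.0 if it mentions no year)."""
    years = [int(y) for y in YEAR_RE.findall(doc_text)]
    if not years:
        return 0.0
    return sum(y in intent_years for y in years) / len(years)


def rerank(docs, content_scores, k=3, lam=0.5):
    """Re-rank documents given in their initial order.

    docs           -- document texts, initial ranking order
    content_scores -- content relevance scores, parallel to docs
    k              -- number of top documents used to infer the intent
    lam            -- weight of the temporal score in the linear combination
    """
    intent = set(estimate_intent_years(docs[:k]))
    combined = [
        (1 - lam) * content + lam * temporal_score(text, intent)
        for text, content in zip(docs, content_scores)
    ]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return order, combined


if __name__ == "__main__":
    docs = [
        "The 2011 earthquake struck Japan in 2011.",
        "Recovery after 2011 continued.",
        "A 1995 quake hit Kobe.",
        "Report on 2011 tsunami damage in 2011.",
    ]
    order, scores = rerank(docs, [0.9, 0.8, 0.85, 0.5])
    print(order, scores)
```

Setting `lam` to 0 reproduces the content-only initial ranking, which is the baseline the abstract compares against. Larger values let documents whose dates match the inferred intent move up.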


