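Research on Multi-Site Subcellular Localization Prediction Based on Sequence Features
Published: 2018-05-31 15:05
Topics: subcellular localization + multi-label learning; Source: master's thesis, Northeast Normal University, 2017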
[Abstract]: A protein's function is closely tied to its location in the cell: a newly synthesized protein must be transported to a specific organelle (its subcellular location) before it can function correctly. Predicting protein subcellular localization is therefore of great value for determining the function of an unknown protein, for understanding protein interactions and, through them, various biological processes, and for studying the pathogenesis of certain diseases. Traditional experimental techniques such as subcellular fractionation, green fluorescent protein fusion, mass spectrometry, and isotope-coded affinity tags provide fairly accurate localization data, but most of these experiments are expensive and time-consuming, so relying on them alone is usually costly. In recent years, as biological data have become abundant, the interdisciplinary field of bioinformatics has grown rapidly, and more and more researchers use computational techniques to help solve pressing biological problems. Predicting protein subcellular localization with machine learning is one such active topic and is the main research goal of this thesis. After years of effort, machine-learning-assisted localization prediction has produced a series of meaningful results: new computational methods have appeared one after another, prediction accuracy has steadily improved, and prediction platforms have become available, all of which provide valuable information for subsequent protein function analysis. Despite this progress, three areas still need improvement. (1) Most existing methods apply only to binary classification data, whereas in reality many proteins may have one or more subcellular locations; what is needed is a classifier capable of multi-label subcellular localization prediction.
(2) Although some methods have introduced multi-label learning to identify proteins with one or more subcellular locations, their datasets contain too few multi-label proteins. (3) Some predictors use Gene Ontology (GO) features to improve accuracy, but the resulting feature dimension is very large and the extraction process is cumbersome, so an effective dimensionality reduction method is required. Building on a thorough comparative study of current prediction algorithms, this thesis proposes improvements that address these shortcomings and describes them in detail from four aspects: dataset acquisition, protein sequence feature extraction, the prediction algorithm, and performance evaluation. The dataset comes from the widely recognized tool iLoc-Animal; its category "diversity" reaches 1.8922, and the total number of predicted categories is 20. Sequence features are extracted using amino acid composition (AAC) together with the clustering-based LIFT features, which avoids the cumbersome and time-consuming construction of GO features. After comparing commonly used multi-label prediction algorithms and strategies, multi-label K-nearest neighbor was chosen as the prediction algorithm. In the performance testing stage, ten-fold cross-validation was used to evaluate five metrics: precision, accuracy, recall, absolute-true, and absolute-false. The results were compared with the classic algorithm iLoc-Animal.
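To make the amino acid composition (AAC) feature concrete, here is a minimal Python sketch. It assumes the usual definition of AAC as the relative frequency of each of the 20 standard amino acids in a sequence; the abstract does not spell out the thesis's exact preprocessing. The function name `aac_features` and the choice to skip non-standard residue codes are illustrative assumptions, not the author's implementation.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence: str) -> List[float]:
    """Return a 20-dimensional amino acid composition vector.

    Assumption: residues outside the 20 standard codes (e.g. X, B, Z)
    are ignored when counting; the thesis may handle them differently.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Frequency of each residue, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence: two A, one C, one D.
features = aac_features("AACD")
print(features[:3])  # [0.5, 0.25, 0.25]
```

Whatever the sequence length, the result is a fixed 20-value vector, far smaller than the GO-based feature spaces the abstract describes as cumbersome.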
The experimental results show that the proposed method achieves an accuracy of 74.35% and an absolute-true rate of 71.17%, clearly higher than the accuracy (62.28%) and absolute-true rate (45.62%) of iLoc-Animal, and it also outperforms iLoc-Animal on every other evaluation metric. Besides its higher prediction accuracy, the method is simple to implement and responds quickly. It is hoped that this work will inform and advance current research on protein subcellular localization prediction.
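The abstract names its five evaluation measures but does not define them. The sketch below assumes the set-based definitions commonly used for multi-label localization predictors, where each protein has a true label set and a predicted label set and every measure is averaged over proteins. The function name `multilabel_scores`, the handling of empty label sets, and the toy labels are illustrative assumptions rather than the thesis's own code.

```python
from typing import Dict, List, Set


def multilabel_scores(true_sets: List[Set[str]],
                      pred_sets: List[Set[str]],
                      n_labels: int) -> Dict[str, float]:
    """Average five set-based multi-label measures over all proteins.

    n_labels is the total number of location categories (20 in the abstract).
    Assumption: an empty predicted set contributes 0 to precision.
    """
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")
    totals = {"precision": 0.0, "recall": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for truth, pred in zip(true_sets, pred_sets):
        hit = len(truth & pred)      # correctly predicted locations
        union = len(truth | pred)    # all locations mentioned by either set
        totals["precision"] += hit / len(pred) if pred else 0.0
        totals["recall"] += hit / len(truth) if truth else 0.0
        totals["accuracy"] += hit / union if union else 1.0
        totals["absolute_true"] += 1.0 if truth == pred else 0.0
        # Fraction of the label space that was wrongly added or missed.
        totals["absolute_false"] += (union - hit) / n_labels
    n = len(true_sets)
    return {name: value / n for name, value in totals.items()}


# Hypothetical toy data: the first protein is only partly predicted.
truth = [{"nucleus", "cytoplasm"}, {"mitochondrion"}]
pred = [{"nucleus"}, {"mitochondrion"}]
scores = multilabel_scores(truth, pred, n_labels=20)
print(scores["accuracy"], scores["absolute_true"])  # 0.75 0.5
```

Under these definitions absolute-true is the strictest measure, since a protein counts only when its predicted set matches the true set exactly. This is why the gap the abstract reports on that measure (71.17% against 45.62%) is larger than the gap in accuracy.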
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q26; TP181