xbinbzy's personal blog http://blog.sciencenet.cn/u/xbinbzy


How the Ribosomal Database Project (RDP) Classifier Assigns Taxonomy

12582 views | 2015-12-8 16:41 | Personal category: Research articles | System category: Research notes | Tags: 16s, RDP, Classifier

Paper: Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy


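   The RDP Classifier assigns taxonomy with a naive Bayes approach. The key quantities to compute are the word-specific prior probabilities and the genus-conditional probabilities.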

   1) Feature selection. All 8-bp subsequences are used as the features from which the probabilities are computed. The rationale given in the paper: The Classifier uses a feature space consisting of all possible 8-base subsequences (words). Word sizes between 6 and 9 bases were tested in preliminary experiments. Sizes of 8 and 9 bases gave nearly identical results, while sizes of 6 and 7 bases were less accurate, especially with shorter test sequences (not shown). As there are fewer possible words of size 8 than size 9, size 8 was chosen for all further work to reduce memory requirements. The position of a word in a sequence is ignored (that is, where a subsequence occurs does not matter). As with text-based Bayesian classifiers, only those words occurring in the query contribute to the score. A similar word-based classification scheme has been used to search for horizontal gene transfer events in whole-genome sequences.
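
   A minimal Python sketch of this feature extraction may help. The function name `words` and the choice to skip windows containing ambiguous bases (anything other than A, C, G, T) are assumptions made for illustration; the excerpt does not say how such bases are handled.

```python
def words(seq, k=8):
    """Return the set of distinct k-base words in seq.

    A set is used because word positions are ignored.
    Assumption: windows with non-ACGT characters are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }

# Size of the feature space for word sizes 8 and 9 (the memory argument).
print(4 ** 8, 4 ** 9)  # 65536 262144

print(sorted(words("ACGTACGTAC")))  # ['ACGTACGT', 'CGTACGTA', 'GTACGTAC']
```

   Going from size 8 to size 9 quadruples the number of possible words, which is the memory cost the paper avoids.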

   2) Computing the probability of each subsequence. The calculation: Let $W = \{w_1, w_2, \ldots, w_d\}$ be the set of all possible eight-character subsequences (words). From the corpus consisting of $N$ sequences, let $n(w_i)$ be the number of sequences containing subsequence $w_i$. The expected-likelihood estimate (determined with the Jeffreys-Perks law of succession) calculated for each word over the entire corpus with the formula $P_i = [n(w_i) + 0.5]/(N + 1)$ was used as a word-specific prior estimate of the likelihood of observing word $w_i$ in an rRNA sequence. The values 0.5 in the numerator and 1 in the denominator keep the probabilities in the range $0 < P_i < 1$.
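
   The prior estimate can be sketched in a few lines of Python. The names `word_priors` and `toy_corpus` and the three toy sequences are hypothetical and chosen only to make the arithmetic easy to check; real training sequences are far longer.

```python
from collections import Counter

def words(seq, k=8):
    """Distinct k-base words of seq (positions ignored)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def word_priors(corpus, k=8):
    """Return (priors, unseen): P_i = [n(w_i) + 0.5] / (N + 1).

    n(w_i) counts sequences containing w_i, so each sequence
    contributes at most once per word. `unseen` is the value of
    P_i for a word with n(w_i) = 0.
    """
    N = len(corpus)
    n = Counter()
    for seq in corpus:
        n.update(words(seq, k))  # a set, so one count per sequence
    priors = {w: (count + 0.5) / (N + 1) for w, count in n.items()}
    unseen = 0.5 / (N + 1)
    return priors, unseen

toy_corpus = ["ACGTACGTAC", "ACGTACGTTT", "TTTTTTTTTT"]
priors, unseen = word_priors(toy_corpus)
print(priors["ACGTACGT"])  # (2 + 0.5) / (3 + 1) = 0.625
print(unseen)              # (0 + 0.5) / (3 + 1) = 0.125
```

   Even a word present in every sequence gets $(N + 0.5)/(N + 1) < 1$, and an absent word gets $0.5/(N + 1) > 0$, which is the range property stated in the excerpt.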

   3) Computing the conditional probabilities. The strategy: For genus $G$ with a training set consisting of $M$ sequences, let $m(w_i)$ be the number of these sequences containing word $w_i$. The conditional probability that a member of $G$ contains $w_i$ was estimated with the equation $P(w_i \mid G) = [m(w_i) + P_i]/(M + 1)$. Ignoring the dependency between words in an individual sequence, the joint probability of observing from genus $G$ a (partial) sequence, $S$, containing a set of words, $V = \{v_1, v_2, \ldots, v_f\}$ ($V \subseteq W$), was estimated as $P(S \mid G) = \prod_i P(v_i \mid G)$.
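
   As a rough illustration, the two formulas can be written as below. Summing logarithms instead of multiplying raw probabilities is an implementation choice made here to avoid floating-point underflow on long queries; the excerpt does not describe it. The function names and arguments are hypothetical.

```python
import math

def cond_prob(m_wi, M, P_i):
    """P(w_i | G) = [m(w_i) + P_i] / (M + 1)."""
    return (m_wi + P_i) / (M + 1)

def log_joint(query_words, genus_counts, M, priors, unseen_prior):
    """log P(S | G) = sum over query words of log P(v_i | G).

    genus_counts maps word -> m(w), the number of the genus's M
    training sequences containing that word; priors maps word -> P_i.
    Words never seen in the corpus fall back to unseen_prior.
    """
    total = 0.0
    for w in query_words:
        P_i = priors.get(w, unseen_prior)
        total += math.log(cond_prob(genus_counts.get(w, 0), M, P_i))
    return total

# Toy numbers: a genus with M = 3 training sequences.
priors = {"ACGTACGT": 0.625}
genus_counts = {"ACGTACGT": 2}
query = {"ACGTACGT", "CCCCCCCC"}
print(cond_prob(2, 3, 0.625))  # (2 + 0.625) / 4 = 0.65625
print(math.exp(log_joint(query, genus_counts, 3, priors, 0.125)))
```

   Because $P_i > 0$, a word absent from a genus's training set still gets a small nonzero probability, so one unseen word does not drive the whole product to zero.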

   4) Assigning taxonomy from the Bayes score. The classification strategy: By Bayes' theorem, the probability that an unknown query sequence, $S$, is a member of genus $G$ is $P(G \mid S) = P(S \mid G) \times P(G)/P(S)$, where $P(G)$ is the prior probability of a sequence being a member of $G$ and $P(S)$ the overall probability of observing sequence $S$ (from any genus). Assuming all genera are equally probable (equal priors), the constant terms $P(G)$ and $P(S)$ can be ignored. We classify the sequence as a member of the genus giving the highest probability score, but we ignore the actual numerical probability estimate.
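
   Putting the pieces together, a toy end-to-end classifier might look like the sketch below. It is not the RDP implementation: the names `train`, `log_likelihood`, and `classify`, the genus labels, and the 16-base training sequences are all made up for illustration, and log-probabilities are used as a numerical convenience. With equal priors, choosing the genus with the largest $P(S \mid G)$ is the same as choosing the largest $P(G \mid S)$.

```python
import math
from collections import Counter, defaultdict

K = 8

def words(seq, k=K):
    """Distinct k-base words of seq (positions ignored)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(labeled):
    """labeled: list of (genus, sequence) pairs.

    Returns (N, n, m, M): corpus size, corpus-wide word counts n(w),
    per-genus word counts m(w), and per-genus sequence counts M.
    """
    n = Counter()
    m = defaultdict(Counter)
    M = Counter()
    for genus, seq in labeled:
        ws = words(seq)
        n.update(ws)
        m[genus].update(ws)
        M[genus] += 1
    return len(labeled), n, m, M

def log_likelihood(query_words, genus, model):
    """log P(S | G) under the word-independence assumption."""
    N, n, m, M = model
    total = 0.0
    for w in query_words:
        P_i = (n[w] + 0.5) / (N + 1)
        total += math.log((m[genus][w] + P_i) / (M[genus] + 1))
    return total

def classify(seq, model):
    """Genus with the highest score; P(G) and P(S) are dropped."""
    query_words = words(seq)
    genera = list(model[3])
    return max(genera, key=lambda g: log_likelihood(query_words, g, model))

training = [
    ("GenusA", "ACGTACGTACGTACGT"),
    ("GenusA", "ACGTACGTACGAACGT"),
    ("GenusB", "TTGGCCAATTGGCCAA"),
    ("GenusB", "TTGGCCAATTGGCCTA"),
]
model = train(training)
print(classify("ACGTACGTACGT", model))  # GenusA
print(classify("TTGGCCAATTGG", model))  # GenusB
```

   Only the ranking of the scores is used, which matches the paper's remark that the numerical probability estimate itself is ignored.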

   5) Bootstrap confidence. Because the computation differs for every query sequence, a bootstrap strategy is used to gauge how reliable each assignment is. The paper describes it as follows: For each query sequence, the collection of all eight-character subsequences (words) in the query was first calculated. Normally, when data consist of independent features, a bootstrap sample size equal to the number of features in the original sample is chosen. In this case, the number of completely independent features equals the number of nonoverlapping words. So for each bootstrap trial, a subset of one-eighth of the words was randomly chosen (with replacement) and the words in this subset were then used to calculate the joint probability. The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus. For higher-rank assignments, we sum the results for all genera under each taxon.
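
   The bootstrap procedure could be sketched as follows. The scoring function is passed in as `log_score`, a hypothetical callable that returns the log joint probability of a word sample under a genus. Rounding the one-eighth subset down with a minimum of one word, the fixed random seed, and the toy genus and family names are assumptions made here, not details from the excerpt.

```python
import random
from collections import Counter

def bootstrap_confidence(query_words, genera, log_score,
                         trials=100, k=8, seed=0):
    """Count how often each genus wins across bootstrap trials.

    Each trial draws len(query_words) // k words with replacement
    and assigns the sample to the genus with the highest score.
    """
    rng = random.Random(seed)
    pool = list(query_words)
    genera = list(genera)
    subset_size = max(1, len(pool) // k)  # assumption: floor, at least 1
    votes = Counter()
    for _ in range(trials):
        sample = rng.choices(pool, k=subset_size)  # with replacement
        best = max(genera, key=lambda g: log_score(sample, g))
        votes[best] += 1
    return votes

def rank_confidence(votes, genus_to_taxon):
    """Sum genus-level votes under each higher-rank taxon."""
    totals = Counter()
    for genus, count in votes.items():
        totals[genus_to_taxon[genus]] += count
    return totals

# Toy scorer: a word costs nothing if the genus "knows" it, else -5.
known = {"GenusA": {"w1", "w2", "w3", "w4"}, "GenusB": {"w1"}}

def toy_score(sample, genus):
    return sum(0.0 if w in known[genus] else -5.0 for w in sample)

query = ["w2", "w3", "w4"] * 8  # 24 words, so 3 words per trial
votes = bootstrap_confidence(query, known, toy_score)
print(votes["GenusA"])  # 100
print(rank_confidence(votes, {"GenusA": "FamilyX", "GenusB": "FamilyX"}))
```

   With 100 trials the vote count reads directly as a percentage, and the summed count for a higher rank can never be lower than that of any genus beneath it.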



https://blog.sciencenet.cn/blog-306699-941939.html
