zczhou's personal blog http://blog.sciencenet.cn/u/zczhou



Posted 2013-3-5 11:11 | Category: Research Notes | Keywords: PCA, Weighted, Unifrac, unweighted, PCoA

In UniFrac PCoA analysis, "unweighted" means that environments are distinguished by the branch length that is unique to each environment.

The UniFrac Metric


Calculating the UniFrac metric. The majority of options in the UniFrac interface make comparisons based on the UniFrac metric. The UniFrac metric measures the difference between two environments in terms of the branch length that is unique to one environment or the other. In the tree on the left (below), the division between the two environments (labeled red and blue) occurs very early in the tree, so that all of the branch length is unique to one environment or the other. This provides the maximum UniFrac distance possible, 1.0. In the tree on the right, every sequence in the first environment has a very similar counterpart in the other environment, so most of the branch length in the tree comes from nodes that have descendants in both environments. In the example, there is about as much branch length unique to each environment (red or blue) as shared between environments (purple), so the UniFrac value would be about 0.5. If the two environments were identical and all the same sequences were found in both, all the branch length would be shared and the UniFrac value would be 0.
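The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the UniFrac program's own code: the tree representation (a list of branches, each with a length and the set of sequences below it), the function name `unweighted_unifrac`, and the two toy trees with their branch lengths are all assumptions made for the example.

```python
def unweighted_unifrac(branches, env_a, env_b):
    """Fraction of branch length that leads to only one of two environments.

    branches: list of (length, leaves) pairs, where `leaves` is the set of
              sequence names descending from that branch (assumed layout).
    env_a, env_b: sets of sequence names observed in each environment.
    """
    unique = 0.0  # branch length leading to exactly one environment
    total = 0.0   # branch length leading to at least one environment
    for length, leaves in branches:
        in_a = bool(leaves & env_a)
        in_b = bool(leaves & env_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total > 0 else 0.0


# Hypothetical tree 1: the environments split at the root, ((r1,r2),(b1,b2)).
split_early = [
    (1.0, {"r1", "r2"}), (0.5, {"r1"}), (0.5, {"r2"}),
    (1.0, {"b1", "b2"}), (0.5, {"b1"}), (0.5, {"b2"}),
]
# Hypothetical tree 2: each red sequence pairs with a blue one, ((r1,b1),(r2,b2)).
interleaved = [
    (1.0, {"r1", "b1"}), (0.5, {"r1"}), (0.5, {"b1"}),
    (1.0, {"r2", "b2"}), (0.5, {"r2"}), (0.5, {"b2"}),
]
red, blue = {"r1", "r2"}, {"b1", "b2"}
both = red | blue

d_split = unweighted_unifrac(split_early, red, blue)  # 1.0: nothing shared
d_mixed = unweighted_unifrac(interleaved, red, blue)  # 0.5: half shared
d_same = unweighted_unifrac(interleaved, both, both)  # 0.0: all shared
print(d_split, d_mixed, d_same)
```

The three calls correspond to the three cases in the text: an early split, paired sequences in both environments, and identical environments.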


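The weighted UniFrac metric is designed to account for similar or identical sequences: it gives more weight to the branch length of branches that lead to many sequences.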

Calculating the Weighted UniFrac Metric. The UniFrac metric described above does not account for the relative abundance of sequences in the different environments because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). Because the relative abundance of different kinds of bacteria can be critical for describing community changes, we have developed a variant of the algorithm, weighted UniFrac, which weights the branches based on abundance information during the calculations. Weighted UniFrac can thus detect changes in how many organisms from each lineage are present, as well as detecting changes in which organisms are present.

The figure below illustrates how the Weighted algorithm works. Branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the dataset. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and the grey branches have no weight. Branches 1 and 2 have heavy weights since their descendants are biased towards the squares and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization for different sample sizes.
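A rough Python sketch of this weighting follows. It assumes that each branch is weighted by the absolute difference between the fractions of square and circle sequences descending from it, which matches the description above; the function names, the three-branch toy tree, and the counts are hypothetical and are not read from the figure.

```python
def branch_contributions(branches, counts_a, counts_b):
    """Per-branch terms length * |A_i/A_T - B_i/B_T| (assumed weighting).

    branches: list of (length, leaves) pairs; `leaves` is the set of
              sequence names descending from the branch.
    counts_a, counts_b: dicts mapping sequence name to the number of times
              it was observed in each community.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    terms = []
    for length, leaves in branches:
        frac_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        frac_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        terms.append(length * abs(frac_a - frac_b))
    return terms


def weighted_unifrac(branches, counts_a, counts_b):
    """Raw (unnormalized) weighted UniFrac value: the sum of all terms."""
    return sum(branch_contributions(branches, counts_a, counts_b))


# Hypothetical data: three branches of length 1, twice as many circles.
toy_branches = [(1.0, {"x"}), (1.0, {"y"}), (1.0, {"z"})]
squares = {"x": 2, "z": 1}  # 3 square sequences in total
circles = {"y": 4, "z": 2}  # 6 circle sequences in total

terms = branch_contributions(toy_branches, squares, circles)
u = weighted_unifrac(toy_branches, squares, circles)
print(terms, u)
```

Branch `z` carries one of three squares and two of six circles, so after dividing by the community totals its two fractions are equal and it adds nothing, like branch 3 in the figure. Branches `x` and `y` lead to only one community each and carry all of the weight.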





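The normalized weighted option divides the UniFrac distance value by the distance scale factor D, so that taxa with different evolutionary rates are treated on the same scale when the UniFrac distance is calculated. Fast-evolving lineages have long branches; the normalized weighted method compares branches of unequal length relative to the average distance of the sequences from the root, i.e. it removes the effect of scale.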

Normalizing Weighted UniFrac Values. If the phylogenetic tree is not ultrametric (i.e. if different sequences in the sample have evolved at different rates), comparing environments with the Cluster Environments or PCA analysis options using weighted UniFrac will place more emphasis on communities that contain taxa that have evolved more quickly. This is because these taxa contribute more branch length to the tree. In some situations, it may be desirable to normalize the branch lengths within each sample. This normalization has the effect of treating each sample equally instead of treating each unit of branch equally: the issues involved are similar to those involved in performing multivariate analyses using the correlation matrix, to treat each variable equally independent of scale, or using the covariance matrix, to take the scale into account. Normalization has the additional effect of placing all pairwise comparisons on the same scale as unweighted UniFrac (0 for identical communities, 1 for non-overlapping communities), allowing comparisons among different analyses with different samples. The scale of the raw weighted UniFrac value (u) depends on the average distance of each sequence from the root. The normalization to correct for this effect is performed by dividing u by the distance scale factor D (see equation below), which is the average distance of each sequence from the root weighted by the number of times each sequence was observed in each community.
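The equation itself is not reproduced in this post. As a hedged sketch, the Python below assumes the form usually given for the scale factor, D = sum over sequences j of d_j * (a_j/A_T + b_j/B_T), where d_j is the distance of sequence j from the root, a_j and b_j are its counts in the two communities, and A_T and B_T are the community totals. The raw value u is assumed to be the sum over branches of length * |A_i/A_T - B_i/B_T|. All function names and the toy tree are hypothetical.

```python
def raw_weighted_unifrac(branches, counts_a, counts_b):
    """Raw value u = sum_i length_i * |A_i/A_T - B_i/B_T| (assumed form)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    u = 0.0
    for length, leaves in branches:
        frac_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        frac_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        u += length * abs(frac_a - frac_b)
    return u


def root_distance(branches, leaf):
    """Distance from root to leaf: total length of branches above the leaf."""
    return sum(length for length, leaves in branches if leaf in leaves)


def scale_factor(branches, counts_a, counts_b):
    """Assumed form: D = sum_j d_j * (a_j/A_T + b_j/B_T)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    d = 0.0
    for leaf in set(counts_a) | set(counts_b):
        weight = counts_a.get(leaf, 0) / total_a + counts_b.get(leaf, 0) / total_b
        d += root_distance(branches, leaf) * weight
    return d


def normalized_weighted_unifrac(branches, counts_a, counts_b):
    """u / D, which should fall between 0 and 1."""
    d = scale_factor(branches, counts_a, counts_b)
    return raw_weighted_unifrac(branches, counts_a, counts_b) / d if d else 0.0


# Hypothetical non-ultrametric tree: a2 sits on a much longer branch than a1.
toy_tree = [
    (0.5, {"a1", "a2"}), (1.0, {"a1"}), (3.0, {"a2"}),
    (2.0, {"b1"}),
]
comm_a = {"a1": 1, "a2": 1}
comm_b = {"b1": 2}

u_raw = raw_weighted_unifrac(toy_tree, comm_a, comm_b)          # 4.5
d_scale = scale_factor(toy_tree, comm_a, comm_b)                # 4.5
n_apart = normalized_weighted_unifrac(toy_tree, comm_a, comm_b)  # 1.0
n_same = normalized_weighted_unifrac(toy_tree, comm_a, comm_a)   # 0.0
print(u_raw, d_scale, n_apart, n_same)
```

For the two non-overlapping communities the raw value depends on how long the branches happen to be (4.5 here), while dividing by D brings it to 1, the same scale as unweighted UniFrac. Identical communities give 0 either way.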


