# Unweighted and Weighted Options in UniFrac PCA Analysis Explained

In UniFrac PCoA analysis, the unweighted metric distinguishes environments by the branch length that is unique to each environment.

The UniFrac Metric

Calculating the UniFrac metric. The majority of options in the UniFrac interface make comparisons based on the UniFrac metric. The UniFrac metric measures the difference between two environments in terms of the branch length that is unique to one environment or the other. In the tree on the left (below), the division between the two environments (labeled red and blue) occurs very early in the tree, so that all of the branch length is unique to one environment or the other. This provides the maximum UniFrac distance possible, 1.0. In the tree on the right, every sequence in the first environment has a very similar counterpart in the other environment, so most of the branch length in the tree comes from nodes that have descendants in both environments. In the example, there is about as much branch length unique to each environment (red or blue) as shared between environments (purple), so the UniFrac value would be about 0.5. If the two environments were identical and all the same sequences were found in both, all the branch length would be shared and the UniFrac value would be 0.

The weighted UniFrac metric is designed to distinguish similar or identical sequences: it increases the weight given to the branch length of branches that carry multiple sequences.
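The unweighted calculation described above can be sketched in a few lines of Python. This is only an illustration, not the UniFrac software's own code: the toy tree `((a,b),(c,d))`, its branch lengths, and the function name `unweighted_unifrac` are all assumptions made for the example. Each branch is stored as its length plus the set of sequences that descend from it.

```python
# Hypothetical toy tree ((a,b),(c,d)), written as one entry per branch:
# (branch length, set of leaf sequences that descend from that branch).
BRANCHES = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (2.0, {"a", "b"}), (2.0, {"c", "d"}),
]


def unweighted_unifrac(branches, env_a, env_b):
    """Fraction of branch length that leads to only one of two environments."""
    unique = 0.0
    shared = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & env_a)  # any descendant sampled in environment A?
        in_b = bool(leaves & env_b)  # any descendant sampled in environment B?
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
        # Branches with no sampled descendants are ignored.
    total = unique + shared
    return unique / total if total else 0.0


print(unweighted_unifrac(BRANCHES, {"a", "b"}, {"c", "d"}))  # split at the root
print(unweighted_unifrac(BRANCHES, {"a", "c"}, {"b", "d"}))  # interleaved
print(unweighted_unifrac(BRANCHES, {"a", "b"}, {"a", "b"}))  # identical
```

On this toy tree the three calls give 1.0, 0.5 and 0.0, matching the three cases in the text: environments split at the root, environments interleaved across the tree, and identical environments.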

Calculating the Weighted UniFrac Metric. The UniFrac metric described above does not account for the relative abundance of sequences in the different environments because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them). Because the relative abundance of different kinds of bacteria can be critical for describing community changes, we have developed a variant of the algorithm, weighted UniFrac, which weights the branches based on abundance information during the calculations. Weighted UniFrac can thus detect changes in how many organisms from each lineage are present, as well as detecting changes in which organisms are present.
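As a rough sketch of the weighting idea, the raw weighted value can be written as the sum over branches of branch length times the absolute difference between the fractions of each community's sequences that descend from that branch. The toy tree, the counts, and the function name `weighted_unifrac` below are assumptions for illustration, not output from the UniFrac software.

```python
# Hypothetical toy tree ((a,b),(c,d)): (branch length, leaves below the branch).
BRANCHES = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (2.0, {"a", "b"}), (2.0, {"c", "d"}),
]


def weighted_unifrac(branches, counts_a, counts_b):
    """Raw weighted value: sum of length * |share of A - share of B| per branch."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    u = 0.0
    for length, leaves in branches:
        # Fraction of each community's sequences found below this branch.
        share_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        share_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        u += length * abs(share_a - share_b)
    return u


# Same two sequences in both communities, but with different abundances.
print(weighted_unifrac(BRANCHES, {"a": 3, "c": 1}, {"a": 1, "c": 3}))
# Twice as many sequences in the second community, same proportions.
print(weighted_unifrac(BRANCHES, {"a": 1, "c": 1}, {"a": 2, "c": 2}))
```

The first call gives 3.0 even though both communities contain exactly the same sequences, so the unweighted metric would report 0 for them. The second call gives 0.0 because dividing by each community's total removes the effect of unequal sample sizes.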

The figure below illustrates how the Weighted algorithm works. Branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the dataset. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and the grey branches have no weight. Branches 1 and 2 have heavy weights since their descendants are biased towards the squares and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization for different sample sizes.

Normalizing Weighted UniFrac Values. If the phylogenetic tree is not ultrametric (i.e. if different sequences in the sample have evolved at different rates), comparing environments with the Cluster Environments or PCA analysis options using weighted UniFrac will place more emphasis on communities that contain taxa that have evolved more quickly. This is because these taxa contribute more branch length to the tree. In some situations, it may be desirable to normalize the branch lengths within each sample. This normalization has the effect of treating each sample equally instead of treating each unit of branch length equally: the issues involved are similar to those involved in performing multivariate analyses using the correlation matrix, to treat each variable equally independent of scale, or using the covariance matrix, to take the scale into account. Normalization has the additional effect of placing all pairwise comparisons on the same scale as unweighted UniFrac (0 for identical communities, 1 for non-overlapping communities), allowing comparisons among different analyses with different samples. The scale of the raw weighted UniFrac value (u) depends on the average distance of each sequence from the root. The normalization to correct for this effect is performed by dividing u by the distance scale factor D, which is the average distance of each sequence from the root weighted by the number of times each sequence was observed in each community.
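A minimal sketch of the normalization, assuming D is computed as the sum over sequences of root distance times the sequence's share in each community (its count divided by that community's total). The toy tree, the counts, and every function name here are hypothetical and chosen only for illustration.

```python
# Hypothetical toy tree ((a,b),(c,d)): (branch length, leaves below the branch).
BRANCHES = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (2.0, {"a", "b"}), (2.0, {"c", "d"}),
]


def raw_weighted(branches, counts_a, counts_b):
    """Raw weighted value u: sum of length * |share of A - share of B|."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    u = 0.0
    for length, leaves in branches:
        share_a = sum(counts_a.get(leaf, 0) for leaf in leaves) / total_a
        share_b = sum(counts_b.get(leaf, 0) for leaf in leaves) / total_b
        u += length * abs(share_a - share_b)
    return u


def root_distance(branches, leaf):
    """Distance from the root: total length of all branches above the leaf."""
    return sum(length for length, leaves in branches if leaf in leaves)


def scale_factor(branches, counts_a, counts_b):
    """Scale factor D: root distances weighted by each sequence's share."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    d = 0.0
    for leaf in set(counts_a) | set(counts_b):
        share = counts_a.get(leaf, 0) / total_a + counts_b.get(leaf, 0) / total_b
        d += root_distance(branches, leaf) * share
    return d


def normalized_weighted(branches, counts_a, counts_b):
    """Weighted value rescaled to the 0..1 range by dividing u by D."""
    return raw_weighted(branches, counts_a, counts_b) / scale_factor(
        branches, counts_a, counts_b
    )


print(normalized_weighted(BRANCHES, {"a": 1, "b": 1}, {"c": 1, "d": 1}))
print(normalized_weighted(BRANCHES, {"a": 3, "c": 1}, {"a": 1, "c": 3}))
print(normalized_weighted(BRANCHES, {"a": 2, "c": 2}, {"a": 1, "c": 1}))
```

On this toy tree the non-overlapping pair gives 1.0, the pair that shares sequences at different abundances gives 0.5, and the pair with identical proportions gives 0.0, which is the same 0 to 1 scale as unweighted UniFrac.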

http://blog.sciencenet.cn/blog-491564-667282.html

