Reference article: Metagenomic biomarker discovery and explanation
Journal: Genome Biology, 2011
The purpose of this tool is metagenomic biomarker discovery. How it works is shown in the figure below:
The workflow has three main steps. The input data consists of m samples, each with n features. The figure illustrates in detail the format of the input (a matrix with n rows and m columns) and the three steps performed by the computational tool: the KW rank sum test on classes, the pairwise Wilcoxon test between subclasses of different classes, and the LDA on the relevant features.
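To make the input layout concrete, here is a small hypothetical example in Python. The feature names, labels, and abundance values are all made up for illustration; only the shape (n feature rows, m sample columns, plus per-sample class and subclass labels) comes from the description above.

```python
# Hypothetical toy input: n = 3 features (rows) by m = 6 samples (columns).
# Every name and number below is invented for illustration.
sample_class = ["A", "A", "A", "B", "B", "B"]
sample_subclass = ["a1", "a1", "a2", "b1", "b2", "b2"]

abundance = {
    "feature_1": [0.30, 0.28, 0.35, 0.05, 0.07, 0.04],
    "feature_2": [0.10, 0.12, 0.09, 0.11, 0.10, 0.13],
    "feature_3": [0.02, 0.01, 0.03, 0.20, 0.25, 0.22],
}


def values_by_group(row, labels):
    """Split one feature's abundance vector by a per-sample label list."""
    groups = {}
    for value, label in zip(row, labels):
        groups.setdefault(label, []).append(value)
    return groups


# Abundances of one feature grouped by class, ready for a per-feature test.
by_class = values_by_group(abundance["feature_1"], sample_class)
by_subclass = values_by_group(abundance["feature_1"], sample_subclass)
```

Each test in the three steps works on one feature row at a time, so grouping a row by class or by subclass is the only data preparation the later steps need.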
The details are as follows.
Each of the n features is represented with a positive-valued vector containing its abundances in the m samples, and each sample is associated with values describing its class and, optionally, subclass and/or originating subject. (In other words, each sample has n features and also carries its class and subclass information.)
The factorial KW rank sum test is applied to each feature with respect to the class factor; the subclass and subject information are used as stratifying subgroups when present. Features that, according to the KW rank sum test, do not violate the null hypothesis of identical value distribution among classes (with default P-value, α = 0.05) are not further analyzed. (The KW rank sum test is one kind of rank sum test. It is applied to each feature to compare the classes; features with a p-value above 0.05 are filtered out, and features with a p-value below 0.05 are kept for further analysis.)
The pairwise Wilcoxon test is applied to retained features belonging to subclasses of different classes. For each feature, the pairwise Wilcoxon test is not satisfied if at least one comparison between subclasses has a P-value higher than the chosen α or if the sign of variation is not equal among all comparisons. For example, if a feature appears in samples from two classes with three subclasses each, all nine comparisons between subclasses in different classes must violate the null hypothesis, and all signs of the differences between medians must be consistent. The features that pass the pairwise Wilcoxon test are considered successful biomarkers. (For the features retained after the first step, the Wilcoxon rank sum test is used to check, based on the samples' subclasses, whether each feature differs between subclasses. Suppose class A has 3 subclasses and class B has 3 subclasses: each subclass of A is compared with each of the 3 subclasses of B. After these 9 tests, the features that show a difference in all 9 tests are selected and defined as biomarkers.)
An LDA model is finally built with the class as dependent variable and the remaining feature values, subclass, and subject values as independent variables.
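This model is used to estimate their effect sizes, which are obtained by averaging the differences between class means (using unmodified feature values) with the differences between class means along the first linear discriminant axis, which equally weights features’ variability and discriminatory power. (The final step is LDA, short for linear discriminant analysis. The class is the dependent variable, and the retained features, subclasses, and subjects are the independent variables. A linear discriminant model is built this way, the differences between class means before and after the model are used to compute a value, and that value is log-transformed to give the LDA score.)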
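The two filtering steps described above (the KW test on classes, then the pairwise Wilcoxon test between subclasses of different classes) can be sketched in plain Python for a single feature. This is a minimal sketch under stated assumptions, not LEfSe's own code: all function names are hypothetical, the p-values use the usual large-sample approximations without tie correction, only two classes are handled, and the stratification by subclass and subject in the KW step is left out.

```python
import math
from itertools import product
from statistics import median


def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(stat, df):
    """Upper tail probability of the chi-square distribution, integer df."""
    x = stat / 2
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(x)) if df % 2 else math.exp(-x)
    while a < df / 2:
        q += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_pvalue(groups):
    """Kruskal-Wallis p-value (chi-square approximation, no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    return chi2_sf(max(h, 0.0), len(groups) - 1)


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    w = sum(midranks(list(x) + list(y))[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2))


def is_biomarker(class_a, class_b, alpha=0.05):
    """class_a and class_b map subclass name -> abundances of one feature."""
    # Step 1: KW test on the class factor, subclasses pooled within each class.
    pooled_a = [v for vals in class_a.values() for v in vals]
    pooled_b = [v for vals in class_b.values() for v in vals]
    if kruskal_pvalue([pooled_a, pooled_b]) > alpha:
        return False
    # Step 2: every subclass pair across the two classes must be significant,
    # and the sign of the median difference must be the same for all pairs.
    signs = set()
    for x, y in product(class_a.values(), class_b.values()):
        if rank_sum_pvalue(x, y) > alpha:
            return False
        signs.add(median(x) > median(y))
    return len(signs) == 1


# Hypothetical feature: two classes, two subclasses each, four samples each.
class_a = {"a1": [10, 11, 12, 13], "a2": [14, 15, 16, 17]}
class_b = {"b1": [1, 2, 3, 4], "b2": [5, 6, 7, 8]}
print(is_biomarker(class_a, class_b))  # True: KW and all four pairs pass
```

With two subclasses per class there are 2 x 2 = 4 pairwise comparisons; with three subclasses each there would be the nine mentioned in the text. A feature is dropped as soon as one pair is not significant or one pair points in the opposite direction. The LDA step is not sketched here.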
For background material on LDA, see: http://blog.csdn.net/sunmenggmail/article/details/8071502
Sample size needs attention when using the method above. When few samples are available, non-parametric tests like the Wilcoxon have reduced power to detect differences. This can affect LEfSe when subclasses are very small, preventing the overall test from even rejecting the null hypothesis. For this reason, small subclasses should be avoided when possible, for example, by excluding them from the problem or by grouping together all subclasses with small cardinalities. For cases in which removing or grouping subclasses is not possible or disrupts the biological consistency of the analysis, LEfSe substitutes the Wilcoxon test with a test to compare whether subclass medians differ with the expected sign. The user can choose the subclass cardinality threshold at which this median comparison is substituted for the Wilcoxon test. (In short, when the sample size is small, the analysis can either merge subclasses or replace the Wilcoxon test.)
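The fallback for small subclasses can be sketched as a single comparison function. This is a hedged illustration only: the function names are hypothetical, `min_size=10` is an assumed placeholder for the user-chosen cardinality threshold, and the rank-sum p-value uses a normal approximation that assumes no tied values.

```python
import math
from statistics import median


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no ties)."""
    pooled = sorted(list(x) + list(y))
    w = sum(pooled.index(v) + 1 for v in x)  # rank sum of x, valid without ties
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2))


def subclass_pair_sign(x, y, alpha=0.05, min_size=10):
    """Return +1 or -1 for the direction of x versus y, or 0 if the pair fails.

    If both subclasses have at least min_size samples, the Wilcoxon test must
    be significant. Otherwise only the sign of the median difference is used.
    """
    diff = median(x) - median(y)
    if diff == 0:
        return 0
    if min(len(x), len(y)) >= min_size and rank_sum_pvalue(x, y) > alpha:
        return 0
    return 1 if diff > 0 else -1


# Three samples per subclass is below the assumed threshold, so only the
# medians are compared.
print(subclass_pair_sign([5, 6, 7], [1, 2, 3]))
```

The caller would then check that the returned sign matches the expected direction for every subclass pair, exactly as in the full pairwise Wilcoxon step. Lowering `min_size` makes more pairs go through the real test, at the cost of the reduced power described above.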