Article: Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples
Journal: PLoS Comput Biol.
Year: 2009
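Raw data:
The input is a table whose values can be any quantity to be analyzed. The paper describes them as "the relative abundance of specific features within each sample", for example the number of 16S rRNA clones assigned to a specific taxon, or the number of reads mapping to a given pathway. Presumably one such table is produced for each group, with rows as features and columns as samples.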
Data normalization:
Each value in the table is expressed as a relative abundance within its sample: the cell in the ith row and the jth column (denoted $f_{ij}$) is the proportion of taxon i observed in individual j.
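The normalization step above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_columns` and the toy count matrix are hypothetical, and the raw input is assumed to be a feature-by-sample table of counts.

```python
def normalize_columns(counts):
    """Convert a feature-by-sample count matrix into per-sample proportions.

    counts[i][j] is the raw count of feature i in sample j; the returned
    f[i][j] is that count divided by the total of column j.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    if any(t == 0 for t in totals):
        raise ValueError("every sample needs at least one observation")
    return [[row[j] / totals[j] for j in range(n_samples)] for row in counts]


# Hypothetical toy data: 3 taxa (rows) x 4 subjects (columns).
counts = [
    [10, 20, 5, 0],
    [30, 20, 5, 8],
    [60, 60, 10, 2],
]
f = normalize_columns(counts)
```

After normalization every column sums to 1, so samples with different sequencing depths become comparable.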
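Analysis of differential abundance:
For each group, compute the mean and standard deviation of every taxon (feature).
Then compute a t statistic for each feature to test whether its abundance differs between the two groups.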
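These notes do not reproduce the formula, so the sketch below assumes the usual two-sample t statistic with unpooled variances, $t_i = (\bar{x}_{i1} - \bar{x}_{i2}) / \sqrt{s_{i1}^2/n_1 + s_{i2}^2/n_2}$. The function name `t_statistic` and the example values are hypothetical.

```python
import math
import statistics


def t_statistic(group1, group2):
    """Two-sample t statistic with unpooled variances (assumed form)."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.fmean(group1), statistics.fmean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    se = math.sqrt(var1 / n1 + var2 / n2)
    if se == 0:
        # Both groups are constant: 0 if they agree, otherwise signed infinity.
        return 0.0 if mean1 == mean2 else math.copysign(math.inf, mean1 - mean2)
    return (mean1 - mean2) / se


# Hypothetical proportions of one taxon in two treatment groups.
t = t_statistic([0.10, 0.12, 0.14], [0.20, 0.22, 0.24])
```

A negative value here means the feature is less abundant in the first group than in the second.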
Assessing significance:
Converting the t statistic into a p-value in the usual way relies on the data being normally distributed. The method does not assume normality and instead estimates the null distribution by permutation: "We do not make this assumption, but rather estimate the null distribution of $t_i$ non-parametrically using a permutation method." "This procedure, also known as the nonparametric t-test has been shown to provide accurate estimates of significance when the underlying distributions are non-normal."
The permutation procedure: "we randomly permute the treatment labels of the columns of the abundance matrix and recalculate the t statistics. Note that the permutation maintains that there are n1 replicates for treatment 1 and n2 replicates for treatment 2." Repeating this procedure for B trials gives B sets of t statistics, $t_1^{0b}, \ldots, t_M^{0b}$, b = 1, …, B, where M is the number of rows in the matrix. For each row (feature), the p-value associated with the observed t statistic is the fraction of the B permuted statistics for that row that are greater than or equal to the observed $t_i$.
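A minimal sketch of the per-feature permutation test follows. It makes two assumptions that the notes above do not state explicitly: the t statistic uses unpooled variances, and statistics are compared by absolute value (a two-sided test). All function names and the toy matrix are hypothetical.

```python
import math
import random
import statistics


def t_stat(values, labels):
    """t statistic (unpooled variances) for one feature given 0/1 labels."""
    g1 = [v for v, lab in zip(values, labels) if lab == 0]
    g2 = [v for v, lab in zip(values, labels) if lab == 1]
    diff = statistics.fmean(g1) - statistics.fmean(g2)
    se = math.sqrt(statistics.variance(g1) / len(g1)
                   + statistics.variance(g2) / len(g2))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_pvalues(matrix, labels, n_perm=1000, seed=0):
    """Fraction of permutations with |t| >= the observed |t|, per feature."""
    rng = random.Random(seed)
    observed = [abs(t_stat(row, labels)) for row in matrix]
    exceed = [0] * len(matrix)
    shuffled = list(labels)
    for _ in range(n_perm):
        # Shuffling the labels keeps n1 and n2 fixed, as the method requires.
        rng.shuffle(shuffled)
        for i, row in enumerate(matrix):
            if abs(t_stat(row, shuffled)) >= observed[i]:
                exceed[i] += 1
    return [count / n_perm for count in exceed]


# Hypothetical data: feature 0 differs between groups, feature 1 does not.
labels = [0] * 5 + [1] * 5
matrix = [
    [0.10, 0.11, 0.12, 0.13, 0.14, 0.50, 0.51, 0.52, 0.53, 0.54],
    [0.30, 0.20, 0.30, 0.20, 0.30, 0.20, 0.30, 0.20, 0.30, 0.20],
]
pvals = permutation_pvalues(matrix, labels)
```

Because the labels are shuffled once per trial and reused for every row, each permutation is applied to whole columns, which preserves the correlation between features.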
Small sample sizes: "This approach is inadequate for small sample sizes in which there are a limited number of possible permutations of all columns." As a heuristic, if fewer than 8 subjects are used in either treatment, all permuted t statistics are pooled into one null distribution, and the p-value of feature i is estimated as the fraction of all B × M pooled statistics that are greater than or equal to the observed $t_i$.
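The pooled variant can be sketched as below, assuming the permuted statistics have already been computed and that comparisons use absolute values. The function name `pooled_pvalues` and the numbers are hypothetical.

```python
from bisect import bisect_left


def pooled_pvalues(observed_t, permuted_t):
    """p-values against one null distribution pooled over all features.

    permuted_t[b][j] is the statistic of feature j in permutation b, so the
    pool holds B * M values in total.
    """
    pool = sorted(abs(t) for perm in permuted_t for t in perm)
    total = len(pool)
    pvals = []
    for t in observed_t:
        # Number of pooled statistics that are >= |t|.
        n_ge = total - bisect_left(pool, abs(t))
        pvals.append(n_ge / total)
    return pvals


# Hypothetical statistics: M = 2 features, B = 3 permutations.
observed = [3.0, 0.5]
permuted = [[0.1, -2.0], [1.0, -0.4], [3.5, 0.2]]
pooled = pooled_pvalues(observed, permuted)
```

Pooling gives B × M null values instead of B, which is what makes the estimate usable when few distinct label permutations exist.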
The cutoff of 8 was chosen empirically: "Note that the choice of 8 for the cutoff is simply heuristic based on experiments during the implementation of our method." The approach is specifically targeted at datasets comprising multiple subjects; for small datasets, approaches such as that proposed by Rodriguez-Brito et al. might be more appropriate.
The software uses 1000 permutations: "Unless explicitly stated, all experiments described below used 1000 permutations." The number of permutations is tied to the significance threshold, because it bounds the smallest p-value that can be estimated: "In general, the number of permutations should be chosen as a function of the significance threshold used in the experiment. Specifically, a permutation test with B permutations can only estimate p-values as low as 1/B (in our case $10^{-3}$)."
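Two quick calculations make these limits concrete: the resolution 1/B of a test with B permutations, and the number of distinct ways to assign treatment labels, which is what becomes limiting for small groups. The helper names are hypothetical.

```python
import math


def smallest_estimable_pvalue(n_perm):
    """Resolution of a permutation test with n_perm permutations (1/B)."""
    return 1 / n_perm


def distinct_label_assignments(n1, n2):
    """Number of ways to pick which n1 of the n1 + n2 columns get label 1."""
    return math.comb(n1 + n2, n1)


resolution = smallest_estimable_pvalue(1000)
few = distinct_label_assignments(3, 3)
many = distinct_label_assignments(8, 8)
```

With 3 subjects per treatment there are only 20 distinct label assignments, so 1000 random permutations mostly repeat each other; with 8 per treatment there are 12870.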
Multiple hypothesis testing correction:
Instead of the Bonferroni correction, the paper controls the false discovery rate (FDR). "In this context, the significance of a test is measured by a q-value, an individual measure of the FDR for each test."
Given an ordered list of p-values $p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$ (where m is the total number of features) and a range of values $\lambda = 0, 0.01, 0.02, \ldots, 0.90$, compute $\hat{\pi}_0(\lambda) = \#\{p_j > \lambda\} / (m(1-\lambda))$. Next, fit $\hat{\pi}_0(\lambda)$ with a cubic spline with 3 degrees of freedom, denoted $\hat{f}$, and let $\hat{\pi}_0 = \hat{f}(1)$. Finally, estimate the q-value corresponding to each ordered p-value. First, $\hat{q}(p_{(m)}) = \hat{\pi}_0 \cdot p_{(m)}$. Then, for i = m-1, m-2, …, 1, $\hat{q}(p_{(i)}) = \min(\hat{\pi}_0 \cdot m \cdot p_{(i)} / i,\ \hat{q}(p_{(i+1)}))$.
The hypothesis test with p-value $p_{(i)}$ then has a corresponding q-value of $\hat{q}(p_{(i)})$. Note that this method yields conservative estimates of the true q-values, i.e. $\hat{q}(p_{(i)}) \ge q(p_{(i)})$. The software lets users choose either p-value or q-value thresholds, irrespective of the complexity of the data.
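The q-value recursion can be sketched as follows. One simplification is made on purpose and is not what the paper does: instead of smoothing the $\hat{\pi}_0(\lambda)$ estimates with a cubic spline, `estimate_pi0` evaluates the estimator $\#\{p_j > \lambda\} / (m(1-\lambda))$ at a single fixed λ. Function names and the example p-values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Proportion of true nulls at one fixed lambda (stand-in for the spline)."""
    m = len(pvalues)
    pi0 = sum(p > lam for p in pvalues) / (m * (1 - lam))
    return min(pi0, 1.0)


def q_values(pvalues, pi0):
    """Step-down q-values: start at the largest p-value, then take running minima."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    q = [0.0] * m
    # q for the largest p-value p_(m) is pi0 * p_(m).
    running_min = pi0 * pvalues[order[-1]]
    q[order[-1]] = running_min
    # For i = m-1, ..., 1: q_(i) = min(pi0 * m * p_(i) / i, q_(i+1)).
    for rank in range(m - 1, 0, -1):
        idx = order[rank - 1]
        running_min = min(pi0 * m * pvalues[idx] / rank, running_min)
        q[idx] = running_min
    return q


# Hypothetical p-values for four features.
pvals = [0.01, 0.04, 0.03, 0.8]
pi0_hat = estimate_pi0(pvals)
q = q_values(pvals, pi0=1.0)  # pi0 = 1 is the most conservative choice
```

The running minimum guarantees that q-values are monotone in the p-values, so a feature with a smaller p-value never gets a larger q-value.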
Handling sparse counts:
Rare features are handled with Fisher's exact test: "We compare the differential abundance of sparsely-sampled (rare) features using Fisher's exact test."
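A minimal two-sided Fisher's exact test on a 2x2 table is sketched below. The table layout is an assumption for illustration: rows are the two treatments, and columns are pooled counts of the rare feature versus all other observations. The function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(total, 1.0)


# Hypothetical pooled counts: the feature is seen once among 1000 observations
# in treatment 1 and 9 times among 1000 observations in treatment 2.
p_rare = fisher_exact_two_sided(1, 999, 9, 991)
```

The test works on pooled counts rather than per-sample proportions, which is why it remains usable when most samples contain zero observations of the feature.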
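In summary, features are split between two tests: a t-test and Fisher's exact test. For the t-test, the null distribution is estimated by permutation and used to compute p-values, and the FDR (q-values) then sets the threshold for calling a difference significant.
The software is aimed mainly at comparisons between two groups, and it covers both widely distributed taxa (permutation t-test) and sparsely distributed taxa (Fisher's exact test).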