I came across this in the GATK best practices: "Duplicately sequenced molecules shouldn't be counted as additional evidence for or against a putative variant. By marking these reads as duplicates the algorithms in the GATK know to ignore them." I have searched online many times, but I still did not quite understand what a duplicate is or how it arises. After reading the thread http://seqanswers.com/forums/showthread.php?t=6854, I worked through malachig's reply, which I quote (lightly reformatted) below.

"A duplicate could be PCR effect or reading same fragment twice, there is no way to tell." Yes, this is the heart of the matter. The degree to which you expect duplicate fragments is highly dependent on the depth of the library and the type of library being sequenced (whole genome, exome, transcriptome, ChIP-Seq, etc.). The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. Furthermore, you gain the computational benefit of reducing the number of reads to be processed in downstream steps. However, removing duplicates comes at the cost of setting a hard cap (and a rather low one at that) on the dynamic range of measurement you are going to produce. This has been discussed in various places on this forum (not a complete list):

- Redundant reads are removed from ChIP-seq, what about RNA-seq
- Duplicate reads removal
- Threshold for duplicate removal
- Heavy read stacking on the solexa platform
- Should identical reads be reduced to a single count?
- Removing duplicate reads from multigig .csfasta
- Source of duplicate reads and possible ways of reducing

"Also, how do you define a duplicate?" As you suggest, this can be done in at least two ways, and both have caveats.

If you define duplicates as reads with identical sequence, you may underestimate the presence of duplicates in your library. Read errors present in your library will result in reads with apparently different sequences that could have actually come from PCR amplifications of the same cDNA fragment. Accurate removal of duplicates is therefore dependent on a low error rate in your library. One advantage of this approach is that it reduces the pool of reads before mapping and thereby reduces alignment time.

Using the mapping positions of the reads avoids (mostly) the influence of library sequencing error rate described above. However, two reads that have identical mapping positions do not necessarily represent the same underlying cDNA fragment, especially if you consider only the outer mapping coordinates (which is a common strategy). Reads derived from transcripts and then mapped to the genome might appear to have the same outer coordinates but have variations internally (corresponding to alternative isoforms with different exon boundaries, skipped exons, etc.). Furthermore, two reads may have completely identical mapping coordinates but still correspond to distinct cDNA fragments. Imagine two identically mapping reads that have a single base difference corresponding to a polymorphism: the cDNA fragments may be derived from the maternal and paternal alleles of a diploid genome. These are distinct fragments containing biologically pertinent information. If you removed duplicates purely on the basis of mapping coordinates, you might lose some of this information. As you can see, there is no simple answer to this question.
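To make the coordinate-based definition above concrete for myself, here is a minimal sketch of that idea, assuming pysam and a BAM file. It only groups reads by reference, 5' mapping position, and strand; real tools such as Picard MarkDuplicates additionally consider mate coordinates for paired reads and pick the representative read by base-quality sums, so this is an illustration, not a replacement for them.

```python
# Sketch of the coordinate-based duplicate definition: reads that share the
# same (reference, 5' position, strand) are grouped, and any group with more
# than one read contains candidate duplicates. Simplified relative to real
# duplicate markers (no mate coordinates, no quality-based tie-breaking).
from collections import defaultdict

import pysam


def candidate_duplicates(bam_path):
    """Group mapped primary reads by (reference, 5' position, strand)."""
    groups = defaultdict(list)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # The 5' end of a reverse-strand read is its rightmost coordinate.
            pos = read.reference_end if read.is_reverse else read.reference_start
            key = (read.reference_name, pos, read.is_reverse)
            groups[key].append(read.query_name)
    return {k: names for k, names in groups.items() if len(names) > 1}


# Hypothetical file name, for illustration only.
for key, names in candidate_duplicates("sample.bam").items():
    print(key, len(names), "reads stack here")
```

Note how this sketch also exposes the caveat from the quoted reply: two reads landing in the same group may still be distinct fragments (e.g. the two alleles of a heterozygous site), which is exactly why coordinate-only removal can discard biologically pertinent information.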
Identifying true duplicate fragments is not black-and-white. This perhaps argues for not trying to remove them in the first place. But on the other hand, the presence of duplicates that do represent PCR amplification bias can cause problems for downstream analysis. In general, having longer reads, and paired-end reads as opposed to single-end reads, can help in correctly identifying duplicates. But there is still no way to tell for sure whether a particular read corresponds to an amplification bias or not. In library types where you expect approximately uniform coverage of the target space (e.g. a whole genome library aligned to a reference genome), you can watch for areas of unusually bad read stacking and correct for them. But even with this you have to be careful, because your genome is not the reference genome, and genuine copy number variations can result in biologically meaningful deviations from uniform coverage that you may actually be interested in.

The type of experiment and analysis being conducted may be influential in deciding how to approach the issue. If you provide more details of your particular library type and analysis goals, someone may be able to provide specific advice. In expression analysis, I do not remove duplicates, but I do attempt to quantify library complexity for every library and watch out for libraries that seem to be outliers in this regard. You can view an example of this here: ALEXA-seq (see the figures under 'summary of library complexity - estimated by tag redundancy per million reads and compared to other libraries').

Ultimately, the best way to deal with duplicates that correspond to PCR amplification bias is to reduce their generation in the first place. Avoid using small amounts of input material (i.e. a molecular bottleneck) and keep amplification to an absolute minimum.

After working through this reply, I feel I more or less understand what duplicates are and where they come from.
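In the spirit of the per-library complexity check described in the reply, here is a rough sketch, again assuming pysam, that computes the fraction of reads flagged as duplicates in a BAM that has already been through a duplicate marker (e.g. Picard MarkDuplicates). The file name is hypothetical; the duplicate fraction is only a crude proxy for library complexity, useful mainly for spotting outlier libraries.

```python
# Rough library-redundancy check: fraction of primary, mapped reads that
# carry the duplicate flag set by an upstream duplicate-marking tool.
import pysam


def duplicate_fraction(bam_path):
    total = dups = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.is_duplicate:
                dups += 1
    return dups / total if total else 0.0


# Hypothetical file name, for illustration only.
print(f"duplicate fraction: {duplicate_fraction('sample.bam'):.2%}")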