Chen Jianhai's blog http://blog.sciencenet.cn/u/chenjianhai Science needs courage above all. If you dare, the world will be different.


When to supply the exome interval file in a whole-exome sequencing analysis workflow

Posted 2020-9-29 23:09 | Category: Research notes | Tags: exome sequencing

When should I restrict my analysis to specific intervals? 

This document covers the reasoning behind the use of genomic intervals. If you're looking for instructions on how to use intervals in practice, including argument details and supported formats, please see this doc.

Depending on what you're trying to do, there are many reasons why you might want to tell a tool to operate on a subset of genomic regions only. We distinguish four main types of reasons for doing so:

  • You want to run a quick test on a subset of data (often used in troubleshooting)

  • You want to parallelize execution of an analysis across genomic regions

  • You need to exclude regions that have bad or uninformative data where a tool is getting stuck

  • The analysis you're running should only take data from those subsets due to how the underlying algorithm works

The first three should be fairly self-explanatory, but let's go into a bit more detail on the fourth one.


In a nutshell

  • Whole genome analysis:
    Intervals are not required but they can help speed up analysis by eliminating "difficult" regions and enabling parallelism

  • Exome analysis and other targeted sequencing:
    You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.


Whole genome analysis

It is not strictly necessary to restrict analysis to intervals when working with whole genomes, since presumably you're interested in all of it. However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromeres) where you know the data is unreliable or very messy and causes excessive slowdowns. In addition, defining whole-genome intervals allows you to parallelize execution across intervals using the scatter-gather mode of parallelism.

We share the lists of "good" whole-genome intervals that we use in our production pipelines for human analysis in our resource bundle (see Download page).
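
The scatter-gather idea can be sketched in a few lines of Python. This is only an illustration of splitting intervals into balanced shards, not GATK's own implementation; the function names, the greedy balancing strategy, and the toy contig lengths are all assumptions made for the example.

```python
from typing import List, Tuple

# (contig, start, end), 1-based inclusive coordinates
Interval = Tuple[str, int, int]


def interval_length(iv: Interval) -> int:
    """Number of bases covered by a 1-based inclusive interval."""
    return iv[2] - iv[1] + 1


def scatter(intervals: List[Interval], n_shards: int) -> List[List[Interval]]:
    """Greedily assign intervals, largest first, to the lightest shard."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shards: List[List[Interval]] = [[] for _ in range(n_shards)]
    loads = [0] * n_shards
    for iv in sorted(intervals, key=interval_length, reverse=True):
        lightest = loads.index(min(loads))
        shards[lightest].append(iv)
        loads[lightest] += interval_length(iv)
    return shards


# Toy contigs with made-up lengths (hypothetical, not a real reference).
toy_intervals = [
    ("chr1", 1, 1000),
    ("chr2", 1, 800),
    ("chr3", 1, 600),
    ("chr4", 1, 400),
]
shards = scatter(toy_intervals, 2)
shard_loads = [sum(interval_length(iv) for iv in shard) for shard in shards]
```

Each shard would then be analysed as a separate job and the per-shard outputs merged afterwards (the "gather" step).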


Exome analysis and other targeted sequencing

By definition, exome sequencing and other targeted sequencing data don't cover the entire genome, so most analyses can be restricted to just the capture targets (genes or exons) to save processing time and enable scatter-gather parallelism. In addition, some processing steps, such as BQSR, should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and a source of noise.

You should use the list of target intervals that corresponds to the library preparation method that was used to generate the data. If you're working with exome sequencing data that was prepared by someone else, you'll need to find out what kit was used; the kit manufacturers typically provide the lists of intervals that correspond to their kits on their websites. We cannot provide you with a suitable interval list unless you are sure that your data was sequenced at the Broad.


Important notes:

Whatever you end up using intervals for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals you specified. Any data that falls outside these intervals will be lost to downstream analysis.

In general we recommend adding some padding to the intervals in order to include the flanking regions (typically about 100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use a list of intervals.
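
To make concrete what padding does to a target list, here is a minimal Python sketch. It is not the GATK engine's implementation: the function name is hypothetical, coordinates are assumed to be BED-style (0-based, half-open), and the sketch does not clamp intervals at contig ends because it has no knowledge of contig lengths.

```python
from typing import Iterable, List, Tuple

# (contig, start, end), BED-style 0-based half-open coordinates
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: Iterable[Interval], padding: int = 100) -> List[Interval]:
    """Extend each interval by `padding` bases on both sides, then merge
    intervals on the same contig that overlap or touch after padding."""
    padded = sorted(
        (contig, max(0, start - padding), end + padding)
        for contig, start, end in intervals
    )
    merged: List[Interval] = []
    for contig, start, end in padded:
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps or abuts the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


# Made-up targets for illustration only.
targets = [("chr1", 1000, 1200), ("chr1", 1350, 1500), ("chr1", 50, 80)]
padded_targets = pad_and_merge(targets, padding=100)
```

With 100 bp of padding, the two targets that were 150 bp apart now overlap and collapse into one interval, which is why padded target lists usually contain fewer, longer intervals than the unpadded ones.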

You will have noticed by now that this article does not give detailed guidelines on which tools should or should not use an interval list. For tool-by-tool recommendations, please see the example commands in the individual tool docs; they show the most common recommended usage for each. See also the Best Practices documentation for up-to-date implementation notes.


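    Whole-exome sequencing (WES) still plays an important role in identifying causal variants in many disease pedigrees.


    Unfortunately, the exome sequencing analysis workflows that can currently be found on Chinese-language websites tend to be vague, and many of them contain incorrect steps.


    This post clarifies a few of the key questions.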

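    1. How do whole-exome and whole-genome sequencing differ, and can the same workflow be used for both?

      No. According to https://gatk.broadinstitute.org/hc/en-us/articles/360035889551?id=4133, whole-exome data contains off-target reads, so the analysis has to be restricted to intervals. WGS analyses can of course use intervals as well, to call variants only in selected regions and thereby exclude SNPs in regions where the reference is of poor quality.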

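    2. How do you supply the exome interval information, and at which step?

      Answer: the English text at the top of this post is GATK's own documentation. It states that the -L argument should be supplied from the BQSR step onward, so that recalibration is restricted to the target intervals and off-target sequencing data is excluded.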
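
As a rough illustration of where -L enters the workflow, the Python sketch below assembles a GATK4 BaseRecalibrator command line with the target intervals attached. All file names are hypothetical placeholders and the helper function is invented for this example; the padding value of 100 follows the "about 100 bp" suggestion in the GATK text, and should be set to 0 if the interval file is already a padded one, so that padding is not applied twice.

```python
import shlex
from typing import List, Sequence


def base_recalibrator_cmd(
    reference: str,
    bam: str,
    known_sites: Sequence[str],
    targets: str,
    out_table: str,
    padding: int = 100,
) -> List[str]:
    """Build an argument list for `gatk BaseRecalibrator` restricted to targets."""
    cmd = ["gatk", "BaseRecalibrator", "-R", reference, "-I", bam]
    for vcf in known_sites:
        cmd += ["--known-sites", vcf]
    # Restrict recalibration to the capture targets.
    cmd += ["-L", targets]
    if padding > 0:
        cmd += ["--interval-padding", str(padding)]
    cmd += ["-O", out_table]
    return cmd


# Hypothetical file names, for illustration only.
cmd = base_recalibrator_cmd(
    reference="ref.fasta",
    bam="sample.dedup.bam",
    known_sites=["dbsnp.vcf.gz"],
    targets="targets.interval_list",
    out_table="sample.recal.table",
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where GATK is installed; printing it first makes it easy to confirm that the interval argument is really present.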

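    3. The Agilent exome capture kit comes with several BED files. Which one should be used?

      Answer: many people are unsure which of the Agilent files to use as the interval file. The question comes up often on English-language sites as well, and the answers given there are not very clear either; https://www.biostars.org/p/422896/ shows how common the question is.

      The Agilent whole-exome design ships with four BED files, one of which is a padded file. The GATK documentation quoted above recommends padded intervals, so the padded file is the one to use.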
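
One quick way to tell the BED files of a design apart is to compare how many bases each one covers; a padded file should cover noticeably more than the unpadded targets. The sketch below is an assumption-laden illustration: the function name and the toy BED lines are made up, it skips `browser` and `track` header lines, and it simply sums interval lengths, so it assumes the intervals within one file do not overlap.

```python
from typing import Iterable


def total_bases(bed_lines: Iterable[str]) -> int:
    """Sum interval lengths in BED text (0-based, half-open coordinates)."""
    total = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and browser/track header lines.
        if not line or line.startswith(("browser", "track", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        total += end - start
    return total


# Toy contents standing in for an unpadded and a padded file (made up).
unpadded_bed = [
    'track name="toy targets"',
    "chr1\t1000\t1200",
    "chr1\t1350\t1500",
]
padded_bed = [
    'track name="toy targets, padded"',
    "chr1\t900\t1600",
]
unpadded_total = total_bases(unpadded_bed)
padded_total = total_bases(padded_bed)
```

For real files, the same function can be applied to an open file handle, for example `total_bases(open(path))`, where `path` is whichever BED file you want to inspect.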

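    Chinese-language whole-exome workflows either pass off a whole-genome workflow as an exome one, or do not make clear where to add the argument that excludes off-target data. Some add it only at the HaplotypeCaller step, which is clearly not what the GATK documentation recommends.

    Some of the workflows recommended on Chinese websites have problems. For example, the analysis in this Zhihu post, https://zhuanlan.zhihu.com/p/137078769, does not use the correct padded file.

    So be careful when following a workflow found online. Do not trust it blindly.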




    https://blog.sciencenet.cn/blog-1224852-1252649.html
