Capturing sequence diversity in metagenomes with comprehensive and scalable probe design

Nature Biotechnology [impact factor: 35.724]


Original article: https://www.nature.com/articles/s41587-018-0006-x

First authors: Hayden C Metsky, Katherine J Siddle

Other authors: Adrianne Gladden-Young, James Qu, David K Yang, Patrick Brehio, Andrew Goldfarb, Anne Piantadosi, Shirlee Wohl, Amber Carter, Aaron E Lin, Kayla G Barnes, Damien C Tully, Björn Corleis, Scott Hennigan, Giselle Barbosa-Lima, Yasmine R Vieira, Lauren M Paul, Amanda L Tan, Kimberly F Garcia, Leda A Parham, Ikponmwosa Odia, Philomena Eromon, Onikepe A Folarin, Augustine Goba, Etienne Simon-Lorière, Lisa Hensley, Angel Balmaseda, Eva Harris, Douglas S Kwon, Todd M Allen, Jonathan A Runstadler, Sandra Smole, Fernando A Bozza, Thiago M L Souza, Sharon Isern, Scott F Michael, Ivette Lorenzana, Lee Gehrke, Irene Bosch, Gregory Ebel, Donald S Grant, Christian T Happi, Daniel J Park, Andreas Gnirke, Pardis C Sabeti, Christian B Matranga

Primary affiliation: Broad Institute of MIT and Harvard


Published in Nature Biotechnology (NBT, IF 35.7), February 2019 issue: https://www.nature.com/nbt/volumes/37/issues/2


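In this paper, the team led by Hayden C. Metsky and Katherine J. Siddle at the Broad Institute of MIT and Harvard reports a major advance in probe design for metagenomic data. The method can design probes that cover complete viral genomes and can be applied efficiently to viral detection and sequence capture, which helps make metagenomic capture sequencing more sensitive and more cost-effective.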








Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.


Fig. 1 | Using CATCH for probe set design.
a, Sketch of CATCH’s approach to probe design, shown with three datasets (typically, each is a taxon). For each dataset d, CATCH generates candidate probes by tiling across input genomes and, optionally, reduces the number of them using locality-sensitive hashing. Then it determines a profile of where each candidate probe will hybridize (the genomes and regions within them) under a model with parameters θd (see Supplementary Fig. 1b for details). Using these coverage profiles, it approximates the smallest collection of probes that fully captures all input genomes (described in the text as s(d, θd)). Given a constraint on the total number of probes (N) and a loss function over θd, it searches for the optimal θd for all d.
b, Number of probes required to fully capture increasing numbers of HCV genomes. Approaches shown are simple tiling (gray), a clustering-based approach at two levels of stringency (red), and CATCH with three choices of parameter values specifying varying levels of stringency (blue). See Supplementary Note 2 for details regarding parameter choices. Previous approaches for targeting viral diversity use clustering in probe set design. The shaded regions around each line are 95% pointwise confidence bands calculated across randomly sampled input genomes.
c, Number of probes designed by CATCH for each dataset (of 296 datasets in total) among all 349,998 probes in the VALL probe set. Species incorporated in our sample testing are labeled.
d, Values of the two parameters selected by CATCH for each dataset in the design of VALL: number of mismatches to tolerate in hybridization and length of the target fragment (in nucleotides) on each side of the hybridized region assumed to be captured along with the hybridized region (cover extension). The label and size of each bubble indicate the number of datasets that were assigned a particular combination of values. Species included in our sample testing are labeled in black, and outlier species not included in our testing are in gray. In general, more diverse viruses (for example, HCV and HIV-1) are assigned more relaxed parameter values (here, high values) than less diverse viruses, but still require a relatively large number of probes in the design to cover known diversity (see c). Panels similar to c and d for the design of VWAFR are in Supplementary Fig. 3.
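The pipeline in panel a (tile candidate probes, compute where each one hybridizes, then approximate the smallest covering set) can be illustrated with a toy greedy set-cover sketch. This is not the CATCH implementation: the function names, the Hamming-distance hybridization model, and the single-dataset budget search are all assumptions made for illustration, and the joint search over all datasets with a loss function is not reproduced here.

```python
def tile(genome, probe_len, stride):
    """Candidate probes: windows of probe_len taken every `stride` bases."""
    last = len(genome) - probe_len
    if last < 0:
        return set()
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # make sure the 3' end is tiled too
    return {genome[s:s + probe_len] for s in starts}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def coverage_profile(probe, genomes, mismatches, extension):
    """(genome index, position) pairs assumed captured by this probe.

    Assumed model: the probe hybridizes wherever it matches a window with
    at most `mismatches` differences, and `extension` bases on each side
    are captured along with the hybridized region (cover extension).
    """
    covered = set()
    k = len(probe)
    for g, seq in enumerate(genomes):
        for s in range(len(seq) - k + 1):
            if hamming(probe, seq[s:s + k]) <= mismatches:
                lo = max(0, s - extension)
                hi = min(len(seq), s + k + extension)
                covered.update((g, p) for p in range(lo, hi))
    return covered


def design_probes(genomes, probe_len=4, stride=2, mismatches=0, extension=0):
    """Greedy set cover: repeatedly pick the probe covering the most new bases."""
    candidates = set()
    for seq in genomes:
        candidates |= tile(seq, probe_len, stride)
    profiles = {
        p: coverage_profile(p, genomes, mismatches, extension)
        for p in sorted(candidates)  # sorted for deterministic tie-breaking
    }
    uncovered = {(g, p) for g, seq in enumerate(genomes) for p in range(len(seq))}
    chosen = []
    while uncovered:
        best = max(profiles, key=lambda p: len(profiles[p] & uncovered))
        gained = profiles[best] & uncovered
        if not gained:
            raise ValueError("candidates cannot cover every input base")
        chosen.append(best)
        uncovered -= gained
    return chosen


def most_stringent_within_budget(genomes, budget, max_mismatches=3, **kwargs):
    """Smallest mismatch tolerance whose probe count fits the budget.

    A single-dataset simplification of the constrained parameter search.
    """
    for m in range(max_mismatches + 1):
        probes = design_probes(genomes, mismatches=m, **kwargs)
        if len(probes) <= budget:
            return m, probes
    return None
```

For the two toy genomes `"ACGTACGT"` and `"ACGTACGA"` with `probe_len=4` and `stride=4`, exact matching needs two probes, while tolerating one mismatch needs only one. This mirrors the trade-off in panels b and d, where more relaxed parameters reduce the number of probes required.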


Fig. 2 | Improvement in genome coverage and assembly, and shift in metagenomic distribution after capture.

a, Distribution of the enrichment in read depth, across viral genomes, provided by capture with VALL on 30 patient and environmental samples with known viral infections. Each curve represents one of the 31 viral genomes sequenced here (one sample contained two known viruses). At each position across a genome, the post-capture read depth is divided by the pre-capture depth, and the plotted curve is the empirical cumulative distribution of the log of these fold-change values. A curve that rises fully to the right of the black vertical line illustrates enrichment throughout the entirety of a genome; the more vertical a curve, the more uniform the enrichment. Read depth across viral genomes DENV-SM3 (purple) and DENV-SM5 (green) is shown in more detail in b.

b, Read depth throughout the DENV genome in two samples. DENV-SM3 (left) has few informative reads before capture and does not produce a genome assembly, but does following capture. DENV-SM5 (right) does yield a genome assembly before capture, and depth increases following capture.

c, Percent of each viral genome unambiguously assembled in the 30 samples, which had eight known viral infections across them. Shown before capture (orange), after capture with VWAFR (light blue), and after capture with VALL (dark blue). Red bars below samples indicate ones in which we could not assemble any contig before capture but in which, following capture, we were able to assemble at least a partial genome (> 50%).

d, Left, number of reads detected for each species across the 30 samples with known viral infections, before and after capture with VALL. Reads in each sample were downsampled to 200,000 reads. Each point represents one species detected in one sample. For each sample, the virus previously detected in the sample by another assay is colored. Homo sapiens matches in samples from humans are shown in black. Right, abundance of each detected species before capture and fold change upon capture with VALL for these samples. Abundance was calculated by dividing pre-capture read counts for each species by counts in pooled water controls. Coloring of human and viral species is as in the left panel.
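The curves in panel a can be reproduced in outline as follows: divide post-capture depth by pre-capture depth at each position, take the log, and build the empirical cumulative distribution. This is a minimal sketch under assumptions not stated in the caption: base-10 logs are used, and positions with zero depth before or after capture are skipped because the ratio or its log is undefined there. The function names are hypothetical.

```python
import math


def log_fold_changes(pre_depth, post_depth):
    """log10(post / pre) at each genome position with nonzero depth in both."""
    if len(pre_depth) != len(post_depth):
        raise ValueError("depth vectors must cover the same positions")
    values = []
    for pre, post in zip(pre_depth, post_depth):
        if pre > 0 and post > 0:  # assumption: skip undefined ratios
            values.append(math.log10(post / pre))
    return values


def ecdf(values):
    """Empirical CDF as (x, fraction of values <= x) pairs, one per distinct x."""
    xs = sorted(values)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        if points and points[-1][0] == x:
            points[-1] = (x, (i + 1) / n)  # ties: keep the highest fraction
        else:
            points.append((x, (i + 1) / n))
    return points
```

Under this convention, a genome enriched at every position yields a curve that lies entirely to the right of 0 (the black vertical line in the caption), and a steep curve means the fold change is similar across positions.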


Fig. 3 | Characterizing improvement in detection and preservation of within-sample diversity.

a, Amount of viral material sequenced in a dilution series of viral input in two amounts of human RNA background. There are n = 2 technical replicates for each choice of input copies, background amount, and use of capture (n = 1 replicate for the negative control with 0 copies). Each dot indicates the number of unique viral reads, among 200,000 in total, sequenced from a replicate; the line is through the mean of the replicates. The label to the right of each line indicates the amount of background material.

b, Relationship between probe–target identity and enrichment in read depth, as seen after capture with VALL and with VWAFR on an IAV sample of subtype H4N4 (IAV-SM5). Each point represents a window in the IAV genome. Identity between the probe and assembled H4N4 sequence is a measure of identity between the sequence in that window and the top 25% of probe sequences that map to it (see Methods for details). Fold change in depth is averaged over the window. No sequences of segment 6 (N) of the N4 subtypes were included in the design of VALL or VWAFR.

c, Effect of capture on the estimated frequency of within-sample co-infections. RNA of 2, 4, 6, and 8 viral species was spiked into RNA extracted from healthy human plasma and then captured with VALL and with VWAFR. Values on top are the percent of all sequenced reads that are viral. MeV is measles virus, MERS is Middle East respiratory syndrome coronavirus, MARV is Marburg virus, and NiV is Nipah virus. We did not detect NiV using the VWAFR probe set because this virus was not present in that design.

d, Effect of capture on the estimated frequency of within-host variants, shown in positions across three DENV samples: DENV-SM1, DENV-SM2, and DENV-SM5. Capture with VALL and VWAFR was performed on n = 2 replicates of the same library. ρC indicates the concordance correlation coefficient between the pre- and post-capture frequencies.
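Panel d summarizes agreement between pre- and post-capture variant frequencies with the concordance correlation coefficient. A minimal sketch follows, assuming the standard definition by Lin, ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), with 1/n variances; the caption does not spell out the formula, so treat this choice as an assumption.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between paired values."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and the same length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("undefined when both inputs are the same constant")
    return 2 * cov_xy / denom
```

Unlike the Pearson correlation, this statistic penalizes systematic shifts: frequencies that are perfectly correlated but uniformly offset after capture score below 1. That makes it a reasonable measure of whether capture preserves the frequencies themselves and not only their ranking.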

Fig. 4 | Genomic applications using capture: sequencing from the 2018 Lassa fever outbreak and of infections in uncharacterized samples.

a, Percent of the LASV genome assembled, after use of VALL, among 23 samples from the 2018 Lassa fever outbreak. Reads were downsampled to 200,000 reads before assembly. Bars are ordered by amount assembled and colored by the state in Nigeria that the sample is from.

b, Viral species present in uncharacterized mosquito pools and pooled human plasma samples from Nigeria and Sierra Leone after capture with VALL. Asterisks on species indicate ones that are not targeted by VALL. Detected viruses include Umatilla virus (UMAV), Alphamesonivirus 1 (AMNV1), West Nile virus (WNV), Culex flavivirus (CxFV), GBV-C, hepatitis B virus (HBV), LASV, and EBOV.

c, Abundance of all detected species before capture and fold change upon capture with VALL in the uncharacterized sample pools. Abundance was calculated as described in Fig. 2d. Viral species present in each sample (see b) are colored, and H. sapiens matches in the human plasma samples are shown in black.
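The quantities behind panels a and c (downsampling to a fixed read count, abundance relative to water controls, and fold change upon capture) can be sketched as below. The function names, the fixed random seed, and the choice to return `None` when a denominator is zero are assumptions for illustration; the captions do not say how zero counts were handled.

```python
import random


def downsample(reads, n=200_000, seed=0):
    """Random subsample of at most n reads from a list (seed is arbitrary)."""
    if len(reads) <= n:
        return list(reads)
    return random.Random(seed).sample(reads, n)


def abundance_and_fold_change(pre_counts, post_counts, control_counts):
    """Per-species (abundance, fold change) from read-count dictionaries.

    abundance   = pre-capture count / count in pooled water controls
    fold change = post-capture count / pre-capture count
    Either value is None when its denominator is zero (assumed handling).
    """
    result = {}
    for species, pre in pre_counts.items():
        control = control_counts.get(species, 0)
        abundance = pre / control if control > 0 else None
        fold = post_counts.get(species, 0) / pre if pre > 0 else None
        result[species] = (abundance, fold)
    return result
```

With made-up counts such as 50 LASV reads before capture, 5,000 after, and 5 in the water controls, this gives an abundance of 10 and a 100-fold change. Those numbers are illustrative only and are not taken from the paper.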


  1. Metsky HC, Siddle KJ, Gladden-Young A, et al. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat. Biotechnol. 2019, 37: 160-168.



