NBT: Four New-Year Papers in an IF-35 Journal Focus on Metagenomics

Posted 2019-2-10 23:21 | Personal category: News | System category: Research notes

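Nature Biotechnology (NBT, IF 35.7) published eight Research papers in its February 2019 issue (https://www.nature.com/nbt/volumes/37/issues/2): three Letters, three Articles, and two Resources. Four of them (two Articles and two Resources) report advances in metagenomics, and the paper on ultrafast search of bacterial genomes is the issue's cover story.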

The four papers are summarized below:

1. Ultrafast search of bacterial genomes

Ultrafast search of all deposited bacterial and viral genomic data

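Prof. Zamin Iqbal's team at the Wellcome Trust Centre for Human Genetics, University of Oxford, reports a breakthrough in ultrafast search algorithms for metagenomic data. The method makes it possible to integrate, update, and rapidly index bacterial and viral genomes deposited worldwide, and its index uses four orders of magnitude less storage than previous methods. The study is the cover paper of this issue of Nature Biotechnology and is recommended reading.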

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as realtime genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
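
The abstract names the data structure but does not spell it out here, so the following is only a toy sketch of the general bitsliced-signature idea, not the BIGSI implementation. It assumes one Bloom filter of k-mers per dataset, stored row-wise so that each row is a bitmask over datasets. The class name `ToyBitslicedIndex`, the parameters `m`, `h`, and `k`, and the SHA-256-based hashing are all illustrative choices.

```python
import hashlib


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer, m, h):
    """Map a k-mer to h Bloom-filter positions in [0, m) using salted SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{j}:{kmer}".encode()).digest()[:8], "big") % m
        for j in range(h)
    ]


class ToyBitslicedIndex:
    """Toy index: rows[i] has bit j set if dataset j's Bloom filter has bit i set."""

    def __init__(self, m=2048, h=3, k=5):
        self.m, self.h, self.k = m, h, k
        self.rows = [0] * m   # one Python int per Bloom-filter bit position
        self.names = []       # dataset j is column j

    def add(self, name, sequences):
        """Add a dataset as a new column; existing columns are untouched."""
        col = len(self.names)
        self.names.append(name)
        for seq in sequences:
            for km in kmers(seq, self.k):
                for p in positions(km, self.m, self.h):
                    self.rows[p] |= 1 << col

    def search(self, query):
        """Return datasets whose filters contain every k-mer of the query."""
        hits = (1 << len(self.names)) - 1          # start with all datasets
        for km in kmers(query, self.k):
            for p in positions(km, self.m, self.h):
                hits &= self.rows[p]               # AND the relevant bit-slices
        return [n for j, n in enumerate(self.names) if (hits >> j) & 1]


index = ToyBitslicedIndex()
index.add("sampleA", ["ACGTACGTTGCA"])
index.add("sampleB", ["TTTTGGGGCCCCAAAA"])
print(index.search("ACGTACG"))  # always includes sampleA; others would be false positives
```

In this layout a query touches only a handful of rows regardless of how many datasets are indexed, and adding a dataset only sets bits in a new column, which is consistent with the incremental growth the abstract describes. Bloom filters can return false positives but never false negatives.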

Sequence search methods

Fig. 1 | Sequence matching methods.
a, Mapping of sequence reads to a reference genome from the same species, assuming relatively low divergence; requirement to map millions of reads in acceptable time and return an alignment and mapping score. Common tools: bwa and bowtie.
b, BLAST compares a query string with a database of reference genomes (in the figure we show RefSeq genomes in a dotted box) covering a massive phylogenetic range. BLAST takes k-mers from the query, and for each k-mer it creates a ‘neighborhood’ of k-mers within a fixed edit distance (edits are shown in red, b(iii)), and searches for these in the reference genome database. Alignment is only done by extending from these hits. Blast can be applied to nucleotide and protein searches and can find close and remote homology matches.
c, MASH stores a tiny fingerprint of each reference in the database (in this case RefSeq). Querying with an assembly, the fingerprint of the assembly is compared with that of RefSeq to find the closest reference.
d, Sequence Bloom Tree was the first scalable method to search through raw unassembled readsets (unassembled readsets are shown as ‘piles’ of reads (short lines), all in same color to signify same species), by indexing the k-mers in the data and then compressing the index. Designed for human data, SBT can be applied to find which RNA-seq datasets contain a given transcript.
e, BIGSI can search the complete set of raw sequence data for bacteria and viruses. RefSeq is shown in a dotted box amongst unassembled readsets; different colors to signify the massive range of species and phyla. The different input data for SBT and BIGSI mean that these methods have different speed and compression trade-offs.
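
Panel c describes comparing small fingerprints instead of whole genomes. A common way to build such a fingerprint is a bottom-s MinHash sketch, illustrated below as a minimal example. The function names, the k-mer length, the sketch size, and the SHA-1-based hash are assumptions made for illustration and do not reproduce MASH's actual parameters or code.

```python
import hashlib


def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k=4, size=16):
    """Fingerprint: the `size` smallest k-mer hashes of the sequence."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(a, b, size=16):
    """Estimate k-mer Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:size]   # smallest hashes of the union
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


reference = sketch("ACGTACGTTGCAACGTTGCA")
assembly = sketch("ACGTACGTTGCAACGTTGCT")
print(jaccard_estimate(reference, assembly))
```

The closest reference is then the one with the highest estimate, so a query needs only the stored sketches rather than the full sequences.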

2. Comprehensive and scalable probe design for capturing sequence diversity in metagenomes

Capturing sequence diversity in metagenomes with comprehensive and scalable probe design

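The team of Hayden C. Metsky and Katherine J. Siddle at the Broad Institute of MIT and Harvard reports a breakthrough in probe design for metagenomic data. The method designs probes against complete viral genomes and works efficiently for viral detection and sequence capture, enabling more sensitive and cost-effective metagenomic capture sequencing.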

Abstract

Metagenomic sequencing has the potential to transform microbial detection and characterization, but new tools are needed to improve its sensitivity. Here we present CATCH, a computational method to enhance nucleic acid capture for enrichment of diverse microbial taxa. CATCH designs optimal probe sets, with a specified number of oligonucleotides, that achieve full coverage of, and scale well with, known sequence diversity. We focus on applying CATCH to capture viral genomes in complex metagenomic samples. We design, synthesize, and validate multiple probe sets, including one that targets the whole genomes of the 356 viral species known to infect humans. Capture with these probe sets enriches unique viral content on average 18-fold, allowing us to assemble genomes that could not be recovered without enrichment, and accurately preserves within-sample diversity. We also use these probe sets to recover genomes from the 2018 Lassa fever outbreak in Nigeria and to improve detection of uncharacterized viral infections in human and mosquito samples. The results demonstrate that CATCH enables more sensitive and cost-effective metagenomic sequencing.

Probe design with CATCH

Fig. 1 | Using CATCH for probe set design.
a, Sketch of CATCH’s approach to probe design, shown with three datasets (typically, each is a taxon). For each dataset d, CATCH generates candidate probes by tiling across input genomes and, optionally, reduces the number of them using locality-sensitive hashing. Then it determines a profile of where each candidate probe will hybridize (the genomes and regions within them) under a model with parameters θd (see Supplementary Fig. 1b for details). Using these coverage profiles, it approximates the smallest collection of probes that fully captures all input genomes (described in the text as s(d, θd)). Given a constraint on the total number of probes (N) and a loss function over θd, it searches for the optimal θd for all d.
b, Number of probes required to fully capture increasing numbers of HCV genomes. Approaches shown are simple tiling (gray), a clustering-based approach at two levels of stringency (red), and CATCH with three choices of parameter values specifying varying levels of stringency (blue). See Supplementary Note 2 for details regarding parameter choices. Previous approaches for targeting viral diversity use clustering in probe set design. The shaded regions around each line are 95% pointwise confidence bands calculated across randomly sampled input genomes.
c, Number of probes designed by CATCH for each dataset (of 296 datasets in total) among all 349,998 probes in the VALL probe set. Species incorporated in our sample testing are labeled.
d, Values of the two parameters selected by CATCH for each dataset in the design of VALL: number of mismatches to tolerate in hybridization and length of the target fragment (in nucleotides) on each side of the hybridized region assumed to be captured along with the hybridized region (cover extension). The label and size of each bubble indicate the number of datasets that were assigned a particular combination of values. Species included in our sample testing are labeled in black, and outlier species not included in our testing are in gray. In general, more diverse viruses (for example, HCV and HIV-1) are assigned more relaxed parameter values (here, high values) than less diverse viruses, but still require a relatively large number of probes in the design to cover known diversity (see c). Panels similar to c and d for the design of VWAFR are in Supplementary Fig. 3.
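
Panel a says that CATCH approximates the smallest collection of probes that covers all input genomes. One standard way to approximate such a minimum cover is the greedy set-cover heuristic, sketched below. This illustrates the general technique only: the function name `greedy_probe_cover` and the toy coverage profiles are hypothetical, and the caption does not confirm that CATCH proceeds exactly this way.

```python
def greedy_probe_cover(coverage):
    """Pick probes greedily until every target is covered.

    coverage maps a probe name to the set of targets (for example
    (genome, region) pairs) that the probe is predicted to hybridize to.
    """
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Take the probe that covers the most still-uncovered targets.
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical coverage profiles for four candidate probes.
profiles = {
    "p1": {1, 2, 3},
    "p2": {3, 4},
    "p3": {4, 5},
    "p4": {1},
}
print(greedy_probe_cover(profiles))  # ['p1', 'p3']
```

Relaxing the hybridization model (more tolerated mismatches, longer cover extension) enlarges each probe's coverage set, so fewer probes are needed. That is the trade-off being tuned when θd is searched under a total probe budget N.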

3. 1,520 genomes from cultivated human gut bacteria enable functional microbiome analyses

1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses

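On the morning of February 5, 2019, a BGI team published the Culturable Genome Reference (CGR), described as the world's largest collection of human gut bacterial genomes, in Nature Biotechnology, a Nature-branded journal. The study provides more than 1,500 high-quality human gut bacterial genomes, adding a large set of new reference genomes for gut microbiome research and taking functional analysis of the gut microbiota to a new level. It is also the first time that so many high-quality bacterial genomes have been obtained through large-scale cultivation. The genome collection and the accompanying strain bank were built under the lead of the metagenomics research team at BGI-Shenzhen (深圳华大生命科学研究院). They are of considerable value for working out the relationship between the gut microbiota and disease, and they provide a basic resource for deeper study of the functions of human gut strains.

For more coverage, see the post 《NBT-2019-华大发布全球最大人体肠道细菌基因组集研究成果》 (NBT 2019: BGI releases the world's largest human gut bacterial genome collection).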

Abstract

Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

Phylogenetic tree of gut bacteria

Fig. 1 | Phylogenetic tree of 1,520 isolated gut bacteria based on whole-genome sequences. The 1,520 high-quality genomes in CGR are classified into 338 species-level clusters (ANI ≥ 95%) based on their whole-genome sequences. Bacterial species from Firmicutes are colored in orange; Bacteroidetes, blue; Proteobacteria, green; Actinobacteria, violet; Fusobacteria, gray. Novel genera and species are highlighted by red and orange branches, respectively. The bar on the outermost layer indicates the number of genomes archived in each cluster. Rhizobium selenitireducens ATCC BAA 1503 was used as an outgroup for phylogenetic analysis.
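
The caption groups genomes into species-level clusters at ANI ≥ 95%. How the clusters were actually computed is not stated here, so the sketch below simply assumes single-linkage grouping over precomputed pairwise ANI values, using a small union-find. The function name, the genome labels, and the ANI values are invented for illustration.

```python
def species_clusters(genomes, ani, threshold=95.0):
    """Group genomes whose pairwise ANI reaches the threshold (single linkage).

    ani maps a pair (genome_a, genome_b) to its ANI in percent.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise ANI values (percent).
pairs = {("g1", "g2"): 98.1, ("g2", "g3"): 96.0, ("g3", "g4"): 80.2}
print(species_clusters(["g1", "g2", "g3", "g4"], pairs))
# [['g1', 'g2', 'g3'], ['g4']]
```

With single linkage, g1 and g3 end up in the same cluster through g2 even though their direct ANI is not given. A stricter scheme such as complete linkage would require every pair in a cluster to pass the threshold.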

4. A human gut bacterial genome and culture collection for improved metagenomic analyses

A human gut bacterial genome and culture collection for improved metagenomic analyses

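The team of Trevor D. Lawley at the Host-Microbiota Interactions Laboratory, Wellcome Sanger Institute, released 737 whole-genome-sequenced bacterial isolates cultivated from the human gastrointestinal tract. The resource increases the number of bacterial genomes from the human gastrointestinal microbiota by 37% and improves taxonomic classification by 61% compared with the HMP genome collection, which supports fast, assembly-free quantification of functional genes in metagenomes. The study resembles the BGI culturomics work summarized above, and the two papers were published back to back in the Resources section of the same NBT issue.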

Abstract

Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses. We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.

Phylogenetic tree of gastrointestinal bacteria

Fig. 1 | Phylogenetic diversity of the human gastrointestinal microbiota genome collection. Maximum-likelihood tree generated using the 40 universal core genes from the 737 HBC genomes (green outer circle) and the 617 high-quality public genomes derived from human gastrointestinal tract samples, which together make up the HGG. Branch color distinguishes bacterial phyla belonging to Actinobacteria (gold; n = 129 genomes), Bacteroidetes (green; n = 231 genomes), Firmicutes (blue; n = 772 genomes), Fusobacteria (black; n = 26 genomes), Synergistetes (pink; n = 2 genomes) and Proteobacteria (orange; n = 194 genomes) shown.

References

Bradley Phelim, den Bakker Henk C, Rocha Eduardo P C, et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol., 2019, 37: 152-159.
Metsky Hayden C, Siddle Katherine J, Gladden-Young Adrianne, et al. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat. Biotechnol., 2019, 37: 160-168.
Zou Yuanqiang, Xue Wenbin, Luo Guangwen, et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol., 2019, 37: 179-185.
Forster Samuel C, Kumar Nitin, Anonye Blessing O, et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol., 2019, 37: 186-192.


To repost this article, please contact the original author for permission and state that it comes from 刘永鑫's ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-3334560-1161560.html
