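Eukaryotic genomes, and plant genomes in particular, tend to combine a high proportion of repetitive sequence, polyploidy and high heterozygosity, all of which make genome assembly much harder. Reads from the earlier Sanger and second-generation sequencing platforms are too short to span complex repeat regions, so the resulting assemblies are highly fragmented, and high-GC regions and intergenic repeats often cannot be recovered. That missing information matters a great deal for large research projects such as ENCODE (the Encyclopedia of DNA Elements).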
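This study used third-generation PacBio single-molecule real-time (SMRT) sequencing to sequence and assemble de novo the whole genome of Oropetium thomaeum, an extremely drought-tolerant plant (sometimes rendered in Chinese as 复活草, "resurrection grass"; corrections and sources for the translation are welcome, thanks). The results were published in Nature on 11 November 2015. The Oropetium thomaeum genome is about 245 Mb. Thanks to the long reads of the PacBio RS II platform, the assembly covers 244 Mb, or about 99.6% of the estimated genome, in only 625 contigs with a contig N50 of 2.4 Mb. Further analysis showed that the assembly is a near-complete sequence map: the gene space is gapless, and so are the telomeres, centromeres, transposable elements and rRNA clusters that draft genomes rarely recover. Oropetium thomaeum has the smallest known grass genome; 43.8% of the sequence is repetitive, and its euchromatic regions are about 30% more compact than those of other grasses. The genome contains 28,466 protein-coding genes.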
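I. Genome characteristics:
Oropetium thomaeum belongs to the subfamily Chloridoideae and is extremely drought-tolerant.
Karyotype: 2n = 2x = 18;
Genome size: 1C = 0.25 pg; flow cytometry estimates the genome at about 250 Mb, and k-mer analysis at about 245 Mb.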
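The step from a 1C value in picograms to a size in megabases is a single multiplication. Here is a minimal Python sketch, assuming the commonly used conversion of 1 pg of DNA = 978 Mb (Doležel et al., 2003); the factor and the function name are not from the blog post.

```python
# Assumed conversion factor (Dolezel et al., 2003): 1 pg of DNA = 978 Mb.
MB_PER_PG = 978.0


def c_value_to_mb(picograms: float) -> float:
    """Convert a 1C DNA content in picograms to a genome size in megabases."""
    return picograms * MB_PER_PG


# The 1C value quoted above for Oropetium thomaeum.
size_mb = c_value_to_mb(0.25)
print(f"1C = 0.25 pg corresponds to about {size_mb:.1f} Mb")
```

With this factor, 0.25 pg falls in the same range as the flow cytometry and k-mer estimates quoted above.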
II. Sample material:
1. Plants were collected from Jodhpur, Rajasthan, India, and propagated.
2. DNA extraction: an optimized version of the high-salt phenol–chloroform purification method [reference for the method: Zhang, H.-B., et al. Preparation of megabase-size DNA from plant nuclei. Plant J. (1995)].
3. RNA extraction.
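III. Sequencing strategy:
1. DNA sequencing on the PacBio RS II platform: P6-C4 chemistry, library insert size of 15-20 kb, 32 SMRT cells, about 72× depth across the genome.
2. DNA sequencing on the Illumina HiSeq platform: three libraries with insert sizes of 570 bp, 1 kb and 3 kb, sequenced to about 200×. The purpose was to estimate the error rate of the third-generation assembly and the heterozygosity of the genome.
3. Genome mapping on the BioNano Irys system: builds a genome map used to anchor and scaffold the contigs. The Irys system produced 53 Gb of data (molecules >100 kb), about 200× genome coverage, with a molecule N50 of 169 kb.
4. RNA sequencing on the Illumina HiSeq platform.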
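The depth figures in this list can be cross-checked with simple arithmetic: depth is total sequenced bases divided by genome size. The sketch below assumes the 245 Mb k-mer estimate as the genome size. The PacBio yield and the per-cell yield are implied values derived here, not numbers reported in the post.

```python
GENOME_SIZE_BP = 245e6  # k-mer based estimate quoted in the text


def depth(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_for_depth(target_depth: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Total bases needed to reach a given average depth."""
    return target_depth * genome_size


# Yield implied by 72x PacBio depth, and the average per SMRT cell (32 cells).
pacbio_bases = bases_for_depth(72)
print(f"PacBio yield implied by 72x: {pacbio_bases / 1e9:.2f} Gb")
print(f"Average per SMRT cell: {pacbio_bases / 32 / 1e9:.2f} Gb")

# Depth implied by the 53 Gb of Irys molecules.
print(f"Irys depth from 53 Gb: {depth(53e9):.0f}x")
```

The Irys figure comes out slightly above 200×, which agrees with the rounded "about 200×" given above.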
a. Histogram of the length distribution of raw P6-C4 chemistry PacBio reads. The mean length of the raw reads is 12,872 bp, and the N50 is 16,485 bp.
b. Genome size estimation using the k-mer distribution of unassembled Oropetium Illumina WGS reads. The k-mer frequency displays a unimodal curve, indicating a low rate of heterozygosity (0.087%) in the Oropetium genome. The frequency distribution suggests a genome size of ~245 Mb, consistent with flow-cytometry-based estimates.
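The idea behind k-mer based size estimation can be shown on toy data: count all k-mers in the reads, find the depth at the main peak of the k-mer histogram, and divide the total number of k-mers by that depth. The sketch below is an illustration only, not the tool or parameters used in the paper. It assumes error-free reads from a single strand of a random simulated genome, so there is no low-depth error peak to exclude, and all function names are hypothetical.

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(reads, k):
    """Genome size ~= total k-mers / depth at the peak of the k-mer histogram."""
    counts = count_kmers(reads, k)
    # histogram maps depth -> number of distinct k-mers seen at that depth
    histogram = Counter(counts.values())
    peak_depth = max(histogram, key=histogram.get)
    return sum(counts.values()) / peak_depth


def simulate_reads(genome, read_len, n_reads, rng):
    """Sample error-free reads uniformly from a genome string."""
    last_start = len(genome) - read_len
    return [genome[s:s + read_len]
            for s in (rng.randint(0, last_start) for _ in range(n_reads))]


# Toy data: a 5 kb random genome sampled at roughly 30x.
rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))
toy_reads = simulate_reads(toy_genome, 100, 1500, rng)
print(f"estimated size: {estimate_genome_size(toy_reads, 21):.0f} bp (true: 5000)")
```

A single clean peak, as in the unimodal curve described in panel b, is what makes the peak depth easy to read off. In a highly heterozygous genome a second peak at half the depth would appear.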
IV. De novo genome assembly:
1. De novo genome assembly: RS_HGAP_Assembly.3 protocol
c. SMRT sequencing raw read, preassembly and assembly statistics.
d, e. Distribution of the contig N50 length (d) and scaffold N50 length (e) of all published plant genomes. The average contig N50 length for published plant genomes is ~50 kb, compared with 2.4 Mb for Oropetium.
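N50 appears throughout this post, for reads, contigs, scaffolds and Irys molecules. As a reminder of what it measures, here is a minimal sketch of the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. The input lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up contig lengths: one long contig dominates the statistic.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
print(n50([1, 1, 1, 1, 10]))
```

Because the statistic is weighted by length, a handful of multi-megabase contigs is enough to lift the N50 far above the typical value for plant genomes cited in the caption.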
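2. Genome polishing: BLASR v1; Quiver in SMRT Analysis v2.3.0;
3. De novo assembly with other software: Falcon and MHAP. Comparing the assemblies from the three assemblers showed that the Falcon and MHAP assemblies had lower sequence contiguity and that the centromeric and telomeric regions they recovered were shorter on average.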
i. Comparison of HGAP, Falcon and MHAP PacBio assemblers.
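4. Genome map construction with the Irys system, used to anchor and scaffold the contigs (53 Gb of data from molecules >100 kb, about 200× genome coverage, molecule N50 of 169 kb). The Irys data were assembled under parameters of different stringency, giving two map sets:
map set 1 has 402 maps with an N50 length of 725 kb and spans 216 Mb;
map set 2 has 214 maps and an N50 of 1.674 Mb.
The two map sets were combined with the PacBio assembly to produce a hybrid scaffold: 535 scaffolds in total, with a scaffold N50 of 7.1 Mb and a final assembly size of 244 Mb.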
Assembly improvement using a BioNano-based genome map from the Irys system.
a, Distribution of molecule size for raw single-molecule genome mapping data. Size of single molecules in nanochannel arrays is plotted.
b, Integration of the genome map with the genome assembly. Overlap between the PacBio-based contigs and the genome map. Each line shows a single PacBio contig in green; genome maps are shown in light blue.
5. Illumina short reads were aligned back to the final assembly to verify its accuracy and to assess heterozygosity:
5.1. Aligner: BWA mem (v. 0.7.12-r1039)
5.2. Duplicate alignments: Picard tools v.1.104 MarkDuplicates (http://broadinstitute.github.io/picard/)
5.3. Variant calling: Genome Analysis Toolkit (GATK, v.3.3.0) IndelRealigner and HaplotypeCaller.
h, Estimated accuracy of SMRT PacBio assembly and within-genome heterozygosity.
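One common way to turn such variant calls into the two numbers in panel h is sketched below. It rests on an assumption that the post does not spell out: since the reads come from the same individual as the assembly, heterozygous calls are read as true heterozygosity, and homozygous-alternate calls as likely assembly errors. The site counts are hypothetical. The heterozygous count is simply chosen to reproduce the 0.087% rate quoted earlier on a 244 Mb assembly, and the error count is made up.

```python
import math


def heterozygosity_percent(het_sites: int, callable_bases: int) -> float:
    """Share of callable positions carrying a heterozygous variant, in percent."""
    return 100.0 * het_sites / callable_bases


def phred_qv(error_sites: int, callable_bases: int) -> float:
    """Phred-scaled consensus quality: -10 * log10(error rate)."""
    if error_sites == 0:
        return math.inf
    return -10.0 * math.log10(error_sites / callable_bases)


# Hypothetical counts for illustration only.
callable_bases = 244_000_000  # final assembly size from the text
het_sites = 212_280           # chosen to match the quoted 0.087% rate
hom_alt_sites = 244           # made-up number of candidate assembly errors

print(f"heterozygosity: {heterozygosity_percent(het_sites, callable_bases):.3f}%")
print(f"assembly QV: {phred_qv(hom_alt_sites, callable_bases):.1f}")
```

In practice the denominator would be the number of positions with enough read depth to call variants, not the whole assembly length.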
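6. Repeat annotation: REPET v.2.2 packages TEdenovo and TEannot
7. Centromere and telomere sequence analysis: Tandem Repeats Finder (TRF, version 4.07b)
8. Transcriptome assembly and analysis: de novo transcriptome assembly with Trinity (v.r20140717); transcripts aligned to the final assembly with NCBI blastn v.2.2.30+.
9. Gene annotation: MAKER v2.31.8 (http://www.yandell-lab.org/software/maker.html).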
10. Synteny and comparative genomics: large-scale alignment tool (LAST)
Genome data sets from Setaria, Sorghum, rice and Brachypodium were downloaded from Phytozome (version 9.1) and subjected to pairwise genome alignments against the Oropetium genome.
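11. Gene interaction network: built from the homology between genes in the Oropetium genome and Arabidopsis genes. The resulting network has 4,421 nodes (gene products) with 36,918 edges (interactions). It covers most of the metabolic pathway information in the Oropetium genome, such as photosynthesis, the major anabolic and catabolic processes, and stress-response pathways.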
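A homology-based network of this kind can be pictured as copying each known Arabidopsis interaction onto the Oropetium genes that map to the two interacting partners. The sketch below shows that transfer step under this assumption. It is not the authors' pipeline, and all gene identifiers are hypothetical placeholders.

```python
def transfer_network(orthologs, reference_edges):
    """Project reference interactions onto target genes via an ortholog map.

    orthologs: dict mapping target gene -> reference gene
    reference_edges: iterable of (reference_gene_a, reference_gene_b) pairs
    Returns (nodes, edges) for the target species.
    """
    # Invert the map: one reference gene may have several target homologs.
    targets_by_reference = {}
    for target, reference in orthologs.items():
        targets_by_reference.setdefault(reference, []).append(target)

    edges = set()
    for ref_a, ref_b in reference_edges:
        for gene_a in targets_by_reference.get(ref_a, []):
            for gene_b in targets_by_reference.get(ref_b, []):
                if gene_a != gene_b:
                    # Store undirected edges in a canonical order.
                    edges.add(tuple(sorted((gene_a, gene_b))))

    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges


# Hypothetical identifiers: Oro* for Oropetium, AT* for Arabidopsis.
orthologs = {"Oro1": "AT1", "Oro2": "AT2", "Oro3": "AT2", "Oro4": "AT9"}
arabidopsis_edges = [("AT1", "AT2"), ("AT2", "AT3")]
nodes, edges = transfer_network(orthologs, arabidopsis_edges)
print(len(nodes), "nodes,", len(edges), "edges")
```

Genes without a homolog in the reference network, or whose homolog has no recorded partner with a homolog of its own, drop out. This is one reason such a network has far fewer nodes than the genome has genes.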
Appendix: author information:
Robert VanBuren*, Donald Danforth Plant Science Center, St Louis, Missouri 63132, USA.
Doug Bryant*, Donald Danforth Plant Science Center, St Louis, Missouri 63132, USA.
Patrick P. Edger,
Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California 94720, USA.
Department of Horticulture, Michigan State University, East Lansing, Michigan 48823, USA.
Haibao Tang, iPlant Collaborative, School of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA.
Diane Burgess, Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California 94720, USA.
Dinakar Challabathula†, IMBIO, University of Bonn, Kirschallee 1, D-53115 Bonn, Germany.
[†Present address: Department of Life Sciences, School of Basic and Applied Sciences, Central University of Tamil Nadu, Thiruvarur 610101, India]
Kristi Spittle, Pacific Biosciences, Menlo Park, California 94025, USA.
Richard Hall, Pacific Biosciences, Menlo Park, California 94025, USA.
Jenny Gu, Pacific Biosciences, Menlo Park, California 94025, USA.
Eric Lyons, iPlant Collaborative, School of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA.
Michael Freeling, Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, California 94720, USA.
Dorothea Bartels, IMBIO, University of Bonn, Kirschallee 1, D-53115 Bonn, Germany.
Boudewijn ten Hallers, BioNano Genomics, San Diego, California 92121, USA.
Alex Hastie, BioNano Genomics, San Diego, California 92121, USA.
Todd P. Michael, Ibis Biosciences, Carlsbad,
Todd C. Mockler, Donald Danforth Plant Science Center, St Louis, Missouri 63132, USA.
*These authors contributed equally to this work.
Paper download: Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.pdf
SI download: SI-Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.pdf
ttwu@macrogencn.com