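In the two decades since the Human Genome Project published its draft sequence, the human genome assembly has become steadily more complete. Progress has been especially fast in recent years thanks to third-generation sequencing, and new assembly versions keep appearing. For example, in the recent bioRxiv preprint "The complete sequence of a human genome", the Telomere-to-Telomere (T2T) consortium brought all 22 autosomes and the X chromosome of the human genome to a gapless state. Further revisions will follow, but to have a stable (if less complete) reference, researchers settled on a single version back in 2009: hg19, the subject of this post (named GRCh37 in the NCBI release). A large number of papers still use this assembly. In particular, newly developed techniques are often first analyzed against it so that results can be compared with earlier methods; for example, the promoter capture Hi-C probes were designed on hg19. Obtaining the hg19 sequence and its matching annotation therefore remains very important.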
1. NCBI RefSeq
The hg19/GRCh37 assembly and its RefSeq gene annotation can be downloaded from NCBI, starting from this page:
https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml
The page indicates that this is the RefSeq release. The direct download links are:
https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz
https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.fna.gz
Alternatively, the NCBI FTP directory for GCF_000001405.25_GRCh37.p13:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/
provides these files:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
I compared GRCh37_latest_genomic.gff.gz with GCF_000001405.25_GRCh37.p13_genomic.gff.gz, and GRCh37_latest_genomic.fna.gz with GCF_000001405.25_GRCh37.p13_genomic.fna.gz. Each pair has identical MD5 checksums, so the two locations serve the same files.
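The MD5 comparison described above could be reproduced with a short script along these lines. This is a minimal sketch, not the author's actual procedure; the helper names `md5sum` and `same_file` are made up for illustration, and the files are assumed to have been downloaded already.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True when both files have identical MD5 digests."""
    return md5sum(path_a) == md5sum(path_b)
```

For example, `same_file("GRCh37_latest_genomic.gff.gz", "GCF_000001405.25_GRCh37.p13_genomic.gff.gz")` would be expected to return True if the two downloads are byte-identical.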
2. UCSC goldenPath
hg19 can also be downloaded from UCSC goldenPath:
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
On comparison, the chromosome sequences in hg19.fa.gz are identical to the chromosome sequences in GCF_000001405.25_GRCh37.p13_genomic.fna.gz.
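One way to check that two FASTA files carry the same chromosome sequences is to compute a per-chromosome checksum and compare the results. The sketch below is only an illustration under stated assumptions, not the method used in this post: it assumes RefSeq chromosome accessions follow the pattern NC_0000NN.V (NN = 01 to 22 for autosomes, 23 for X, 24 for Y), it skips everything else (unplaced contigs, patches, the mitochondrion), and it upper-cases the sequence so that soft-masking and line-wrapping differences do not matter. The function names are hypothetical.

```python
import gzip
import hashlib
import re

# Assumption: RefSeq chromosome accessions look like NC_0000NN.V.
_REFSEQ_CHROM = re.compile(r"^NC_0000(\d{2})\.\d+$")


def to_ucsc_name(name):
    """Map a sequence name to a UCSC-style chromosome name, or None to skip it."""
    if re.fullmatch(r"chr(\d{1,2}|X|Y)", name):
        return name
    match = _REFSEQ_CHROM.match(name)
    if not match:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"chr{number}"
    return {23: "chrX", 24: "chrY"}.get(number)


def chromosome_digests(path):
    """Return {chromosome: MD5 of the upper-cased sequence} for chr1-22, X, Y."""
    opener = gzip.open if str(path).endswith(".gz") else open
    digests = {}
    current = None
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                name = to_ucsc_name(fields[0]) if fields else None
                current = hashlib.md5() if name else None
                if current is not None:
                    digests[name] = current
            elif current is not None:
                # Strip newlines and ignore case (soft-masked repeats).
                current.update(line.strip().upper().encode("ascii"))
    return {name: d.hexdigest() for name, d in digests.items()}
```

Two files agree on their chromosomes when `chromosome_digests(file_a) == chromosome_digests(file_b)`.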
Several annotation versions are available for it, but all of them are in GTF format:
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ensGene.gtf.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
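Because UCSC distributes GTF while the NCBI and GENCODE files above are GFF3, the ninth (attribute) column has to be parsed differently: GTF uses `key "value";` pairs, GFF3 uses `key=value;` pairs with URL-style escaping. The following is a simplified sketch of both parsers. The function names are hypothetical, repeated GTF keys keep only their first value, and quoted values that themselves contain a semicolon are not handled.

```python
from urllib.parse import unquote


def parse_gtf_attributes(column9):
    """Parse a GTF attribute column such as: gene_id "X"; level 2;"""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(" ")
        # Keep the first value when a key (for example "tag") repeats.
        attrs.setdefault(key, value.strip().strip('"'))
    return attrs


def parse_gff3_attributes(column9):
    """Parse a GFF3 attribute column such as: ID=gene-CD99;Name=CD99"""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, _, value = field.partition("=")
            # GFF3 percent-encodes reserved characters inside values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs
```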
3. ENCODE GENCODE
GENCODE is the human genome annotation developed specifically for the ENCODE project. It can be downloaded from:
https://www.gencodegenes.org/human/release_19.html
The direct link to the hg19 genome is:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz
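On comparison, the chromosome sequences in this genome are identical to the chromosome sequences in GCF_000001405.25_GRCh37.p13_genomic.fna.gz; only the sequence names differ, here beginning with chr.
The matching annotation is release v19, available at: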
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
The ENCODE portal also provides a download link:
https://www.encodeproject.org/files/gencode.v19.annotation/@@download/gencode.v19.annotation.gtf.gz
4. GENCODE vs. RefSeq
At the time of writing, the GENCODE annotation offered for this assembly is release_39lift37 (https://www.gencodegenes.org/human/release_39lift37.html). It comes in two flavors: a comprehensive set (Comprehensive gene annotation) and a basic set (Basic gene annotation). The latter is a subset of the former: it covers only the chromosomes and, for protein-coding genes, keeps the full-length protein-coding transcripts.
Adam Frankish of the Sanger Institute (which takes part in GENCODE) even wrote a paper specifically comparing the GENCODE v21 annotation with RefSeq:
Adam Frankish. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. 2015. BMC Genomics.
The paper concludes:
The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
The paper also notes:
(1) GENCODE is the default annotation used by the Ensembl project, and the terms ‘Ensembl annotation’ and ‘GENCODE annotation’ are thus synonymous when referring to human.
(2) GENCODE has a higher proportion of manually annotated gene models than RefSeq and includes more novel splicing features.
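In short, according to this paper the GENCODE annotation is more comprehensive and more heavily manually curated than RefSeq, particularly for alternative splicing.
I also compared the number of annotation records on the chromosomes in GCF_000001405.25_GRCh37.p13_genomic.gff.gz and gencode.v19.annotation.gff3.gz: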
Feature | NCBI RefSeq | GENCODE v19 |
gene | 27955 | 57820 (KNOWN=39995; NOVEL=17544; PUTATIVE=281) |
transcript | mRNA=60617; transcript=8214 | 196520 |
CDS | 651848 | 724078 |
exon | 846661 | 1196293 |
lnc_RNA | 7276 | - |
miRNA | 2851 | - |
enhancer | 1914 | - |
total records | 1675441 | 2615566 |
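Counts of this kind can be obtained by tallying the third column (feature type) of each annotation file. The post does not say how its numbers were produced, so the sketch below is only one plausible way to do it. The function name `count_features` and the optional `keep_seqid` filter (to restrict the tally to chromosomes) are assumptions for illustration, and comment lines are excluded from the total here, which may or may not match how "total records" was counted above.

```python
import gzip
from collections import Counter


def count_features(path, keep_seqid=None):
    """Tally column 3 of a GFF3/GTF file.

    keep_seqid: optional predicate on column 1, e.g. to keep chromosomes only.
    Returns (Counter of feature types, total number of records counted).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a well-formed annotation record
            if keep_seqid is not None and not keep_seqid(fields[0]):
                continue
            counts[fields[2]] += 1
    return counts, sum(counts.values())
```

For a GENCODE file, a filter such as `keep_seqid=lambda s: "_" not in s` would be one hypothetical way to drop non-chromosome sequences; the right filter depends on the naming used in each file.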
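The table above lists only a few common feature types; "total records" is the number of records (lines) in each GFF file. GENCODE v19 has more genes, transcripts, CDS and exons than RefSeq, although the margin for CDS is modest (about 1.1 times), and its total record count is 1.56 times that of RefSeq. RefSeq does annotate additional feature types such as miRNA, primary_transcript and tRNA; there are many such types, but each has relatively few entries. I then overlapped the KNOWN genes in GENCODE v19 with the genes in RefSeq.
Unexpectedly, this showed that GENCODE contains duplicated genes: one gene name can carry two annotation records. For example, ZNF8 has both a HAVANA and an ENSEMBL record, whose positions differ by only 1 bp.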
chr19 HAVANA gene 58790317 58807254 . + . ID=ENSG00000083842.8;gene_id=ENSG00000083842.8;transcript_id=ENSG00000083842.8;gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF8;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ZNF8;level=2;havana_gene=OTTHUMG00000182073.1
chr19 ENSEMBL gene 58790318 58807254 . + . ID=ENSG00000273439.1;gene_id=ENSG00000273439.1;transcript_id=ENSG00000273439.1;gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF8;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ZNF8;level=3
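Duplicated genes also appear in RefSeq. For example, CD99, which lies in a pseudoautosomal region, is annotated on both chrX and chrY (there are 25 such genes in total). The same happens in GENCODE (33 such genes, 22 of which have the same names as in RefSeq).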
chrX BestRefSeq gene 2609336 2659350 . + . ID=gene-CD99;Dbxref=GeneID:4267,HGNC:HGNC:7082,MIM:450000;Name=CD99;description=CD99 molecule (Xg blood group);gbkey=Gene;gene=CD99;gene_biotype=protein_coding;gene_synonym=HBA71,MIC2,MIC2X,MIC2Y,MSK5X
chrY BestRefSeq gene 2559336 2609350 . + . ID=gene-CD99-2;Dbxref=GeneID:4267,HGNC:HGNC:7082,MIM:450000;Name=CD99;description=CD99 molecule (Xg blood group);gbkey=Gene;gene=CD99;gene_biotype=protein_coding;gene_synonym=HBA71,MIC2,MIC2X,MIC2Y,MSK5X
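Duplicates like the ZNF8 and CD99 records above can be found by grouping `gene` records by name. The sketch below assumes tab-separated GFF3 input and takes the attribute that holds the gene name as a parameter (`gene_name` in the GENCODE records shown here, `Name` in the RefSeq ones). The function names are hypothetical and this is not necessarily how the counts in this post were derived.

```python
from collections import defaultdict


def gene_locations(gff3_lines, name_key):
    """Map gene name -> list of (seqid, source, start, end) for 'gene' records."""
    locations = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        name = attrs.get(name_key)
        if name:
            locations[name].append((fields[0], fields[1], int(fields[3]), int(fields[4])))
    return locations


def duplicated_genes(locations):
    """Gene names that have more than one 'gene' record."""
    return {name: recs for name, recs in locations.items() if len(recs) > 1}


def multi_chromosome_genes(locations):
    """Gene names annotated on more than one sequence (e.g. chrX and chrY)."""
    return {name: recs for name, recs in locations.items()
            if len({rec[0] for rec in recs}) > 1}
```

Genes returned by `duplicated_genes` but not by `multi_chromosome_genes` are same-chromosome duplicates of the ZNF8 kind.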
5. Preferences in the literature
Among the papers I surveyed, those in the main CNS journals (Cell, Nature, Science) tend to use the GENCODE annotation:
[1] Zhaokui Cai. RIC-seq for global in situ profiling of RNA–RNA spatial interactions. 2020. Nature. (the RIC-seq paper; GENCODE v19)
[2] Hyejung Won. Chromosome conformation elucidates regulatory relationships in developing human brain. 2016. Nature. (GENCODE v19)
[3] Fabian Grubert. Landscape of cohesin-mediated chromatin loops in the human genome. 2020. Nature. (GENCODE v25, lifted to GRCh37 coordinates)
[4] Yan Li. The structural basis for cohesin–CTCF-anchored loops. 2019. Nature. (GENCODE v19)
[5] O. Delaneau. Chromatin three-dimensional interactions mediate genetic effects on gene expression. 2019. Science. (GENCODE v19)
[6] Silvia Domcke. A human cell atlas of fetal chromatin accessibility. 2020. Science. (gencode.v29lift37)
By comparison, relatively few papers in the main CNS journals use the NCBI RefSeq annotation:
[1] James H. Sun. Disease-Associated Short Tandem Repeats Colocalize with Chromatin Domain Boundaries. 2018. Cell
There are more of them in the sister journals, however:
[1] Jesse R. Dixon. Integrative detection and analysis of structural variation in cancer genomes. 2018. Nature Genetics
[2] Xiaotian Zhang. Large DNA Methylation Nadirs Anchor Chromatin Loops Maintaining Hematopoietic Stem Cell Identity. 2020. Molecular Cell
[3] Pengze Wu. 3D genome of multiple myeloma reveals spatial genome disorganization associated with copy number variations. 2017. Nature Communications
Note that the first PCHi-C paper used RefSeq, and most papers that build on its probes use RefSeq as well:
[1] Borbala Mifsud. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. 2015. Nature Genetics. (the first PCHi-C paper)
[2] Joseph S. Baxter. Capture Hi-C identifies putative target genes at 33 breast cancer risk loci. 2018. Nature Communications
[3] Nan Zhang. Muscle progenitor specification and myogenic differentiation are associated with changes in chromatin topology. 2020. Nature Communications
There are exceptions, for example:
[1] Inkyung Jung. A compendium of promoter-centered long-range chromatin interactions in the human genome. 2019. Nature Genetics. (custom-designed probes, based on GENCODE v19 annotation at confidence levels 1 and 2)
[2] Michael Song. Mapping cis-regulatory chromatin contacts in neural cells links neuropsychiatric disorder risk variants to target genes. 2019. Nature Genetics. (custom-designed probes, based on GENCODE v19)
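Summary:
(1) The chromosome sequences of the hg19 genome are identical across all the download sites listed above;
(2) The mainstream annotations are RefSeq and GENCODE; for the latter, GENCODE v19 is the most widely accepted release;
(3) Overall, GENCODE annotates more content than RefSeq, including genes, transcripts, exons and CDS;
(4) Whether you download the GENCODE or the RefSeq annotation, watch out for gene names that repeat across chromosomes, and for GENCODE gene records whose positions overlap. If a gene has more than one annotation record, I suggest keeping the one with the best level (in GENCODE, level 1 is verified, 2 is manually annotated and 3 is automatically annotated, so the smallest number);
(5) The literature searched for this post leans toward 3D genomics, a field in which ENCODE is a major project, so the search may be biased toward papers that use GENCODE.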
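The suggestion in point (4) can be sketched as a small filter. It assumes that the preferred record is the one with the smallest GENCODE `level` number (1 = verified, 2 = manually annotated, 3 = automatically annotated), and that each record has already been parsed into a dict with `seqid`, `gene_name` and `level` keys; both the record layout and the function name are hypothetical. Records are grouped per sequence, so copies on chrX and chrY are kept separately.

```python
def pick_by_level(records):
    """Keep one record per (seqid, gene_name): the one with the smallest level."""
    best = {}
    for record in records:
        key = (record["seqid"], record["gene_name"])
        # A smaller GENCODE level number means stronger supporting evidence.
        if key not in best or int(record["level"]) < int(best[key]["level"]):
            best[key] = record
    return list(best.values())
```

Applied to the two ZNF8 records shown earlier, this would keep the HAVANA record (level 2) and drop the ENSEMBL one (level 3).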