luria's personal blog http://blog.sciencenet.cn/u/luria


The human genome hg19 and its annotation

Posted 2022-2-25 09:04 | Category: Genome | Section: Research notes

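In the decades since the Human Genome Project was proposed, the human genome sequence has become steadily more complete. In recent years in particular, third-generation (long-read) sequencing has driven a series of new assemblies. For example, in a recent bioRxiv preprint, "The complete sequence of a human genome", the Telomere-to-Telomere (T2T) consortium brought the 22 autosomes and the X chromosome to a gapless state. More improvements will follow, but to have a stable (if less complete) reference, researchers settled on a common version back in 2009: hg19, the subject of this post (named GRCh37 in the NCBI version). A large number of papers still use this assembly. New techniques in particular are often first analysed against it so that results can be compared with earlier methods; for example, the probes for promoter capture Hi-C were designed on hg19. Obtaining the hg19 sequence and its matching annotation therefore remains important.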


1. NCBI RefSeq

The RefSeq gene annotation for hg19/GRCh37 can be downloaded from NCBI, starting from this page:

https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml

The page shows that this is the RefSeq version. The direct download links are:

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.fna.gz


In addition, under the NCBI FTP path for GCF_000001405.25_GRCh37.p13:

https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/

the following files can be downloaded:

https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz

https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

Comparing GRCh37_latest_genomic.gff.gz with GCF_000001405.25_GRCh37.p13_genomic.gff.gz, and GRCh37_latest_genomic.fna.gz with GCF_000001405.25_GRCh37.p13_genomic.fna.gz, shows that each pair has the same MD5 checksum, so the two locations serve the same files.
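The checksum comparison can be sketched in Python as follows. This is only an illustration, assuming the four files have already been downloaded into the working directory; the helper names `md5sum` and `same_file` are made up for this example, and the post does not say which tool was actually used.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True when the two files have identical MD5 digests."""
    return md5sum(path_a) == md5sum(path_b)
```

For example, `same_file("GRCh37_latest_genomic.gff.gz", "GCF_000001405.25_GRCh37.p13_genomic.gff.gz")` would be expected to return True if the two downloads are indeed byte-identical.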


2. UCSC goldenPath

hg19 can also be downloaded from UCSC goldenPath:

https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

On comparison, the chromosome sequences in hg19.fa.gz are identical to the chromosome sequences in GCF_000001405.25_GRCh37.p13_genomic.fna.gz.
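One possible way to make such a comparison is to hash every sequence in each FASTA file and match the digests. The sketch below is an assumption about how this could be done, not the method used in this post. The function names are hypothetical, and bases are uppercased before hashing on the assumption that soft-masking (lowercase repeats) should be ignored.

```python
import gzip
import hashlib


def sequence_digests(fasta_gz_path):
    """Map each sequence name in a gzipped FASTA to the MD5 of its uppercased bases."""
    digests = {}
    name, digest = None, None
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    digests[name] = digest.hexdigest()
                name = line[1:].split()[0]  # keep only the ID before the first space
                digest = hashlib.md5()
            elif name is not None:
                # Uppercase so that soft-masked regions compare equal (an assumption).
                digest.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = digest.hexdigest()
    return digests


def shared_sequences(digests_a, digests_b):
    """Pair names from file A with names from file B that carry the same sequence digest."""
    by_digest = {}
    for name, value in digests_b.items():
        by_digest.setdefault(value, []).append(name)
    return {name: by_digest[value] for name, value in digests_a.items() if value in by_digest}
```

Matching on digests rather than on names sidesteps the naming difference between the files (chr-style names in hg19.fa.gz versus accession-style names in the NCBI file), and it is independent of how the sequence lines are wrapped.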

Several versions of the matching annotation are available, all in GTF format:

https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ensGene.gtf.gz

https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz


3. ENCODE GENCODE

This is the human genome annotation developed specifically for the ENCODE project. It can be downloaded from:

https://www.gencodegenes.org/human/release_19.html

The hg19 genome itself is at:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz

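On comparison, the chromosome sequences in this genome are identical to those in GCF_000001405.25_GRCh37.p13_genomic.fna.gz, except that the sequence names are changed to start with chr.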

The matching annotation is release v19, available at:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz


The ENCODE portal also provides a download link:

https://www.encodeproject.org/files/gencode.v19.annotation/@@download/gencode.v19.annotation.gtf.gz


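4. GENCODE vs. RefSeq

At the time of writing, the GENCODE annotation offered for this assembly is release_39lift37 (https://www.gencodegenes.org/human/release_39lift37.html). It comes in two sets: a Comprehensive gene annotation and a Basic gene annotation. The latter is a subset of the former, covering only the reference chromosomes and, for protein-coding genes, only the full-length protein-coding transcripts.

Adam Frankish of the Sanger Institute even wrote a paper dedicated to comparing the GENCODE v21 annotation with RefSeq (the Sanger Institute is a GENCODE participant):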

Adam Frankish. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. 2015. BMC Genomics.


The paper concludes:

The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.


The paper also notes:

(1) GENCODE is the default annotation used by the Ensembl project, and the terms ‘Ensembl annotation’ and ‘GENCODE annotation’ are thus synonymous when referring to human.

(2) GENCODE has a higher proportion of manually annotated gene models than RefSeq and includes more novel splicing features.

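In short, according to this paper the GENCODE annotation is more complete than RefSeq and relies more on manual curation, particularly for alternative splicing.


I also compared the number of annotation records on the chromosomes in GCF_000001405.25_GRCh37.p13_genomic.gff.gz and gencode.v19.annotation.gff3.gz: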


Feature        NCBI RefSeq                    GENCODE v19
gene           27955                          57820 (KNOWN=39995; NOVEL=17544; PUTATIVE=281)
transcript     mRNA=60617; transcript=8214    196520
CDS            651848                         724078
exon           846661                         1196293
lnc_RNA        7276                           -
miRNA          2851                           -
enhancer       1914                           -
total record   1675441                        2615566
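Counts of this kind can be obtained by tallying the feature type in column 3 of each GFF file. The following is a minimal sketch, not the script used for the table: the function name and the optional `keep_seqid` filter are hypothetical, and comment lines are skipped, so the totals it yields may not match the table exactly if the original count included them.

```python
import gzip
from collections import Counter


def count_features(gff_gz_path, keep_seqid=None):
    """Count records per feature type (column 3) in a gzipped GFF/GFF3 file.

    keep_seqid is an optional predicate on column 1, for example to keep
    only records located on the main chromosomes.
    Returns (Counter of feature types, total number of counted records).
    """
    counts = Counter()
    total = 0
    with gzip.open(gff_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header/comment and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a well-formed 9-column record
            if keep_seqid is not None and not keep_seqid(fields[0]):
                continue
            counts[fields[2]] += 1
            total += 1
    return counts, total
```

A call such as `count_features("gencode.v19.annotation.gff3.gz", keep_seqid=lambda s: s.startswith("chr"))` would then give per-feature counts restricted to sequences whose names start with chr.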


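The table lists only a few common features; "total record" is the number of all records (lines) in the GFF file. By count, GENCODE v19 has more gene, transcript, CDS and exon records than RefSeq (about twice as many genes), and about 1.56 times as many records overall. RefSeq does annotate additional feature types such as miRNA, primary_transcript and tRNA; there are many such types, but each has relatively few records. I then overlapped the KNOWN genes of GENCODE v19 with the genes in RefSeq:

[Figure: overlap between GENCODE v19 KNOWN genes and RefSeq genes]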


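Unexpectedly, this revealed that some genes are duplicated in GENCODE: a single gene can have two annotation records. ZNF8, for example, has both a HAVANA and an ENSEMBL record, and their positions differ by only 1 bp.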

chr19 HAVANA    gene   58790317    58807254    .         +        . ID=ENSG00000083842.8;gene_id=ENSG00000083842.8;transcript_id=ENSG00000083842.8;gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF8;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ZNF8;level=2;havana_gene=OTTHUMG00000182073.1

chr19 ENSEMBL  gene   58790318    58807254    .         +        . ID=ENSG00000273439.1;gene_id=ENSG00000273439.1;transcript_id=ENSG00000273439.1;gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF8;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=ZNF8;level=3
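One way to resolve such duplicates is to keep, for each gene name on a given sequence, the record with the best GENCODE confidence level. The sketch below assumes the GENCODE convention in which level 1 is validated, level 2 is manually annotated and level 3 is automatically annotated, so the numerically smallest level wins. The function names are hypothetical, and the `gene_name` and `level` attribute keys are taken from the records shown above.

```python
def attribute(column9, key):
    """Return the value of one key from a GFF3 attribute column, or None if absent."""
    for item in column9.strip().split(";"):
        k, sep, v = item.partition("=")
        if sep and k.strip() == key:
            return v.strip()
    return None


def best_gene_records(gff3_lines):
    """Keep, per (seqid, gene_name), the gene record with the smallest level value."""
    best = {}
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        name = attribute(fields[8], "gene_name")
        level = attribute(fields[8], "level")
        if name is None or level is None:
            continue
        key = (fields[0], name)  # keyed by sequence too, so chrX/chrY copies both survive
        if key not in best or int(level) < best[key][0]:
            best[key] = (int(level), fields)
    return {key: fields for key, (level, fields) in best.items()}
```

Applied to the two ZNF8 records above, this would keep the HAVANA record (level=2) and drop the ENSEMBL one (level=3).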


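RefSeq also contains duplicated genes. For example, CD99, which lies in the pseudoautosomal region shared by the X and Y chromosomes, appears on both chrX and chrY (25 genes in total behave this way). The same happens in GENCODE (33 such genes, 22 of which have the same gene name as in RefSeq).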

chrX  BestRefSeq  gene   2609336      2659350      .         +        .          ID=gene-CD99;Dbxref=GeneID:4267,HGNC:HGNC:7082,MIM:450000;Name=CD99;description=CD99 molecule (Xg blood group);gbkey=Gene;gene=CD99;gene_biotype=protein_coding;gene_synonym=HBA71,MIC2,MIC2X,MIC2Y,MSK5X

chrY  BestRefSeq  gene   2559336      2609350      .         +        .          ID=gene-CD99-2;Dbxref=GeneID:4267,HGNC:HGNC:7082,MIM:450000;Name=CD99;description=CD99 molecule (Xg blood group);gbkey=Gene;gene=CD99;gene_biotype=protein_coding;gene_synonym=HBA71,MIC2,MIC2X,MIC2Y,MSK5X
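Duplicates of both kinds (two records for one locus, or one name on two chromosomes) can be listed by grouping the gene records by name. This is a minimal sketch under stated assumptions: the function names are hypothetical, the file is GFF3 with `key=value` attributes, and the gene name is read from `gene_name` (as in the GENCODE records above) or `Name` (as in the RefSeq records above).

```python
import gzip
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value pairs separated by ';') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def duplicated_gene_names(gff3_gz_path, name_keys=("gene_name", "Name")):
    """Return {gene name: [(seqid, start, end, source), ...]} for names with >1 gene record."""
    records = defaultdict(list)
    with gzip.open(gff3_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "gene":
                continue
            attrs = parse_attributes(fields[8])
            name = next((attrs[k] for k in name_keys if k in attrs), None)
            if name is None:
                continue
            records[name].append((fields[0], int(fields[3]), int(fields[4]), fields[1]))
    return {name: hits for name, hits in records.items() if len(hits) > 1}
```

Names whose hits sit on different sequences correspond to cases like CD99, while names with near-identical coordinates on one sequence correspond to cases like ZNF8.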


5. Preferences in the literature

Articles in the main CNS journals (Cell, Nature, Science) tend to use the GENCODE annotation:

[1] Zhaokui Cai. RIC-seq for global in situ profiling of RNA–RNA spatial interactions. 2020. Nature (the RIC-seq paper) (GENCODE v19)

[2] Hyejung Won. Chromosome conformation elucidates regulatory relationships in developing human brain. 2016. Nature (GENCODE v19)

[3] Fabian Grubert. Landscape of cohesin-mediated chromatin loops in the human genome. 2020. Nature. (GENCODE v25, lifted to GRCh37 coordinates)

[4] Yan Li. The structural basis for cohesin–CTCF-anchored loops. 2019. Nature. (GENCODE v19)

[5] O. Delaneau. Chromatin three-dimensional interactions mediate genetic effects on gene expression. Science. 2019 (GENCODE v19)

[6] Silvia Domcke. A human cell atlas of fetal chromatin accessibility. Science. 2020 (gencode.v29lift37)


By comparison, relatively few articles in the main CNS journals use the NCBI RefSeq annotation:

[1] James H. Sun. Disease-Associated Short Tandem Repeats Colocalize with Chromatin Domain Boundaries. 2018. Cell

It is more common in their sister journals, however:

[1] Jesse R. Dixon. Integrative detection and analysis of structural variation in cancer genomes. 2018. Nature Genetics

[2] Xiaotian Zhang. Large DNA Methylation Nadirs Anchor Chromatin Loops Maintaining Hematopoietic Stem Cell Identity. 2020. Molecular Cell

[3] Pengze Wu. 3D genome of multiple myeloma reveals spatial genome disorganization associated with copy number variations. 2017. Nature Communications


Note that the first PCHi-C paper used RefSeq, and papers that reuse its probes mostly use RefSeq as well:

[1] Borbala Mifsud. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. 2015. Nature Genetics (the first PCHi-C paper)

[2] Joseph S. Baxter. Capture Hi-C identifies putative target genes at 33 breast cancer risk loci. 2018. Nature Communications

[3] Nan Zhang. Muscle progenitor specification and myogenic differentiation are associated with changes in chromatin topology. 2020. Nature Communications

There are exceptions, for example:

[1] Inkyung Jung. A compendium of promoter-centered long-range chromatin interactions in the human genome. 2019. Nature Genetics (custom-designed probes, using GENCODE v19 annotation at confidence levels 1 and 2)

[2] Michael Song. Mapping cis-regulatory chromatin contacts in neural cells links neuropsychiatric disorder risk variants to target genes. 2019. Nature Genetics (custom-designed probes, using GENCODE v19)

 

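Summary:

(1) The chromosome sequences of the hg19 genome are identical across all the download sites listed above;

(2) The mainstream annotations are RefSeq and GENCODE; for the latter, GENCODE v19 is the most widely accepted release;

(3) Overall, GENCODE annotates more content than RefSeq, including gene, transcript, exon and CDS records;

(4) Whether you download the GENCODE or the RefSeq annotation, watch for gene names that are repeated on different chromosomes, and in GENCODE for gene annotations whose positions overlap. If a gene has more than one annotation record, prefer the one with the highest confidence level (in GENCODE, level 1 is validated, 2 is manually annotated and 3 is automatically annotated, so this is the numerically smallest level);

(5) The literature search in this post leans toward 3D genomics, a field in which ENCODE is a major project, so papers using GENCODE may be over-represented.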




https://blog.sciencenet.cn/blog-2970729-1326911.html
