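The number of annotations in non-coding regions of the genome keeps growing. So how many non-coding genes and how many pseudogenes have actually been annotated?
Here I take the latest human and mouse annotations in the GENCODE database as examples. In the mouse annotation (GENCODE M19), about 54,400 genes are annotated in total: roughly 22,000 protein-coding genes, and roughly 13,000 each of lncRNA genes and pseudogenes. In the human annotation (GENCODE V29), about 58,700 genes are annotated in total: roughly 20,000 protein-coding genes, about 16,000 lncRNA genes, and about 14,700 pseudogenes. The exact counts for each category are as follows: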
GENCODE M19 (mouse) | | GENCODE V29 (human) | |
TYPE | COUNT | TYPE | COUNT |
Total No of Genes | 54446 | Total No of Genes | 58721 |
Protein-coding genes | 21969 | Protein-coding genes | 19940 |
Long non-coding RNA genes | 12840 | Long non-coding RNA genes | 16066 |
Small non-coding RNA genes | 6108 | Small non-coding RNA genes | 7577 |
Pseudogenes | 13033 | Pseudogenes | 14729 |
- processed pseudogenes | 9772 | - processed pseudogenes | 10679 |
- unprocessed pseudogenes | 2873 | - unprocessed pseudogenes | 3535 |
- unitary pseudogenes | 39 | - unitary pseudogenes | 219 |
- polymorphic pseudogenes | 79 | - polymorphic pseudogenes | 41 |
- pseudogenes | 67 | - pseudogenes | 18 |
Immunoglobulin/T-cell receptor gene segments | | Immunoglobulin/T-cell receptor gene segments | |
- protein coding segments | 494 | - protein coding segments | 408 |
- pseudogenes | 203 | - pseudogenes | 237 |
Total No of Transcripts | 137767 | Total No of Transcripts | 206694 |
Protein-coding transcripts | 57776 | Protein-coding transcripts | 83129 |
Nonsense mediated decay transcripts | 6816 | Nonsense mediated decay transcripts | 15291 |
Long non-coding RNA loci transcripts | 18065 | Long non-coding RNA loci transcripts | 29566 |
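I used to assume that anything labelled non-coding could not be translated into amino acids. Since then, more and more papers have reported that many non-coding regions can in fact be translated. If they can be translated, they should have start and stop codons, and possibly UTRs as well. Out of curiosity, I counted the annotations of each feature type in the Ensembl GTF files for human (Homo_sapiens.GRCh38.94.gtf) and mouse (Mus_musculus.GRCm38.94.gtf), along with how many of those annotations belong to protein-coding genes. The counts are shown in the table below.
Mus_musculus.GRCm38.94.gtf | | | | Homo_sapiens.GRCh38.94.gtf | | | |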
TYPE | AllAnnotation | OnlyPcg | Ratio | TYPE | AllAnnotation | OnlyPcg | Ratio |
CDS | 512583 | 511014 | 0.996939 | CDS | 746504 | 745198 | 0.998251 |
5'UTR | 92374 | 92064 | 0.996644 | 5'UTR | 149930 | 149646 | 0.998106 |
3'UTR | 83692 | 83574 | 0.99859 | 3'UTR | 148491 | 148326 | 0.998889 |
start_codon | 58377 | 57823 | 0.99051 | start_codon | 86454 | 86115 | 0.996079 |
stop_codon | 54262 | 54141 | 0.99777 | stop_codon | 78562 | 78453 | 0.998613 |
exon | 813724 | 734421 | 0.902543 | exon | 1262162 | 1119281 | 0.886797 |
transcript | 137862 | 99138 | 0.71911 | transcript | 206601 | 151150 | 0.731603 |
gene | 54532 | 22046 | 0.404276 | gene | 58735 | 19951 | 0.339678 |
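#AllAnnotation: all annotation records in the GTF file. OnlyPcg: annotation records that come only from protein-coding genes. Ratio: OnlyPcg / AllAnnotation.
Although protein-coding genes account for only 34% (human) and 40% (mouse) of all genes annotated in Ensembl/GENCODE, nearly all of the CDS, 5'UTR, 3'UTR, start codon, and stop codon annotations come from protein-coding genes.
To check further whether non-coding genes in these annotation files carry any CDS, 5'UTR, 3'UTR, start codon, or stop codon records, I broke the counts down by gene category (see the tables below).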
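The post does not say how the counts were produced, so the following is only a minimal sketch of one way to get numbers of this kind. It assumes the standard tab-separated GTF layout, with the feature type in column 3 and a `gene_biotype "..."` attribute in column 9, as Ensembl GTF files provide. The function names `count_features` and `pcg_ratio` are hypothetical. Note that Ensembl GTF files label UTRs as `five_prime_utr` and `three_prime_utr` rather than 5'UTR and 3'UTR.

```python
import re
from collections import Counter

# Matches the gene-level biotype in the attribute column,
# e.g. gene_biotype "protein_coding";
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def count_features(lines):
    """Count GTF feature types overall and for protein-coding genes only.

    `lines` is any iterable of GTF lines (an open file handle works).
    Returns (all_counts, pcg_counts), both Counters keyed by feature type.
    """
    all_counts, pcg_counts = Counter(), Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        feature, attributes = fields[2], fields[8]
        all_counts[feature] += 1
        match = BIOTYPE_RE.search(attributes)
        if match and match.group(1) == "protein_coding":
            pcg_counts[feature] += 1
    return all_counts, pcg_counts


def pcg_ratio(all_counts, pcg_counts):
    """Fraction of each feature type that belongs to protein-coding genes."""
    return {feature: pcg_counts[feature] / total
            for feature, total in all_counts.items()}
```

Passing an open handle on Homo_sapiens.GRCh38.94.gtf to `count_features` and then calling `pcg_ratio` on the result should give values comparable to the AllAnnotation, OnlyPcg, and Ratio columns, though exact agreement with the table is not guaranteed because the author's filtering is unknown.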
Homo_sapiens.GRCh38.94.gtf | | | | | |
Type | All | Pcg | Pseudo | Lnc | Snc |
CDS | 746504 | 745198 | 533 | 0 | 0 |
5'UTR | 149930 | 149646 | 76 | 0 | 0 |
3'UTR | 148491 | 148326 | 128 | 0 | 0 |
start_codon | 86454 | 86115 | 89 | 0 | 0 |
stop_codon | 78562 | 78453 | 74 | 0 | 0 |
exon | 1262162 | 1119281 | 43197 | 90747 | 7085 |
gene | 58735 | 19951 | 15224 | 15949 | 7073 |
transcript | 206601 | 151150 | 18404 | 29237 | 7085 |
#Pcg: protein-coding genes; Pseudo: Pseudogenes; Lnc: long non-coding genes; Snc: small non-coding genes
Mus_musculus.GRCm38.94.gtf | | | | | |
Type | All | Pcg | Pseudo | Lnc | Snc |
CDS | 512583 | 511014 | 404 | 0 | 0 |
5'UTR | 92374 | 92064 | 122 | 0 | 0 |
3'UTR | 83692 | 83574 | 89 | 0 | 0 |
start_codon | 58377 | 57823 | 119 | 0 | 0 |
stop_codon | 54262 | 54141 | 90 | 0 | 0 |
exon | 813724 | 734421 | 22780 | 48732 | 6094 |
gene | 54532 | 22046 | 13037 | 12673 | 6090 |
transcript | 137862 | 99138 | 13947 | 17808 | 6091 |
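In the latest Ensembl annotations for human and mouse (release 94), neither long nor small non-coding genes have any CDS, 5'UTR, 3'UTR, start codon, or stop codon records. Pseudogenes, however, do have some of each.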
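The per-category breakdown in the two tables above could be sketched along the following lines. The post does not state which biotypes were assigned to Pcg, Pseudo, Lnc, and Snc, so the grouping here is an assumption: any biotype containing "pseudogene" counts as Pseudo, and the two biotype sets are an illustrative guess at the long and small non-coding classes. The names `classify` and `count_by_category` are hypothetical, and the resulting counts may not match the tables exactly.

```python
import re
from collections import Counter, defaultdict

BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')

# ASSUMPTION: these groupings are illustrative and are not taken from the post.
LNC_BIOTYPES = {
    "lincRNA", "antisense", "sense_intronic", "sense_overlapping",
    "processed_transcript", "3prime_overlapping_ncRNA",
    "bidirectional_promoter_lncRNA", "macro_lncRNA", "non_coding",
}
SNC_BIOTYPES = {
    "miRNA", "snRNA", "snoRNA", "rRNA", "misc_RNA", "scaRNA", "sRNA",
    "scRNA", "vaultRNA", "ribozyme", "Mt_tRNA", "Mt_rRNA",
}


def classify(biotype):
    """Map a gene_biotype string to one of the table's column labels."""
    if biotype == "protein_coding":
        return "Pcg"
    if "pseudogene" in biotype:
        return "Pseudo"
    if biotype in LNC_BIOTYPES:
        return "Lnc"
    if biotype in SNC_BIOTYPES:
        return "Snc"
    return "Other"  # e.g. IG/TR gene segments, TEC


def count_by_category(lines):
    """Return {feature_type: Counter({category: n, "All": n_total})}."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = BIOTYPE_RE.search(fields[8])
        category = classify(match.group(1)) if match else "Other"
        counts[fields[2]][category] += 1
        counts[fields[2]]["All"] += 1
    return counts
```

Keeping an explicit "Other" bucket makes it visible when the four named categories do not add up to the All column, which is also the case in the tables above.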