ChengyangWang's personal blog http://blog.sciencenet.cn/u/ChengyangWang


Ensembl, UCSC, RefSeq: which one should you use?

Posted 2017-12-15 14:42 | Personal category: RNA-seq | System category: Popular science collection | RNA-seq


This article is reposted from the 嘉因 WeChat official account with permission. For the latest articles, follow 嘉因, WeChat ID: rainbow-genome

Author: 小哈   Source: 嘉因

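Everyone can make instant noodles: some make Shin Ramyun, some make Sanxian Yimian. How do the processes differ?


Everyone can run RNA-seq: some pick out meaningful genes, some find valuable leads, and some... Where does the difference come from?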


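The previous four installments covered sound approaches to data normalization, differential gene selection, heatmap drawing, and enrichment analysis:


Part 1: Data preprocessing: For the same RNA-seq dataset, why do the company's results differ from the ones a senior labmate ran? | TPM, read counts, RPKM/FPKM: did you pick the right one?


Part 2: Differential gene selection: For the same RNA-seq dataset, why do the differential genes selected by the company differ from the ones a senior labmate selected? | P value, FDR, cutoff


Part 3: Heatmap: A badly drawn heatmap leads to wrong conclusions | Data preprocessing, cluster analysis, and the finer points of HCL and K-means


Part 4: Enrichment analysis: Two people's enrichment results differ by 5 years | How old is your annotation file?


小哈 told us to compute read counts, but why do my read counts still differ from the company's? This installment goes back to look at how the gene model chosen at the mapping step affects the results.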



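Once you have the sequencing data, the first step is to map the reads back to the genome. This requires the genome sequence in a FASTA file, and the aligner also needs to be told where the genes are on the genome, i.e. the gene model, which is stored in a GTF file.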
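To make the idea of a gene model in a GTF file concrete, here is a minimal Python sketch that parses a single GTF record. It assumes the standard nine tab-separated GTF columns; the example record, including `demo_source`, `GENE0001`, and `DEMO1`, is made up for illustration and is not a real gene.

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dict; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    # The 9th column holds 'key "value";' pairs. Repeated keys overwrite
    # earlier ones here, which is a simplification.
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attrs,
    }


# Made-up record for illustration only (not a real gene).
example = 'chr1\tdemo_source\tgene\t1000\t5000\t.\t+\t.\tgene_id "GENE0001"; gene_name "DEMO1";'
record = parse_gtf_line(example)
length = record["end"] - record["start"] + 1
print(record["attributes"]["gene_name"], length)
```

Because the start and end coordinates come straight from the annotation, two gene models that disagree on them will hand the aligner and the read counter different definitions of the same gene.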


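For human or mouse data, use GENCODE. So how do the well-known Ensembl, UCSC, and RefSeq relate to GENCODE? And which one should you use for other species?


Ensembl says that the current GENCODE release = Ensembl: www.ensembl.org/Help/Faq?id=303

GENCODE itself points out that there are still some differences from Ensembl and that the GTF files differ slightly: www.gencodegenes.org/faq.html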



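How much does the choice among Ensembl, UCSC, and RefSeq affect the results? Someone has done a dedicated comparison.

First, they assessed the overall impact of the three gene models on mapping;


then they looked in detail at the specific impact on individual genes, with examples.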



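The conclusions first:


  • The gene model affects gene expression estimates and even the selection of differentially expressed genes, especially where gene models disagree on the length or junction sites of certain genes;


  • Ensembl's annotation is comparatively more accurate and contains more genes;


  • GENCODE is recommended for human and mouse, since it comes from ENCODE, the most authoritative source; for other species, use Ensembl.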




Now let's go through the results in the paper one by one:


The read mapping summary in the “transcriptome only” and “transcriptome + genome” mapping modes:


  • more reads are mapped in Ensembl than in RefGene and UCSC in the “transcriptome only” mode


  • more reads become multiple-mapped in Ensembl than in RefGene and UCSC


  • The RefGene and UCSC consistently have the highest percentage of uniquely mapped reads;

  • while the percentage of non-uniquely mapped reads is much higher in Ensembl.

  • Without a gene model (indicated in pink) in the mapping step, a constant 6% of reads become unmapped.


Uniquely mapped reads were divided into two classes, i.e., non-junction reads and junction reads, and the impact of a gene model on their mapping was investigated.


  • The impact of a gene model on mapping of non-junction reads is different from junction reads.

  • For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used.

  • By contrast, this percentage dropped to 53% for junction reads.

  • In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively.
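The junction versus non-junction comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's pipeline: a read is treated as a junction read when its SAM CIGAR string contains an `N` operation (a skipped region of the reference, which spliced aligners use for introns), and "same location" means identical chromosome, start position, and CIGAR. The read names and alignments are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_junction_read(cigar):
    """True if the CIGAR string contains an N operation (skipped reference region)."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def same_location_fraction(run_a, run_b):
    """Fraction of reads present in both runs with identical (chrom, pos, cigar)."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    same = sum(1 for name in shared if run_a[name] == run_b[name])
    return same / len(shared)


# Hypothetical alignments: read name -> (chromosome, 1-based start, CIGAR).
with_model_a = {
    "r1": ("chr1", 100, "75M"),
    "r2": ("chr1", 500, "40M1000N35M"),
    "r3": ("chr2", 900, "20M300N55M"),
}
with_model_b = {
    "r1": ("chr1", 100, "75M"),
    "r2": ("chr1", 500, "40M1200N35M"),  # same read, different splice placement
    "r3": ("chr2", 900, "20M300N55M"),
}

# Classify reads by their alignment under gene model A.
junction_a = {n: a for n, a in with_model_a.items() if is_junction_read(a[2])}
nonjunction_a = {n: a for n, a in with_model_a.items() if not is_junction_read(a[2])}

junction_same = same_location_fraction(junction_a, with_model_b)
nonjunction_same = same_location_fraction(nonjunction_a, with_model_b)
print(junction_same, nonjunction_same)
```

In this toy data the non-junction read lands in the same place under both models while one of the two junction reads does not, which mirrors the direction of the 95% versus 53% result reported above.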



The overlap and intersection among RefGene, UCSC, and Ensembl annotations


  • In general, different annotations have very high overlaps: there are 21,598 common genes shared by all three gene models.

  • RefGene has the fewest unique genes

  • while more than 50% of genes in Ensembl are unique.
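An overlap summary of this kind can be computed with plain set operations. The sketch below assumes that genes are matched across annotations by a shared identifier (how the paper matched genes is not stated here), and the gene names `g1` to `g8` are placeholders rather than real data.

```python
def summarize_overlap(annotations):
    """Return (genes common to all annotations, genes unique to each annotation)."""
    common = set.intersection(*annotations.values())
    unique = {}
    for name, genes in annotations.items():
        # Union of every other annotation's genes.
        others = set().union(*(g for n, g in annotations.items() if n != name))
        unique[name] = genes - others
    return common, unique


# Placeholder gene sets for illustration only.
annotations = {
    "RefGene": {"g1", "g2", "g3"},
    "UCSC": {"g1", "g2", "g3", "g4"},
    "Ensembl": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
}

common, unique = summarize_overlap(annotations)
for name, genes in annotations.items():
    share = len(unique[name]) / len(genes)
    print(name, "unique genes:", len(unique[name]), f"({share:.0%} of its genes)")
print("common to all:", len(common))
```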


The correlation of gene quantification results between RefGene and Ensembl


  • Although the majority of genes have highly consistent or nearly identical expression levels, there are many genes whose quantification results are dramatically affected by the choice of a gene model
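One way to look for such genes in your own data is to correlate log-transformed counts from the two gene models and flag the genes that disagree. The sketch below is a minimal illustration: the `log2(count + 1)` transform and the two-fold threshold are assumptions, `geneA` to `geneC` are hypothetical, and only the PIK3CA counts (1094 versus 492) are taken from the example discussed in this post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def discordant_genes(counts, min_abs_log2_diff=1.0):
    """Genes whose log2(count + 1) differs by at least the threshold between models."""
    flagged = []
    for gene, (ensembl, refgene) in counts.items():
        diff = math.log2(ensembl + 1) - math.log2(refgene + 1)
        if abs(diff) >= min_abs_log2_diff:
            flagged.append(gene)
    return flagged


# gene -> (read count with Ensembl, read count with RefGene).
# geneA..geneC are hypothetical; the PIK3CA counts are the ones quoted in this post.
counts = {
    "geneA": (100, 102),
    "geneB": (2500, 2480),
    "geneC": (30, 31),
    "PIK3CA": (1094, 492),
}

ensembl_log = [math.log2(e + 1) for e, _ in counts.values()]
refgene_log = [math.log2(r + 1) for _, r in counts.values()]
r = pearson(ensembl_log, refgene_log)
flagged = discordant_genes(counts)
print(f"Pearson r = {r:.3f}; discordant genes: {flagged}")
```

A high overall correlation can coexist with individual genes that differ more than two-fold, so it is worth checking the flagged list rather than the correlation alone.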




Looking at the read counts of individual genes, the counts computed with Ensembl and with RefGene can be far apart. Why? The two genes below serve as examples.



The different gene definitions for PIK3CA give rise to differences in gene quantification


  • PIK3CA in the Ensembl annotation is much longer than its definition in RefGene, explaining why there are 1094 reads mapped to PIK3CA in Ensembl, while only 492 reads are mapped in RefGene.

  • The PIK3CA gene definition in Ensembl seems more accurate than the one in RefGene, based upon the mapping profile of sequence reads.
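The effect of gene length on read counts can be reproduced with a toy overlap counter. This sketch is a deliberate simplification: it treats a gene as a single interval and ignores exon structure, strand, and multi-mapping, and all coordinates are hypothetical rather than the real PIK3CA coordinates.

```python
def count_overlapping(reads, gene_start, gene_end):
    """Count reads whose [start, end] interval overlaps the gene interval.

    All coordinates are 1-based and inclusive.
    """
    return sum(1 for start, end in reads if start <= gene_end and end >= gene_start)


# Hypothetical 75 bp reads as (start, end) on one chromosome.
reads = [(120, 194), (300, 374), (480, 554), (600, 674), (850, 924), (1200, 1274)]

# Two hypothetical definitions of the same gene: a short one and a longer one.
short_definition = (100, 500)
long_definition = (100, 900)

short_count = count_overlapping(reads, *short_definition)
long_count = count_overlapping(reads, *long_definition)
print(short_count, long_count)
```

The same set of aligned reads yields a larger count under the longer definition, simply because more reads fall inside the annotated gene.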

The different gene definitions for LUZP6.


  • In the Ensembl annotation, LUZP6 is only 177 bp long, and it is completely within another gene, MTPN.

  • As a result, all sequence reads originating from LUZP6 are assigned to MTPN instead.


  • In RefGene, LUZP6 and MTPN are derived from the same genomic region, and both encode exactly the same mRNA, though the protein coding sequences are different.

  • Therefore, all reads mapped to this region are equally distributed between these two genes.


The correlation of the calculated Log2Ratio (heart/liver) between RefGene and Ensembl.


  • Although the majority of genes have highly consistent expression changes, there are many genes that are remarkably affected by the choice of different gene models.
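To see how this can change a differential expression call, the following sketch computes Log2Ratio (heart/liver) under each gene model and compares the resulting calls. The pseudocount of 1, the two-fold cutoff, and all counts are assumptions chosen for illustration; they are not values from the paper.

```python
import math


def log2_ratio(heart, liver, pseudocount=1.0):
    """Log2 ratio of heart over liver counts, with a pseudocount to avoid log(0)."""
    return math.log2((heart + pseudocount) / (liver + pseudocount))


def call(ratio, cutoff=1.0):
    """Classify a log2 ratio as 'up', 'down', or 'unchanged' at the given cutoff."""
    if ratio >= cutoff:
        return "up"
    if ratio <= -cutoff:
        return "down"
    return "unchanged"


# Hypothetical counts: gene -> model -> (heart, liver).
counts = {
    "geneA": {"RefGene": (400, 100), "Ensembl": (410, 105)},
    "geneB": {"RefGene": (300, 100), "Ensembl": (300, 250)},
}

disagreements = []
for gene, by_model in counts.items():
    calls = {model: call(log2_ratio(*pair)) for model, pair in by_model.items()}
    if len(set(calls.values())) > 1:
        disagreements.append(gene)
    print(gene, calls)
print("calls differ between gene models for:", disagreements)
```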





https://blog.sciencenet.cn/blog-3372875-1089846.html
