TickingClock's personal blog http://blog.sciencenet.cn/u/TickingClock


Rosalind 11 - Computing GC Content

4494 views | 2017-10-27 10:54 | Personal category: Python Learning | System category: Research notes

Bioinformatics Stronghold - GC: Computing GC Content


Identifying Unknown DNA Quickly


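Early computer programs distinguished between languages by counting how often each letter appears in a piece of text. In a sufficiently long text, each language has a characteristic letter-frequency profile, and software can recognize that pattern to identify the language accurately.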


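The same holds for species. Although the genomes of two individuals of the same species differ, two people typically share at least 99.9% of their roughly 3.2 billion base pairs. When researchers study a sequence from an unknown species, they try to work out which organism the DNA came from. Because the bases on the two strands of DNA pair complementarily, cytosine (C) and guanine (G) always occur in equal amounts in a double-stranded DNA molecule. So, when comparing an unknown DNA sample against a database of known sequences, we compute its GC-content, the percentage of its bases that are C or G, to obtain its frequency profile.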


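In eukaryotes, the DNA of most species has a GC-content close to 50%. Because genomes are so large, however, even small differences in GC-content can be used to tell species apart. In addition, most prokaryotes have a GC-content above 50%, so a small amount of DNA is often enough to distinguish eukaryotes from prokaryotes.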


Problem


The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
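The definition above can be sketched in a few lines of plain Python. The function names `gc_content` and `reverse_complement` are illustrative and not part of the Rosalind problem, and the sketch assumes an uppercase string containing only A, C, G and T.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    if not dna:
        raise ValueError("GC-content is undefined for an empty string")
    return 100.0 * (dna.count("C") + dna.count("G")) / len(dna)


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(dna))


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps every C with a G and vice versa, so the combined count of C and G is unchanged, which is why both calls print the same value.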


DNA strings must be labeled when they are consolidated (organized and merged) into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
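To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. The name `parse_fasta` is hypothetical, and the sketch assumes well-formed input in which every sequence is preceded by a '>' line; it is not a substitute for a full parser such as Biopython's `SeqIO`.

```python
def parse_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new '>' line ends the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
"""
records = list(parse_fasta(sample.splitlines()))
for label, sequence in records:
    print(label, len(sequence))
```

Each record's sequence lines are joined into a single string, so later steps can treat it as one DNA string regardless of how it was wrapped in the file.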


In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.


Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).


Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
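The tolerance means a submitted value only has to lie within 0.001 of the reference value, so there is no need to reproduce every floating-point digit. A minimal sketch of such a check follows; `within_tolerance` is an illustrative name rather than anything Rosalind provides, and the values reuse the 37.5% example from the problem statement.

```python
def within_tolerance(answer, expected, tolerance=0.001):
    """Return True if answer differs from expected by at most tolerance."""
    return abs(answer - expected) <= tolerance


print(within_tolerance(37.5004, 37.5))  # True
print(within_tolerance(37.6, 37.5))     # False
```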


Sample Dataset


>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT


Sample Output


Rosalind_0808
60.919540


Solution


Create a new txt file, paste the sequences from the Sample Dataset into it, and name it rosalind_gc.txt.


>>> from Bio import SeqIO
>>> Target_seq = SeqIO.parse('rosalind_gc.txt', 'fasta')
>>> GC_content = 0.0
>>> ID = ''
>>> for target in Target_seq:
...     # GC-content of this record, as a percentage of its length
...     gc = 100.0 * (target.seq.count('C') + target.seq.count('G')) / len(target.seq)
...     # Keep the record with the highest GC-content seen so far
...     if gc > GC_content:
...         GC_content = gc
...         ID = target.id
...
>>> print(ID)
Rosalind_0808
>>> print('%.6f' % GC_content)
60.919540
>>>


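This solution uses Biopython's file input; interested readers can find out more on the Biopython website (http://biopython.org). The rest of the syntax is simple: a for loop combined with an if statement repeatedly checks whether the current sequence has the highest GC-content so far, and the result is printed once every sequence has been checked.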
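The same search can also be written without an explicit comparison inside the loop, by letting Python's built-in `max()` pick the label with the largest GC-content. This is an alternative sketch rather than the code used above: it does not need Biopython, the sequences are the Sample Dataset with their line breaks joined, and the names `gc_content` and `sequences` are illustrative.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    return 100.0 * (dna.count("C") + dna.count("G")) / len(dna)


# Sample Dataset, keyed by FASTA label (adjacent string literals are joined).
sequences = {
    "Rosalind_6404": (
        "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC"
        "TCCCACTAATAATTCTGAGG"
    ),
    "Rosalind_5959": (
        "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT"
        "ATATCCATTTGTCAGCAGACACGC"
    ),
    "Rosalind_0808": (
        "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC"
        "TGGGAACCTGCGGGCAGTAGGTGGAAT"
    ),
}

# max() compares the labels by the GC-content of their sequences.
best_id = max(sequences, key=lambda name: gc_content(sequences[name]))
print(best_id)                                  # Rosalind_0808
print("%.6f" % gc_content(sequences[best_id]))  # 60.919540
```

Using `max()` with a `key` function keeps the "which one is largest" logic in one place, at the cost of computing the GC-content of the winning sequence a second time when printing it.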


Over


Rosalind is a platform for learning bioinformatics and programming through problem solving. Take a tour to get the hang of how Rosalind works.


P.S. You are welcome to follow the WeChat public account Plant_Frontiers.


https://blog.sciencenet.cn/blog-3158122-1082620.html
