Bioinformatics Stronghold - GC: Computing GC Content
Identifying Unknown DNA Quickly
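Early software told languages apart by counting how often each letter appears in a piece of text. In a sufficiently long text, every language shows its own characteristic letter frequencies, and software can recognize that pattern to identify the language accurately.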
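The same idea applies to species. The genomes of two individuals of the same species differ, but any two humans typically share at least 99.9% of their roughly 3.2 billion base pairs. When researchers study a sequence from an unknown species, they try to identify which known organisms its DNA is related to. Because the bases on the two strands of DNA pair complementarily, the amounts of cytosine (C) and guanine (G) in a double-stranded DNA molecule are always equal. So, when comparing an unknown DNA sample against a known database, we compute its GC-content to obtain its frequency pattern.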
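In eukaryotes, the DNA of most species has a GC-content close to 50%. Because genomes are very large, however, even small differences in GC-content can be used to tell species apart. In addition, most prokaryotes have a GC-content above 50%, so a small amount of DNA is often enough to distinguish eukaryotes from prokaryotes.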
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
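The definition above can be checked with a few lines of Python. This is a minimal sketch, not code from the original post: the helper names `gc_content` and `reverse_complement` are made up for illustration, and the input is assumed to contain only the symbols A, C, G and T.

```python
def gc_content(dna: str) -> float:
    """Return the percentage of symbols in `dna` that are 'C' or 'G'."""
    if not dna:
        raise ValueError("GC-content is undefined for an empty string")
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G, reversed)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


print(gc_content("AGCTATAG"))                      # 37.5
print(gc_content(reverse_complement("AGCTATAG")))  # 37.5
```

Complementing swaps every C for a G and every G for a C, and reversing only changes the order, so the combined count of C and G stays the same. That is why both calls print the same value.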
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
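To make the FASTA layout concrete, here is a small sketch of a parser that uses only the Python standard library. The function name `parse_fasta` and the two example records are invented for illustration; the records are not the Rosalind sample dataset.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA label (the text after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]          # a '>' line starts a new record
            records[label] = ""
        elif label is not None:
            records[label] += line    # sequence lines are joined in order
    return records


# Made-up records for illustration only.
example = """>Rosalind_0001
ACGT
ACGT
>Rosalind_0002
GGCC
"""
print(parse_fasta(example.splitlines()))
# {'Rosalind_0001': 'ACGTACGT', 'Rosalind_0002': 'GGCC'}
```

A sequence may span several lines, so the parser keeps appending to the current record until it meets the next line that begins with '>'.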
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
Sample Output
Solution
Create a new text file, paste the sequences from the Sample Dataset into it, and save it as rosalind_gc.txt.
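This solution uses Biopython's file input; interested readers can find out more on the Biopython website (http://biopython.org). The rest of the logic is simple: a for loop walks through the sequences, an if statement checks whether the current sequence has a higher GC-content than the best one seen so far, and once every sequence has been checked the result is printed.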
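The for-and-if approach described above might look roughly like the sketch below. It is not the code from the original post: to stay self-contained it reads FASTA with a small hand-written generator instead of Biopython, the names `read_fasta` and `highest_gc` are made up, and the demo records are invented rather than taken from the Sample Dataset. With Biopython installed, the generator could presumably be replaced by `Bio.SeqIO.parse(handle, "fasta")`.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs from the lines of a FASTA file."""
    label, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif line and label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def highest_gc(lines: Iterable[str]) -> Tuple[str, float]:
    """Return the label and GC percentage of the record with the highest GC-content."""
    best_id, best_gc = "", -1.0
    for seq_id, seq in read_fasta(lines):   # the for loop visits every record
        if not seq:
            continue                        # ignore records with no sequence
        seq = seq.upper()
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
        if gc > best_gc:                    # the if test keeps the running maximum
            best_id, best_gc = seq_id, gc
    if best_gc < 0:
        raise ValueError("no sequences found")
    return best_id, best_gc


# Made-up records for illustration only.
demo = [">Rosalind_0001", "ATAT", "GC", ">Rosalind_0002", "GGCCAT"]
best_id, best_gc = highest_gc(demo)
print(best_id)           # Rosalind_0002
print(f"{best_gc:.6f}")  # 66.666667
```

To run it on the file created earlier, pass an open file handle instead of the list, for example `with open("rosalind_gc.txt") as handle: print(*highest_gc(handle))`. Printing six decimal places stays well within the 0.001 error that Rosalind allows.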
Rosalind is a platform for learning bioinformatics and programming through problem solving.