||
今天用自己的数据做了一下circos圈图,当然,是在生信媛那三篇教程的基础之上(http://mp.weixin.qq.com/s/qXDlKf7ut8QF2c5_0hiKpAhttp://www.biotrainee.com/thread-755-1-1.html),修改了部分参数。其实,主要是数据格式的处理过程,我采用python脚本处理的,处理起来感觉还好,没有多少难度,因为大部分的数据是提取两列而已,不需要我进行更复杂的处理,更复杂的处理估计也有软件解决,不需要我造轮子了。
http://www.zd200572.com/2017/10/12/circos_learnning-2/
下面是我的过程记录:
我的原始数据来自美吉生物的结果,有一个assemble文件夹里边有一个fasta_name.txt,有这些统计信息,想起来了,我个我是用一个shell命令获得的,但是这些信息都在那个组装好的基因组文件里。
cat d-1.Contigs.fna | grep ^'>' > fasta_name.txt
把数据处理成circos格式,一个python脚本解决(虽然python水平蹩脚,但是能解决小问题了):
fin = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin: li = [] contig = line.strip().split(' ')[0].split('>')[1] dict[contig] = '' li.append(start) end += int(line.strip().split('=')[1].split(' ')[0]) li.append(end) dict[contig] = li start = end #print(dict)fout = open('tRNA.gff', 'w')fin2 = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\tRNA\\tRNA.gff')for line in fin2: #print(line) contig = line.strip().split(' ')[0].strip() print(contig) start1 = int(line.strip().split(' ')[2]) print(start1) end1 = int(line.strip().split(' ')[3]) print(end1) #break start1 += dict[contig][0] end1 += dict[contig][0] fout.write(contig + ' ' + str(start1) + ' ' + str(end1) + '\n')fout.close()
上面获得了,contig的数据,基本的环就形成了,然后再按生信媛的教程加上刻度不表。下面我选择了测序数据报告里的tRNA、rRNA和基因的预测数据,进行了处理,处理方式都是一样的,可以用一个脚本改改参数解决。主要处理方式就是,把那个在每个contig上的相对刻度转化为基因组的绝对刻度,我是字典解决问题的。脚本如下:
fin = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin: li = [] contig = line.strip().split(' ')[0].split('>')[1] dict[contig] = '' li.append(start) end += int(line.strip().split('=')[1].split(' ')[0]) li.append(end) dict[contig] = li start = end #print(dict)fout = open('predict.gff', 'w')fin1 = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\predict\\ddm-1.ffn')for line in fin1: if line.startswith('>'): contig = line.strip().split(' ')[3] start1 = int(line.strip().split(' ')[4]) end1 = int(line.strip().split(' ')[5]) start1 += dict[contig][0] end1 += dict[contig][0] fout.write(contig + ' ' + str(start1) + ' ' + str(end1) + '\n')fout.close()
其实,我个步骤对于我来说是最难的,我试图找一个软件来解决,但是碍于自己知识不足,都搜不到相关信息,只有自己造轮子了,写了一个小脚本竟然解决问题了,有点成就感。主要思路就是按生信媛教程里的数据格式,每1000bp计算一次gc含量,对了,还有偏移可以用,由于不清楚正负链的情况,就一锅端了。脚本如下:
fin1 = open('fasta_name.txt')contig = ''start = 0end = 0dict = {}for line in fin1: li = [] contig = line.strip().split(' ')[0].split('>')[1] dict[contig] = '' li.append(start) end += int(line.strip().split('=')[1].split(' ')[0]) li.append(end) dict[contig] = li start = end #print(dict)i = 1fin = open('G:\\Users\\shenyou\\Desktop\\ddm-1\\assembly\\ddm-1.Contigs.fna')seq = ''di = {}flag = 0#sequence = ''for line in fin: if line.startswith('>'): if seq != '': di[contig] = seq #print(contig, di[contig][:10]) contig = line.strip().split(' ')[0].split('>')[1] #print(contig) i += 1 #break di[contig] = '' seq= '' else: seq += line.strip() #sequence += line.strip()seq_gc = 0.6537235194562051#print(di.keys())fout = open('gc.txt', 'w')for a in di.keys(): #print(a) j = 0 for j in range(int(len(di[a]) / 1000) - 1): if j == 0: if len(di[a]) >= 1000: #print(str(di[a])) seqi_gc = (di[a][0:1000].count('G') + di[a][0:1000].count('C')) / 1000 start = dict[a][0] + 1 end = 1000 + dict[a][0] elif len(di[a]) < 1000: #print(str(di[a])) seqi_gc = (di[a][0:len(di[a])].count('G') + di[a][0:len(di[a])].count('C')) / len(di[a]) start = dict[a][0] + 1 end = len(di[a]) + dict[a][0] elif 1 <= (j + 1) < int(len(di[a]) / 1000): #print(str(di[a])) seqi_gc = (di[a][j*1000:(j + 1)*1000].count('G') + di[a][j*1000:(j + 1)*1000].count('C')) / 1000 start = j*1000 + dict[a][0] + 1 end = (j + 1)*1000 + dict[a][0] elif (j + 1) == int(len(di[a]) / 1000): #print(str(di[a])) seqi_gc = (di[a][j*1000:len(di[a])].count('G') + di[a][j*1000:len(di[a])].count('C')) / (len(di[a]) - j*1000) start = j*1000 + dict[a][0] + 1 end = len(di[a]) + dict[a][0] print(start, end) fout.write(str(a) + ' ' + str(start) + ' ' + str(end) + ' ' + str(seqi_gc) + '\n') j += 1 #breakfout.close()
根据教程,我的配置文件如下,没有出现报错的情况,但是想要掌握这个画图,还是要下很多很多功夫的。
<<include etc/colors_fonts_patterns.conf>><<include etc/image.conf>>karyotype = karyotype.txtchromosomes_units = 100000chromosomes_display_default = yesdefault = 0.005rradius = 0.80rthickness = 6pfill = yesfill_color = deepskybluestroke_color = blackstroke_thickness = 1pshow_label = yeslabel_font = lightlabel_radius = 1r + 110plabel_size = 30label_parallel = yes<<include etc/housekeeping.conf>>show_ticks = yesshow_tick_labels = yesskip_first_label = noskip_last_label = noradius = dims(ideogram,radius_outer)color = blackthickness = 2psize = 30pmultiplier = 1e-6format = %.2fspacing = 20000bsize = 10pshow_label = nothickness = 3pspacing = 100000bsize = 20pshow_label = yeslabel_size = 25plabel_offset = 10pformat = %.2fz = 0file = predict.gffr0 = 0.90rr1 = 0.99rfill_color = 169,169,169file = tRNA.gffr0 = 0.81rr1 = 0.90rfill_color = 169,169,169file = rRNA.gffr0 = 0.27rr1 = 0.36rtype = linethickness = 1pz = 2max_gap = 0ufile = gc.txtcolor = redfill_color = redr0 = 0.69rr1 = 0.78rorientation = outz = 2max_gap = 0ufile = gc_a_m.txtcolor = aquamarinefill_color = aquamariner0 = 0.48rr1 = 0.57rorientation = out
最基础的图画好了,凑活着看了,没想到第一次画就能画成这样,感谢教程。
Archiver|手机版|科学网 ( 京ICP备07017567号-12 )
GMT+8, 2024-10-20 05:07
Powered by ScienceNet.cn
Copyright © 2007- 中国科学报社