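Stormo entered the University of Colorado at Boulder as a graduate student and quickly became interested in the problem of understanding gene regulation, working in Larry Gold's lab. During his graduate years, methods for sequencing DNA were developed, so he suddenly had many examples of regulatory sites that he could compare with one another and with the mutants he had collected. Together with Tom Schneider, he set out to write a collection of programs for various kinds of analysis on the available sequence data. At that time neither the algorithms nor the mathematics were very difficult, and even quite simple approaches were new and useful. He did venture into some artificial intelligence techniques that took effort to understand, but the biggest challenge was that they had to do everything themselves. GenBank did not yet exist, so they had to build their own database to store their DNA sequences and annotations. They even had to type in most of the data by hand (which required extensive error checking), because in those days most sequences were simply published in journals.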
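As part of his doctoral thesis, Stormo developed profiles (also called position weight matrices) as a better representation of regulatory sites than consensus sequences, and he published several ways to derive profiles depending on the type of data available and the intended use. Beyond that, he was looking for a solution to a particular problem: given a sample of DNA sequences that contain regulatory sites at unknown positions, discover the profile matrix of those sites. This is now known as the Motif Finding Problem. A few years earlier, Michael Waterman had published an algorithm for discovering a consensus sequence from a sample of DNA sequences, and Stormo wanted to do the same with a profile representation. The problem has two natural aspects: how to find the correct alignment of the regulatory sites without examining every possible alignment, and how to evaluate different alignments in order to choose the best one. For the evaluation step he used the entropy-based information content measure from Tom Schneider's doctoral thesis, because it has good statistical properties and because they had shown that, under some simplifying assumptions, it is directly related to the binding energy of the protein at these sites. In retrospect the approach seems almost trivial, but at the time it took them considerable effort to devise it. It is the approach used in the greedy CONSENSUS program.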
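The evaluation step described above can be illustrated with a minimal sketch. It assumes a set of already aligned, equal-length sites, a uniform background over A, C, G and T, and no small-sample correction. The function names `count_profile` and `information_content` are hypothetical and chosen for illustration only; this is not the code of the CONSENSUS program.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def count_profile(sites):
    """Build a profile: one dict of base counts per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    columns = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        columns.append({base: counts[base] for base in ALPHABET})
    return columns


def information_content(sites, background=None):
    """Sum over columns of the relative entropy (in bits) between the
    observed base frequencies and the background frequencies.

    Assumption: uniform background unless one is supplied.
    """
    if background is None:
        background = {base: 1.0 / len(ALPHABET) for base in ALPHABET}
    total = 0.0
    for column in count_profile(sites):
        for base in ALPHABET:
            freq = column[base] / len(sites)
            if freq > 0:  # treat 0 * log(0) as 0
                total += freq * math.log2(freq / background[base])
    return total


# Made-up example sites, not data from the text.
conserved = ["TATAAT", "TATAAT", "TATAAT", "TATAAT"]
mixed = ["TATAAT", "TACGAT", "TATCAT", "GATAAT"]

print(information_content(conserved))  # 12.0 bits: 6 columns x 2 bits each
print(information_content(mixed))      # lower, since some columns vary
```

A greedy search in this spirit would score candidate alignments with such a measure and keep the highest-scoring ones at each step, rather than enumerating every possible alignment.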
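Stormo was confident the idea would work, and it did, as long as the problem was not too difficult: the pattern had to carry enough information content to stand out from the background. He knew this would be a very useful tool, although at the time nobody anticipated that DNA array experiments would make it even more useful by making it much easier to obtain samples of putatively coregulated genes. Of course, such real data are noisier than originally thought, so the algorithms have had to become more robust.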
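One of Stormo's most enjoyable scientific experiences came soon after he received his PhD, when he worked on a collaborative project with his adviser, Larry Gold, and Pete von Hippel of the University of Oregon. Gold's group had previously shown that a T4 phage gene known as “32” (which participates in replication, recombination, and repair) also regulates its own synthesis at the translational level. Von Hippel's group had measured the binding parameters of the protein, and another group had recently sequenced the gene and its regulatory region. By combining sequence analysis with the protein's binding parameters, including comparison with other gene sequences, they were able to propose a model of the protein's activity in gene regulation. A few years later Stormo helped fill in more details of the model by comparing the regulatory regions of the closely related phages T2 and T6. The comparison showed a conserved pseudoknot structure that acts as a nucleation site for autogenous binding. Stormo says: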
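“This result was very satisfying because so many different aspects of the problem, from biophysical measurements to genetics to sequence analysis, came together to describe a truly interesting example of gene regulation.”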
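Scientific discoveries can come in many ways, and the most important thing is to be ready for them. Some people pick a particular problem and work on it very hard, bringing every available tool to bear and even inventing new ones to try to solve it. Another way is to look for connections between different problems, or to apply methods from one field to problems in another. Gary thinks this interdisciplinary approach is particularly useful in bioinformatics, although focused hard work on a specific problem is also important. His research has always followed his interests, which can easily wander from the initial focus. He feels that if he had followed a more consistent line of work he might have made a bigger contribution in certain areas, but he really enjoys reading widely and working on problems where he can contribute, even when they lie outside his main research areas.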
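“I think the regulation of gene expression will remain an important problem for a long time. Although significant progress has been made, many connections between regulatory factors and the genes they regulate remain to be worked out. In addition, a great deal of gene regulation happens post-transcriptionally, and we are only beginning to look at that in a systematic way. The ultimate goal of truly understanding complete regulatory networks is a major challenge. I also think evolutionary biology will become an increasingly important topic for understanding the diversity of life on the planet.”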
Original text
Gary Stormo, born 1950 in South Dakota, is currently a professor in the Department of Genetics at Washington University in St. Louis. Stormo went to Caltech as a physics major, but switched to biology in his junior year. Although that was only at an undergraduate level, the strong introduction to the physical sciences and math helped prepare him for the opportunities that came later. He has a PhD in Molecular Biology from the University of Colorado at Boulder. His principal research interests center around the analysis of gene regulation and he was an early advocate of using computers to infer regulatory motifs and understand gene regulation.
He went to the University of Colorado in Boulder as a graduate student and quickly got excited about understanding gene regulation, working in the lab of Larry Gold. During his graduate career, methods for sequencing DNA were developed so he suddenly had many examples of regulatory sites that he could compare to each other, and could also compare to the mutants that he had collected. Together with Tom Schneider he set out to write a collection of programs for various kinds of analysis on the sequences that were available. At the time neither the algorithms nor the math were very difficult; even quite simple approaches were new and useful. He did venture into some artificial intelligence techniques that took some effort to understand, but the biggest challenge was that they had to do everything themselves. GenBank didn’t exist yet so they had to develop their own databases to keep all of the DNA sequences and their annotation, and they even had to type most of them in by hand (with extensive error checking) because in those days most sequences were simply published in journals.
As part of his thesis work he developed profiles (also called position weight matrices) as a better representation of regulatory sites than simple consensus sequences. He had published a few different ways to derive the profile matrices, depending on the types of data available and the use to be made of them. But he was looking for a method to discover the matrix if one only knew a sample of DNA sequences that had the regulatory sites somewhere within them at unknown positions, the problem that is now known as the Motif Finding problem. A few years earlier Michael Waterman had published an algorithm for discovering a consensus motif from a sample of DNA sequences, and Stormo wanted to do the same thing with a profile representation. The problem has two natural aspects to it: how to find the correct alignment of regulatory sites without examining all possible alignments, and how to evaluate different alignments so as to choose the best. For the evaluation step he used the entropy-based information content measure from Tom Schneider’s thesis because it had nice statistical properties and they had shown that, with some simplifying assumptions, it was directly related to the binding energy of the protein to the sites. In retrospect it seems almost trivial, but at the time it took them considerable effort to come up with the approach that is employed in the greedy CONSENSUS program.
Stormo knew that the idea would work, and of course it did, so long as the problem wasn’t too difficult: the pattern had to have sufficient information content to stand out from the background. He knew this would be a very useful tool, although at the time nobody anticipated DNA array experiments which make it even more useful because one can get samples of putatively coregulated genes so much more easily. Of course these data have more noise than originally thought, so the algorithms have had to become more robust.
One of Stormo’s most enjoyable scientific experiences came soon after he got his PhD and began working on a collaborative project with his adviser, Larry Gold, and Pete von Hippel at the University of Oregon. Gold’s lab had previously shown that the gene in the T4 phage known as “32” (which participates in replication, recombination, and repair) also regulated its own synthesis at the translational level. Von Hippel’s group had measured the binding parameters of the protein, while another group had recently sequenced the gene and its regulatory region. By combining the binding parameters of the protein with analysis of the sequence, including a comparison to other gene sequences, they were able to provide a model for the protein’s activity in gene regulation. A few years later Stormo got to help fill in some more details of the model through a comparison of the regulatory region from the closely related phages T2 and T6 and showed that there was a conserved pseudoknot structure that acted as a nucleation site for the autogenous binding. Stormo says:
This was very satisfying because of how all of the different aspects of the problem, from biophysical measurements to genetics to sequence analysis came together to describe a really interesting example of gene regulation.
Discoveries can come in many different ways, and the most important thing is to be ready for them. Some people will pick a particular problem and work on it very hard, bringing all of the tools available, even inventing new ones, to try and solve it. Another way is to look for connections between different problems, or methods in one field that can be applied to problems in another field. Gary thinks this interdisciplinary approach is particularly useful in bioinformatics, although the hard work focused on specific problems is also important. His research style has always been to follow his interests which can easily wander from an initial focus. He feels that if he had followed a more consistent line of work he could have made more progress in certain areas, but he really enjoys reading widely and working on problems where he can make a contribution, even if they are outside his major research areas.
I think the regulation of gene expression will continue to be an important problem for a long time. Although significant progress has been made, there are still lots of connections to be made between transcription factors and the genes they regulate. Plus lots of gene regulation happens post-transcriptionally and we are just beginning to look at that in a systematic way. The ultimate goal of really understanding the complete regulatory networks is a major challenge. I also think evolutionary biology will be an increasingly important topic for understanding the diversity of life on the planet.