While studying mathematics as an undergraduate, Haussler worked for his brother Mark in a molecular biology laboratory at the University of Arizona. David recalls this happy time:
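“We extracted vitamin D hormone receptors from the intestines of chicks that had been deprived of vitamin D, and used the extract to study vitamin D levels in human blood samples. My job was to kill the chicks, extract the vitamin D receptors from their guts, run the assay with them, and finally do the mathematical analysis of the experimental results. The work was very successful and led to a paper in Science. But it was right there that I decided I liked mathematics better than molecular biology.”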
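After killing many chicks, David decided to pursue a doctorate in the Department of Computer Science at the University of Colorado, studying with Andrzej Ehrenfeucht. He was excited by the interplay between computation and logic; recognizing Ehrenfeucht as one of the leaders in that area, he sought him out as his advisor. His early papers ranged across different areas of mathematics, and he took part in a discussion group organized by Ehrenfeucht whose main topic was DNA. This was the early 1980s, when the complete sequences of a few viruses were first becoming available. Two other students in the group went straight into careers in bioinformatics: Gene Myers, who assembled the human genome at Celera Genomics, and Gary Stormo, who did pioneering work on motif finding. Haussler published a few bioinformatics papers during this period, but he felt there was not yet enough data to support bioinformatics as a whole field, so he kept his distance and waited for technology to advance enough for it to take off. Instead, Haussler followed another of his interests, artificial intelligence, because he wanted to understand how the human brain works. He began building artificial neural networks and adaptive computer algorithms that improve their performance as they encounter more data. The study of adaptation and learning theory eventually led him to HMMs (Hidden Markov Models).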
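In the early 1990s, molecular biologists began producing data much faster, and Haussler's interest in molecular biology was rekindled. He changed his scientific goal from understanding how the brain works to understanding how cells work. He began applying the same kinds of models used in speech and formal grammar analysis to biological sequences, laying a foundation for further work along these lines by many other scientists in the field. In particular, his group developed HMM methods for gene prediction and protein classification. The HMM vocabulary is rich enough to let bioinformaticians build models that capture much of the quirkiness of real DNA, and so these models became popular.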
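The HMM approach did not come out of nowhere. David Sankoff, Michael Waterman, and Temple Smith had brought dynamic programming algorithms to the alignment of biological sequences. David Searls had drawn the analogy between biological sequences and the strings produced by a formal grammar. Gary Stormo and Chip Lawrence had introduced key probabilistic ideas for sequence classification and motif finding. HMMs were just waiting to be brought into the field, because they tie all of these things together in one simple, natural framework.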
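Haussler began applying HMMs to biological sequence analysis when Anders Krogh joined his group as a postdoc. Both were familiar with HMMs, and Haussler had kept up his interest in DNA and protein sequences. One day they started discussing the bold idea that HMMs might make good models of protein sequences. After a quick check of the literature, they found that the young field of bioinformatics was ripe with the beginnings of this idea, but that no one had really put the pieces together and exploited it. So they dove in and built HMMs to recognize different protein families, showing how the approach could unify the dynamic programming, grammatical, and maximum-likelihood methods of statistical inference that were then becoming popular in bioinformatics. David Haussler says: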
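“The epiphany Anders and I had about applying HMMs to protein sequences was certainly one of the most exciting moments of my academic career. But I have to say that the most exciting moment came later, when my group took part in the effort to sequence the human genome. Jim Kent, then a graduate student at UCSC, was the first person able to computationally assemble the public sequencing project's data into a working draft. We were racing against Gene Myers' team at Celera, who were assembling their own draft genome. On July 7, 2000, shortly after Francis Collins and Craig Venter jointly announced at a White House ceremony that their two teams had successfully assembled the human genome, we released the public working draft on the Internet. That moment on July 7, when a flood of A, C, T, and G letters came across my computer screen as they passed over the network to people all over the world, was the most exciting moment of my scientific life. For me it symbolized the entire Human Genome Project, public and private, and the boundless determination of the many scientists involved. It seems unthinkable that, out of a primordial soup of organic molecules, life forms would eventually evolve whose cells carry their vital messages in DNA from generation to generation, diversifying and expanding their abilities as the message grew more complex, until one day one of those species would acquire the ability to decode and share its own DNA message. Yet that day had come.”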
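Haussler's approach to research has always sat at the extreme interdisciplinary end of the spectrum. Most of his insights have come from applying the perspective of one field to problems in another.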
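“I have tried to stay focused on the big scientific questions rather than limit my approach to fit any established, narrow method of inquiry. I always try to listen carefully to the few best scientists in all fields, rather than to all the scientists in one field.”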
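Haussler believes that a key to discovery is picking the right problem. Important scientific problems become ripe at particular times. Before that, they are out of reach because the foundation needed to solve them has not been laid. After that, they are no longer as important because the heart of the problem has already been solved. But recognizing when a scientific problem is ripe for solution is a difficult art. Breadth of focus helps a great deal, and so does a good measure of luck. Haussler says: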
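“We have not really begun to understand how a cell works, so there are still plenty of interesting problems for bioinformaticians. Computational models for this problem will probably look nothing like the methods we have today. Problems of understanding how cells work and how they form organs, bodies, and brains that function as they do will likely keep anyone in this field from being bored for a long time. Applying what we learn to the practice of medicine will be another great challenge. But much foundational work remains to be done before we can do justice to the loftiest of these problems. One important piece of foundational work in which bioinformatics can play a key role is understanding the evolutionary history of entire genomes. With the complete genome sequences of many different species, we can begin to use comparative genomics to reconstruct the major events in the evolutionary history of our own genome. This will take more than raw genome data; it will also require new mathematics and algorithms suited to the task. People have always been fascinated by origins, and the mystery of our own origins is the greatest of these fascinations. So I expect such research will never lack for interest.”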
Original text:
David Haussler (born October 1953 in California)
currently holds the University of California Presidential Chair in Computer
Science at the University of California at Santa Cruz (UCSC). He is also an
investigator with the Howard Hughes Medical Institute. He was a pioneer in the
application of machine learning techniques to bioinformatics and he has played
a key role in sequencing the human genome.
While an undergraduate studying mathematics,
Haussler worked for his brother Mark in a molecular biology laboratory at the
University of Arizona. David has fond memories of this time:
We extracted
vitamin D hormone receptors from the intestines of chicks that were deprived of
vitamin D and used the extract to study the level of vitamin D in human blood
samples. My jobs were to sacrifice the chicks, extract the vitamin D receptors
from their guts, perform the assay with them, and finally do the mathematical
analysis on the results of these experiments. The work was quite successful,
and led to a publication in Science. But it was there that I decided that I was
more fond of mathematics than I was of molecular biology.
After sacrificing many
chicks, David decided to pursue his doctorate at the University of Colorado to
study with Professor Andrzej Ehrenfeucht in the Department of Computer Science.
Excited about the interaction between computation and logic, he recognized that
Ehrenfeucht was one of the leaders in that area and sought him out as an
advisor. While his early papers were in various areas of mathematics, he
participated in a discussion group organized by Ehrenfeucht that was dominated
by discussions about DNA. This was
at a time in the early 1980s when the first complete sequences from a few
viruses had become available. Two other students in this group went directly on
to careers in bioinformatics: Gene Myers, who put together the human genome
assembly for Celera Genomics, and Gary Stormo, who did pioneering work on motif
finding. While Haussler did produce a few papers in bioinformatics in this
period, at the time he felt there were not enough data to sustain an entire
field of bioinformatics. So he remained rather aloof from the field, waiting
for technological advances that would allow it to take off. Haussler instead
followed another interest, the study of artificial intelligence, because he
wanted to try to understand how the brain works. He became involved with building artificial neural networks
and designed adaptive computer algorithms that can improve their performance as
they encounter more data. The study of adaptation and learning theory led him
into HMMs.
By the
early 1990s, molecular biologists had begun to churn out data much more rapidly.
Haussler's interest in molecular biology was rekindled, and he switched his
scientific goal from understanding how the brain works to understanding how
cells work. He began to apply the same types of models used for speech and
formal grammar analysis to the biological sequences, providing the foundation
for further work along these lines by many other scientists in the field. In
particular, his group developed HMM approaches to gene prediction and protein
classification. The HMM vocabulary is rich enough to allow a bioinformaticist
to build a model that captures much of the quirkiness of actual DNA. So these
models caught on.
The HMM approach did not appear from nowhere. Foundational work by David
Sankoff, Michael Waterman, and Temple Smith had formalized the dynamic
programming methods to align biosequences. David Searls had made the analogy
between biosequences and the strings produced by a formal grammar. Gary Stormo
and Chip Lawrence had introduced key probabilistic ideas to the problem of
sequence classification and motif finding. HMMs were just begging to be introduced into the field since
they combined all these things in a simple and natural framework.
Haussler began applying
HMMs to biosequence analysis when Anders Krogh joined his group as a postdoc.
Both were familiar with HMMs, and Haussler still maintained an interest in DNA
and protein sequences. One day they began to talk about the crazy idea that
HMMs might make good models for protein sequences, and after a quick
examination of the literature, they were surprised to find that the young field
of bioinformatics was then ripe with the beginnings of this idea, but no one
had really put all the pieces together and exploited it. So they dove in and
built HMMs to recognize different protein families, demonstrating how this
methodology could unify the dynamic programming, grammatical, and
maximum-likelihood methods of statistical inference that were then becoming
popular in bioinformatics. David
Haussler says:
The epiphany that Anders and I had about HMMs for protein sequences was
certainly one of the most exciting moments in my career. However, I would have
to say that the most exciting moment came later, when my group participated in
efforts to sequence the human genome, and Jim Kent, then a graduate student at
UCSC, was the first person able to computationally assemble the public sequencing
project's data to form a working draft. We raced with Gene Myers' team at
Celera as they assembled their draft genome. On July 7, 2000, shortly after
Francis Collins and Craig Venter jointly announced in a White House ceremony
that their two teams had successfully assembled the human genome, we released
the public working draft onto the World Wide Web. That moment on July 7, when
the flood of A, C, T, and G of the human genome sequence came across my
computer screen, passing across the net as they were to thousands of other
people all over the world, was the most exciting moment in my scientific life.
For me, it was symbolic of the entire Human Genome Project, both public and
private, and the boundless determination of the many scientists involved. It seems unthinkable that out of a
primordial soup of organic molecules, life forms would eventually evolve whose
cells would carry their vital messages forward in DNA from generation to
generation, ever diversifying and expanding their abilities as this message
became more complex, until one day one of these species would accumulate the
facility to decode and share its own DNA message. Yet that day had arrived.
Haussler's approach to
research has always been at the extreme interdisciplinary end of the spectrum.
Most of his insights have come from the application of perspectives of one
field to problems encountered in another.
I've tried to stay focused on the big
scientific questions, and never limit my approach to conform to the norms of any
well-established and narrow method of inquiry. I try to always listen carefully
to the few best scientists in all fields, rather than all the scientists in one
field.
Haussler
thinks that one key to discovery is picking the right problem. Important scientific problems become
ripe at particular times. Before that they are unapproachable because the
foundation required for their solution has not been laid. After that they are
no longer as important because the heart of the problem has already been solved.
Knowing when a scientific problem is ripe for solution is a difficult art,
however. Breadth of focus helps.
Luck doesn't hurt either. Haussler says:
We have not even really
begun to understand how a cell works, so there are plenty of interesting problems
left for bioinformaticians. Computational models for this problem probably
won't look anything like what we have today. Problems associated with
understanding how cells form organs, bodies, and minds that function as they do
are also likely to keep anyone from being bored in this field for quite some
time. Applying what we learn to
the practice of medicine will pose further challenging problems. However, much
foundational work remains to be done before we can do justice to the loftiest
of these problems. One important
piece of foundational work where bioinformatics will play the key role is in
understanding the evolutionary history of entire genomes. With the full genome
sequences of many different species, we can begin to try to reconstruct the key
events in the evolution of our own genome using new comparative genomics
approaches. This will require more than raw genome data. It will demand the
development of new mathematics and algorithms appropriate to the task. People have always been fascinated with
origins, and the mystery of our own origins has been paramount among these
fascinations. Thus, I anticipate no lack of interest in such investigations.