|||
“我在1990年的一次会议上发表了时序推理的工作,而后Gene Lawler告诉我这个模型和DNA绘制问题符合的相当好。我那时对DNA什么都不知道,就开始阅读学习,我还给Eric Lander描述了有绘制数据的时序推理的结果。一些星期之后,Rutgers University组织了一个在计算生物学的开放日。Mike Waterman和Eric Lander来做了非常精彩的基因组计划和计算方面挑战的报告。Eric甚至提到我们的结果是计算机科学对这一领域做贡献的一个重要的例子。我认为他非常夸大了这一点,而这两个报告驱使我开始阅读生物学的教课书和文章。在第一年,我得到了我妻子Michal非常大的帮助,她恰好是一位生物学家,耐心地解答了我许多不起眼的问题。”
这个偶然的会议,Gene Lawer和Eric Lander把Shamir变成了生物信息学家。从1991年他在计算生物学的算法上投入了越来越多的时间。他仍然对图算法问题感兴趣,经常应用图理论技巧解决生物学问题。近些年,他趋向于把离散数学技巧和概率论结合起来,Ron说
“我仍然关注问题的复杂性和算法,也关心和热爱对一个问题NP完全性的证明。而生物学的训练使我当理论计算机科学方法论只能带你走那么远时,而数据却需要的更多的时候,‘容忍’启发式的方法。”
Shamir早期在生物信息学中提倡严密的聚类算法。虽然聚类算法理论在六十年代已经发展起来了,他对聚类的兴趣是在九十中期作为和在柏林的Max Plank Institute的Hans Lehrach的合作的一部分开始的。Lehrach是研究寡核苷酸指纹图谱技术之父,这个技术提早了DNA芯片技术。Lehrach在柏林有一个“工厂”,它能够排列成千的未知cDNA克隆,而后与k-mer(长度为k的DNA序列)探针进行杂交(一般是8-mers或9-mers)这些重复成打的探针,给出每一个cDNA克隆一个指纹图谱。为寻找哪些克隆对应相同的基因,就必须根据指纹图谱,对他们进行聚类,数据中的噪声水平使得精确聚类是个非常大的挑战。
正是这个挑战让Shamir开始思考聚类。由于聚类领域是非常混杂的,散布在不同的领域,他首先开始发展他的算法,而不是学习文献。很自然的,Shamir凭借图论算法,用聚类的单元(在指纹图谱的应用中是克隆)作为节点,被边连在一起的克隆是具有相似的指纹图谱,自然下一步是根据最小分割来重复分割图(最小分割就是那些去除之后使图失去连接的边的集合)。由此他的第一个聚类算法诞生了。事实上这个想法对Shamir来说那么自然,他的学生Erez Hartuv用了很长的时间查找文献,只是确认它还没有被发表,他们没有发现相同的方法,而是发现十年前有类似的想法,对Shamir来说:
“对于普遍存在的聚类问题,我一点也不奇怪在未来发现一篇考古,动物学或其他领域的旧文章有相同的想法。”
然后,在权衡了这个方法的利与弊之后,与他的学生Roded Sharan, Shamir改进了算法,加入了缜密的统计学分析,建立了这个广受欢迎的聚类算法CLICK,由于微阵列表达谱的大量数据的发表,这给它提供了巨大的机会,激发了Shamir与Amir Ben-Dor和Zoha Yakhini进一步开发了PCC和CAST。
Shimir是一位非常少的能够在算法和生物信息学的科研中领军人物的科学家之一,他在成为一个生物信息学家之前已经是算法领域有突破贡献的科学家,他说:
“可能在我的科学生涯中,最令我感到兴奋的时刻是我的1983年博士的结果,证明了Simplex算法的平均复杂性是二次的。Simplex算法是很常用的算法之一,在科学和技术领域差不多有成千的应用实例。从1950年开始,Simplex算法被认为在实践中非常有效率,过去和现在被普遍的应用着。可是Simplex对不同的变种的最差时间是指数的,有多个例子,没有人证明它有效率。这是一个尴尬的处境,解决它是非常可喜的。我与导师Ilan Ader(凸规划方向的领军人物,是Simplex算法之父Dantzig的学生)和Richard Karp(伟大理论计算机学家之一)一起对它进行了工作,这个合作是极端令人兴奋的。类似的结果在同一时间由其他的研究组同时得到。其它令人振奋的时刻,特别是在某天发现了一些非常吃惊的新结果,而第二天又发现其中的错误的之间的那些时刻,可这些正是科学研究懊恼的一种回报。”
Shimir把自己定位于不只投入研究一个特定的问题。在大学的设置中,科学研究的优势在于自由的选择某人的兴趣和嗅觉,而不是在同一问题中强迫工作很多年。好在生物学令人兴奋的快速进展提出了连续不断的新问题、想法和数据。当然,研究者不能因此呼吸影响研究深度的有害空气。实际上,除了“不投入”的想法,Ron发现自己在相同的科研领域持续工作几年的时间,跟踪这些领域的文献一段时间。他趋向于在几个不同领域的问题同时工作,并试图找出它们的一些交叉,这样就省去了理解和跟踪几个领域的努力,Shamir说:
“就我的观点,一个开放的头脑,敏感的嗅觉,在相关领域的坚实基础,以及努力的工作是科学发现中最重要的因素。就象法官说的那样,‘灵感只是出现在法律图书馆的凌晨三点时’,这在科学中更加真实。灵活性和敏感性对意外的发现是非常重要的。”
Shamir认为基因网络的研究是理解基因、蛋白质和其它分子如何象开演奏会一样工作的,这个问题将是一个长期的挑战。
原文:
Ron Shamir (born 1953 in Jerusalem) is a
professor at the School of Computer Science at Tel Aviv University. He holds an
undergraduate degree in mathematics and physics from Hebrew University, Jerusalem
and a PhD in operations research from the University of California at Berkeley.
He was a pioneer in the application of sophisticated graphtheoretical
techniques in bioinformatics.
Shamir's interests have always revolved around the design and analysis
of algorithms and his PhD dissertation was on studies of linear programming
algorithms. Later he developed a
keen interest in graph algorithms that eventually brought him (in a roundabout
way) to computational biology. Back in 1990 Shamir was collaborating with Martin
Golumbic on problems related to temporal reasoning. In these problems, one has
a set of events, each modeled by an unknown interval on the timeline, and a
collection of constraints on the relations between each two events, for
example, whether two intervals overlap or not. One has to determine if there
exists a realization of the events as intervals on the line, satisfying all the
constraints. This is a generalization of the problem that Seymour Benzer faced
trying to establish the linearity of genes, but at the time Shamir did not know
about Benzer's work. He says:
I presented my work on temporal reasoning at a
workshop in 1990 and the late Gene Lawler told me this model fits perfectly the
DNA mapping problem. I knew nothing about DNA at the time but started reading,
and I also sent a note to Eric Lander describing the temporal reasoning result
with the mapping consequences. A few weeks later Rutgers University organized a
kickoff day on computational biology and brought MikeWaterman and Eric Lander
who gave fantastic talks on the genome project and the computational
challenges. Eric even mentioned our result as an important example of computer
science contribution to the field. I think he was exaggerating quite a bit on
this point, but the two talks sent me running to read biology textbooks and
papers. I got a lot of help during the first years from my wife Michal, who
happens to be a biologist and patiently answered all my trivial questions.
This chance meeting with
Gene Lawler and Eric Lander converted Shamir into a bioinformatician and since
1991 he has been devoting more and more time to algorithmic problems in
computational biology. He still keeps an interest in graph algorithms and often
applies graph-theoretical techniques to biological problems. In recent years,
he has tended to combine discrete mathematics techniques and probabilistic
reasoning. Ron says:
I still worry about the complexity of the problems and the algorithms,
and would care about (and enjoy) the NP-hardness proof of a new problem, but
biology has trained m e to be "Äútolerant‚" to heuristics when the theoretical
computer science methodologies only take you so far, and the data require
more.
Shamir was
an early advocate of rigorous clustering algorithms in bioinformatics. Although clustering theory has been under
development since the 1960s, his interest in clustering started in themid-1990s
as part of a collaboration with Hans Lehrach of the Max Planck Institute in
Berlin. Lehrach is one of the fathers of the oligonucleotide fingerprinting
technology that predates the DNA chips techniques. Lehrach had a "factory"in Berlin that would array thousands of unknown cDNA clones, and then hybridize
them to k-mer probes (typically 8-mers or 9-mers). This was repeated with
dozens of different probes, giving a fingerprint for each cDNA clone. To find
which clones correspond to the same gene, one has to cluster them based on
their fingerprints, and the noise level in the data makes accurate clustering
quite a challenge.
It was this challenge that brought Shamir to think about clustering. As
clustering theory is very heterogeneous and scattered over many different
fields, he set out to develop his algorithm first, before learning the
literature. Naturally, Shamir
resorted to graph algorithms, modeled the clustered elements (clones in the
fingerprinting application) as vertices, and connected clones with highly
similar fingerprints by edges. It was a natural next step to repeatedly
partition the graph based on minimum cuts (sets of edges whose removal makes
the graph disconnected) and hence his first clustering algorithm was born.
Actually the idea looked so natural to Shamir that he and his student Erez
Hartuv searched the literature for a long time just to make sure it had not
been published before. They did not find the same approach 382 10 Clustering
and Trees but found similar lines of thinking in works a decade earlier.
According to Shamir,
Given the ubiquity of clustering, I would not be surprised to unearth in
the future an old paper with the same idea in archeology, zoology, or some
other field.
Later, after considering the pros and cons of the approach, together
with his student Roded Sharan, Shamir improved the algorithm by adding a
rigorous probabilistic analysis and created the popular clustering algorithm
CLICK. The publication of vast data sets of microarray expression profiles was
a great opportunity and triggered Shamir to further develop PCC and CAST
together with Amir Ben-Dor and Zohar Yakhini.
Shamir is one of the very few scientists who is
regarded as a leader in both algorithms and bioinformatics andwho was
creditedwith an important breakthrough in algorithms even before he became a bioinformatician.
He says:
Probably
the most exciting moment inmy scientific life was my PhD result in 1983,
showing that the average complexity of the Simplex algorithm is quadratic. The
Simplex algorithm is one of the most commonly used algorithms, with thousands
of applications in virtually all areas of science and engineering. Since the
1950s, the Simplex algorithm was known to be very efficient in practice and was
(and is) used ubiquitously, but worst-case exponential-time examples were given
for many variants of the Simplex, and none was proved to be efficient. This was
a very embarrassing situation, and resolving it was very gratifying. Doing it together with my supervisors
Ilan Adler (a leader in convex optimization, and a student of Dantzig, the
father of the Simplex algorithm) and Richard Karp (one of the great leaders of
theoretical computer science) was extremely exciting. Similar results were
obtained independently at the same time by other research groups. There were
some other exhilarating moments, particularly those between finding out an
amazing new result and discovering the bug on the next day, but these are the
rewards and frustrations of scientific research.
Shamir views himself as not committed to a
particular problem. One of the advantages of scientific research in the
university setting is the freedom to follow one's interests and nose, and not
be obliged to work on the same problem for years. The exciting rapid
developments in biology bring about a continuous flow of new problems, ideas,
and data. Of course, one has to make sure breadth does not harm the research
depth. In practice, in spite of this "uncommitted" ideology, Ron finds
himself working on the same areas for several years, and following the
literature in these fields a while later. He tends to work on several problem
areas simultaneously but tries to choose topics that are overlapping to some
extent, just to save on the effort of understanding and following several
fields. Shamir says:
In my opinion, an openmind, high antennae, a solid background in the
relevant disciplines, and hard work are the most important elements of
discovery. As a judge once said, "Inspiration only comes to the law library
at 3 AM".This is even more true in science. Flexibility and sensitivity
to the unexpected are crucial.
Shamir
views the study of gene networks, that is, understanding how genes, proteins,
and other molecules work together in concert, as a longstanding challenge.
Archiver|手机版|科学网 ( 京ICP备07017567号-12 )
GMT+8, 2024-12-19 06:14
Powered by ScienceNet.cn
Copyright © 2007- 中国科学报社