Personal space of 葡萄皮 http://blog.sciencenet.cn/u/Hadron74

Introducing Several Bioinformaticians from Abroad (4): David Sankoff

2011-5-20 07:08 | Personal category: Bioinformatics | System category: Research notes | Informatics

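David Sankoff currently holds the Canada Research Chair in Mathematical Genomics at the University of Ottawa. He studied at McGill University, where he earned a PhD in probability theory under Donald Dawson with a thesis on stochastic models for historical linguistics. In 1969 he joined the new Centre de recherches mathématiques (CRM) at the University of Montreal, and from 1984 to 2002 he was also a professor in the Department of Mathematics and Statistics. He is one of the founding fathers of bioinformatics, with fundamental contributions dating back to the early 1970s.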

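Sankoff was trained in mathematics and physics. His undergraduate summers in the early 1960s, however, were spent in a microbiology laboratory at the University of Toronto, helping with virology experiments and passing his evenings and weekends in the library reading biology journals. It was an exciting time, and keeping up with the molecular biology literature did not require much background: the Watson-Crick model was not yet ten years old, the deciphering of the genetic code was still incomplete, and mRNA was only just being discovered. Thanks to this experience, Sankoff had no difficulty communicating a few years later with Robert J. Cedergren, a biochemist with a visionary interest in applying computers to problems in molecular biology.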

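In 1971, Cedergren asked Sankoff to find a way to align RNA sequences. Sankoff knew little about algorithm design and nothing about discrete dynamic programming, although as an undergraduate he had in effect used the latter to solve an economics problem of matching buyers and sellers. The same approach worked for sequence alignment. Bob and David became hooked on the problem and went on to explore statistical tests for alignment and related questions, fortunately before they realized that Needleman and Wunsch had already published a dynamic programming technique for comparing biological sequences.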
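For readers who have not seen dynamic programming applied to alignment, here is a minimal sketch in the spirit of the Needleman-Wunsch technique mentioned above. It is not Sankoff and Cedergren's program: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences.

    Illustrative scoring only (assumed values): +1 per match,
    -1 per mismatch, -1 per gap position.
    """
    # prev[j] = best score for aligning the prefix of a processed so far with b[:j].
    # Before any character of a is read, b[:j] can only be aligned against j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Option 1: align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Option 2: a[i-1] against a gap; option 3: b[j-1] against a gap.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGU", "ACU"))
```

Only two rows of the table are kept at a time, which is enough for the score; recovering the alignment itself would require keeping the full table and tracing back through it.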

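A new question emerged early in Sankoff and Cedergren's work: multiple alignment and its relevance to molecular evolution. Sankoff was already familiar with phylogeny problems from his earlier research on language families and from taking part in the early numerical taxonomy meetings (before the schism between the parsimony-promoting cladists, led by Steve Farris, and the more statistically oriented systematists). Combining phylogenetics with sequence comparison led him to tree-based dynamic programming for multiple alignment. Phylogenetic problems have continued to crop up in Sankoff's research projects in the decades since.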

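Sankoff and Cedergren also studied RNA folding, using several passes of dynamic programming to build energy-optimal RNA structures. They did not find the loop-matching approach reported by Daniel Kleitman's group (later integrated by Michael Zuker into a general, widely used algorithm), but they eventually made a number of contributions in the 1980s, in particular on the problem of multiple loops and on simultaneous alignment and folding. Sankoff says: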

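“My collaboration with Cedergren also ran into its share of dead ends. Applying multidimensional scaling to ribosome structure did not lead very far, efforts to trace the origin of the genetic code through phylogenetic analysis of tRNA sequences eventually petered out, and an attempt at dynamic programming for consensus folding of proteins was a flop.”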

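The early and mid-1970s were nevertheless a highly productive period for Sankoff. He was also working on the statistical analysis of grammatical variation in natural languages, on game-theoretic models of electoral processes, and on various applied mathematics projects in archaeology, geography, and physics. He got Peter Sellers interested in sequence comparison; Sellers later attracted attention by converting the longest common subsequence (LCS) formulation into the edit distance version. Sankoff also worked with the prominent mathematician Vaclav Chvatal on the expected length of the LCS of two random sequences, for which they derived upper and lower bounds. Several generations of probabilists have since contributed to narrowing these bounds. Sankoff says: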

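“Evolutionary biologists Walter Fitch and Steve Farris spent sabbaticals with me at the CRM, as did computer scientist Bill Day, who generously added my name to a series of papers establishing the hardness of various phylogeny problems, most importantly the parsimony problem.”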
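As a side note on the longest common subsequence and edit distance formulations mentioned earlier, the link between them can be sketched in a few lines. This is only the simplest case, assumed here for illustration: when insertions and deletions are the only allowed operations and each costs 1, the distance between two sequences equals the sum of their lengths minus twice the LCS length. Sellers's formulation is more general than this, and the function names `lcs_length` and `indel_distance` are made up for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    # prev[j] = LCS length of the prefix of a processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best common subsequence of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of one prefix or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance when only unit-cost insertions and deletions are allowed
    (an assumed, simplified cost model)."""
    # Every character outside a longest common subsequence must be
    # deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs_length(a, b)


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"), indel_distance("AGGTAB", "GXTXAYB"))
```

The same `lcs_length` function could be run on many pairs of random sequences to get an empirical feel for the expected LCS length that the bounds described earlier are about.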

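In 1987, Sankoff became a Fellow of the new Evolutionary Biology Program of the Canadian Institute for Advanced Research (CIAR). At the very first meeting of the CIAR program, he was inspired by a talk by Monique Turmel comparing the chloroplast genomes of two species of algae. This set Sankoff on the comparative genomics and genome rearrangement track that has been his main line of research ever since. At first he took a probabilistic approach, but within a year or two he was trying to develop algorithms and programs for reversal distance. A phylogeny based on the reversal distances among sixteen mitochondrial genomes showed that a strong phylogenetic signal can be conserved in the gene order of even a minuscule genome across many hundreds of millions of years. Sankoff says: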

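“The network of fellows and scholars of the CIAR program, including Bob Cedergren, Ford Doolittle, Franz Lang, Mike Gray, Brian Golding, Mike Zuker, Claude Lemieux, and others across Canada, together with a stellar group of international advisors (such as Russ Doolittle, Michael Smith, Marcus Feldman, and Wally Gilbert) and associates (Mike Waterman, Joe Felsenstein, Mike Steel, and many others), became my virtual ‘home department’: a source of intellectual support, knowledge, and experience across multiple disciplines, and a sounding board for the latest ideas.

“My comparative genomics research received two key boosts in the 1990s. One was the sustained collaboration of a series of outstanding students and postdocs: Guillaume Leduc, Vincent Ferretti, John Kececioglu, Mathieu Blanchette, Nadia El-Mabrouk, and David Bryant. The other was meeting Joe Nadeau. I already knew his seminal paper with Taylor on estimating the number of conserved linkage segments, and I realized that our interests coincided perfectly while our backgrounds were complementary.”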

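When Nadeau arrived in Montreal for a short-lived appointment in the Human Genetics Department at McGill, it took no more than an hour for him and Sankoff to start a major collaboration. They reformulated the Nadeau-Taylor approach in terms of gene content data, freeing it from measurements of physical or genetic distance. The resulting simpler model let them explore the mathematical properties of the Nadeau-Taylor model thoroughly and experiment with the consequences of deviating from it.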

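The synergy between the algorithmic and probabilistic sides of comparative genomics has become fundamental to how Sankoff understands evolution. The algorithmic side is an ambitious attempt at deep inference, resting on heavy assumptions and on the sophisticated but inflexible mathematics those assumptions make possible. The probabilistic side is more descriptive and less explicitly revealing of historical process, but statistical models are easy to generalize, their hypotheses can be weakened or strengthened, and their robustness can be assessed. In Sankoff's view, it is the interplay between the two that makes whole-genome comparison the most interesting research topic of today and of the near future.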

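“My approach to research is not highly planned. It is not that I lack a vision of the general direction in which to go, but I have no specific set of tools that I apply as a matter of course, only an intuition about what type of method or model, what database or display, might be helpful. When I am lucky I can proceed from one small epiphany to another, working out some of the details each time, until a clear story emerges. Whether this involves stochastic processes, combinatorial optimization, or differential equations is secondary; it is the biology of the problem that drives its mathematical formulation. I am rarely motivated to research well-studied problems; instead I find myself confronting new problems in relatively unstudied areas. Alignment was not a burning preoccupation of biologists or computer scientists when I started working on it, and neither was genome rearrangement fifteen years later. I am quite pleased, though sometimes bemused, by the veritable tidal wave of computational biologists and bioinformaticians who have inundated a field where there were only a few isolated researchers thirty or even twenty years ago.”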


Original text:

David Sankoff currently holds the Canada Research Chair in Mathematical Genomics at the University of Ottawa. He studied at McGill University, doing a PhD in Probability Theory with Donald Dawson, and writing a thesis on stochastic models for historical linguistics. He joined the new Centre de recherches mathématiques (CRM) of the University of Montreal in 1969 and was also a professor in the Mathematics and Statistics Department from 1984–2002. He is one of the founding fathers of bioinformatics whose fundamental contributions to the area go back to the early 1970s.
    Sankoff was trained in mathematics and physics; his undergraduate summers in the early 1960s, however, were spent in a microbiology lab at the University of Toronto helping out with experiments in the field of virology and whiling away evenings and weekends in the library reading biological journals. It was exciting, and did not require too much background to keep up with the molecular biology literature: the Watson-Crick model was not even ten years old, the deciphering of the genetic code was still incomplete, and mRNA was just being discovered. With this experience, Sankoff had no problems communicating some years later with Robert J. Cedergren, a biochemist with a visionary interest in applying computers to problems in molecular biology.
    In 1971, Cedergren asked Sankoff to find a way to align RNA sequences. Sankoff knew little of algorithm design and nothing of discrete dynamic programming, but as an undergraduate he had effectively used the latter in working out an economics problem matching buyers and sellers. The same approach worked with alignment. Bob and David became hooked on the topic, exploring statistical tests for alignment and other problems, fortunately before they realized that Needleman and Wunsch had already published a dynamic programming technique for biological sequence comparison.

A new question that emerged early in the Sankoff and Cedergren work was that of multiple alignment and its pertinence to molecular evolution. Sankoff was already familiar with phylogeny problems from his work on language families and participation in the early numerical taxonomy meetings (before the schism between the parsimony-promoting cladists, led by Steve Farris, and the more statistically oriented systematists). Combining phylogenetics with sequence comparison led to tree-based dynamic programming for multiple alignment. Phylogenetic problems have cropped up often in Sankoff’s research projects over the following decades.
    Sankoff and Cedergren also studied RNA folding, applying several passes of dynamic programming to build energy-optimal RNA structures. They did not find the loop-matching reported by Daniel Kleitman’s group (later integrated into a general, widely-used algorithm by Michael Zuker), though they eventually made a number of contributions in the 1980s, in particular to the problem of multiple loops and to simultaneous alignment and folding. Sankoff says:
    My collaboration with Cedergren also ran into its share of dead ends. Applying multidimensional scaling to ribosome structure did not lead very far, efforts to trace the origin of the genetic code through the phylogenetic analyses of tRNA sequences eventually petered out, and an attempt at dynamic programming for consensus folding of proteins was a flop.
    The early and mid-1970s were nevertheless a highly productive time for Sankoff; he was also working on probabilistic analysis of grammatical variation in natural languages, on game theory models for electoral processes, and various applied mathematics projects in archaeology, geography, and physics. He got Peter Sellers interested in sequence comparison; Sellers later attracted attention by converting the longest common subsequence (LCS) formulation to the edit distance version. Sankoff collaborated with prominent mathematician Vaclav Chvatal on the expected length of the LCS of two random sequences, for which they derived upper and lower bounds. Several generations of probabilists have contributed to narrowing these bounds. Sankoff says:
    Evolutionary biologists Walter Fitch and Steve Farris spent sabbaticals with me at the CRM, as did computer scientist Bill Day, generously adding my name to a series of papers establishing the hardness of various phylogeny problems, most importantly the parsimony problem.
    In 1987, Sankoff became a Fellow of the new Evolutionary Biology Program of the Canadian Institute for Advanced Research (CIAR). At the very first meeting of the CIAR program he was inspired by a talk by Monique Turmel on the comparison of chloroplast genomes from two species of algae. This led Sankoff to the comparative genomics–genome rearrangement track that has been his main research line ever since. Originally he took a probabilistic approach, but within a year or two he was trying to develop algorithms and programs for reversal distance. A phylogeny based on the reversal distances among sixteen mitochondrial genomes proved that a strong phylogenetic signal can be conserved in the gene order of even a minuscule genome across many hundreds of millions of years. Sankoff says:
    The network of fellows and scholars of the CIAR program, including Bob Cedergren, Ford Doolittle, Franz Lang, Mike Gray, Brian Golding, Mike Zuker, Claude Lemieux, and others across Canada; and a stellar group of international advisors (such as Russ Doolittle, Michael Smith, Marcus Feldman, Wally Gilbert) and associates (Mike Waterman, Joe Felsenstein, Mike Steel and many others) became my virtual “home department,” a source of intellectual support, knowledge, and experience across multiple disciplines and a sounding board for the latest ideas.
    My comparative genomics research received two key boosts in the 1990s. One was the sustained collaboration of a series of outstanding students and postdocs: Guillaume Leduc, Vincent Ferretti, John Kececioglu, Mathieu Blanchette, Nadia El-Mabrouk and David Bryant. The second was my meeting Joe Nadeau; I already knew his seminal paper with Taylor on estimating the number of conserved linkage segments and realized that our interests coincided perfectly while our backgrounds were complementary.

    When Nadeau showed up in Montreal for a short-lived appointment in the Human Genetics Department at McGill, it took no more than an hour for him and Sankoff to get started on a major collaborative project. They reformulated the Nadeau-Taylor approach in terms of gene content data, freeing it from physical or genetic distance measurements. The resulting simpler model allowed them to thoroughly explore the mathematical properties of the Nadeau-Taylor model and to experiment with the consequences of deviating from it.
    The synergy between the algorithmic and probabilistic aspects of comparative genomics has become basic to how Sankoff understands evolution. The algorithmic is an ambitious attempt at deep inference, based on heavy assumptions and the sophisticated but inflexible mathematics they enable. The probabilistic is more descriptive and less explicitly revelatory of historical process, but the models based on statistics are easily generalized, their hypotheses weakened or strengthened, and their robustness ascertained. In Sankoff’s view, it is the playing out of this dialectic that makes the field of whole-genome comparison the most interesting topic of research today and for the near future.
    My approach to research is not highly planned. Not that I don’t have a vision about the general direction in which to go, but I have no specific set of tools that I apply as a matter of course, only an intuition about what type of method or model, what database or display, might be helpful. When I am lucky I can proceed from one small epiphany to another, working out some of the details each time, until some clear story emerges. Whether this involves stochastic processes, combinatorial optimization, or differential equations is secondary; it is the biology of the problem that drives its mathematical formulation. I am rarely motivated to research well-studied problems; instead I find myself confronting new problems in relatively unstudied areas; alignment was not a burning preoccupation with biologists or computer scientists when I started working on it, neither was genome rearrangement fifteen years later. I am quite pleased, though sometimes bemused, by the veritable tidal wave of computational biologists and bioinformaticians who have inundated the field where there were only a few isolated researchers thirty or even twenty years ago.



https://blog.sciencenet.cn/blog-565112-446013.html
