moyebonnafide's personal blog http://blog.sciencenet.cn/u/moyebonnafide

My Research Journey

2006 views | 2017-2-22 17:06 | Category: Research Notes

Changxuan Mao


The availability of large amounts of electronically stored data and rapidly growing computing resources, particularly distributed systems, encourages statisticians to use sophisticated statistical models and to infer both Euclidean and infinite-dimensional parameters in these models. Consequently, nonparametric models have flooded mainstream statistical journals for decades. From the perspective of a statistician, the present big data era clearly provides enormous opportunities for both academia and the public to recognize and comprehend the prominence of nonparametric models.


Sixteen years ago, as a PhD candidate, I was interested in genomic data generated by high-throughput sequencing technology, which fitted in well with the foremost developments in nonparametric statistics. I worked on a multidisciplinary project under the supervision of Claude dePamphilis (biologist), Webb Miller (computer scientist) and Bruce Lindsay (statistician). One issue was to estimate the numbers of expressed genes in plant tissues under different experimental conditions from expressed sequence tag (EST) data. When one treats an expressed gene as a “species” and an EST as an “individual”, this becomes a genomic instance of the well-known species problem: estimating the number of species in an assemblage from a sample of individuals in which some species are missed. The problem has established its importance through its wide range of applications, such as the vocabulary of Shakespeare (Bradley Efron and Ronald Thisted), executions in South Vietnam (Peter Bickel and Joseph Yahav), and families receiving unemployment benefits (Frederick Mosteller; Leo Goodman).

Later on, it turned out that the species problem is a quagmire, much more complicated than its seemingly simple description suggests. I will describe my experiences with the species problem, its sibling (the population size problem) and their affiliated issues.

As genes are differentially expressed, their expression levels can be assumed to arise as a latent random sample from an unknown mixing distribution. Because abundances vary over species, mixture models have been proposed ever since Ronald Fisher and his collaborators published an article about Malayan butterflies in 1943. A plethora of statistical tools has been developed in the literature on the species problem: the jackknife estimator, bootstrap estimator, Horvitz-Thompson estimator, coverage-based estimator, martingale-based estimator, nonparametric maximum likelihood estimator, and so on.


However, we were stunned by the dramatic differences among the estimates for the number of species after we applied them to a genomic dataset about tomato flowers. No criteria were available for us to tell which estimate is deserving of our trust. We began to understand why John Bunge called the species problem a Gordian knot in 1993.


Interestingly, Bernard Harris claimed in 1959 that it is infeasible to estimate the number of species, and Irving Good made a similar claim in 1993. Although their claim was seldom discussed by researchers, we were convinced by the intuitive arguments supporting it, and we decided to prove it rigorously. This was indeed an ambitious task, as several celebrated statisticians had worked on the species problem without obtaining significant results.


As the first step towards cutting the Gordian knot, our task seemed like a mission impossible. To fix ideas, we started our journey by reweighting the mixing distribution, because we must rely on the observed data. Although reweighting is a familiar technique for dealing with biased samples, it had been ignored here because the complete data are never treated as a biased sample. The reweighted mixing distribution can be inferred from the observed data within a usual nonparametric mixture model.
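To make the reweighting concrete, here is a sketch for the Poisson mixture case. The notation ($Q$, $P$, $p_0$, $f$) is introduced only for this illustration and is an assumption, not taken from a specific paper.

```latex
\begin{align*}
% Q: mixing distribution of the Poisson rate \lambda over all species, seen or unseen
% p_0: probability that a species is missed by the sample
p_0 &= \int e^{-\lambda}\, dQ(\lambda), \\
% P: reweighted mixing distribution, Q tilted towards the species that can be seen
dP(\lambda) &= \frac{1 - e^{-\lambda}}{1 - p_0}\, dQ(\lambda), \\
% f: density of the count of an observed species
f(x) &= \frac{1}{1 - p_0}\int \frac{e^{-\lambda}\lambda^{x}}{x!}\, dQ(\lambda)
      = \int \frac{e^{-\lambda}\lambda^{x}}{x!\,(1 - e^{-\lambda})}\, dP(\lambda),
      \qquad x = 1, 2, \dots
\end{align*}
```

The last expression is a mixture of zero-truncated Poisson densities with mixing distribution $P$, so $P$, unlike $Q$, is tied directly to what is observed.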


The number of species can be easily estimated if the probability of a species being undetected is known or estimated. We replace the probability with its odds, because the odds is clearly a linear functional of the reweighted mixing distribution. The linearity brings us a lot of convenience. For example, when data are abundance-based and the nonparametric Poisson mixture model is adopted, we have demonstrated that the odds is an identifiable but discontinuous functional (with an infinite modulus of continuity) over the space of mixtures of zero-truncated Poisson densities endowed with the L1 distance. The discontinuity of the odds establishes the infeasibility claim.
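As a minimal sketch of how the odds turns into an estimate of the number of species, consider the simplest possible case: every species shares a single Poisson rate λ. This homogeneity is a strong simplifying assumption made only for illustration; it is not the nonparametric model discussed above. In that case the probability of being missed is exp(-λ), the odds is 1/(exp(λ) - 1), and the estimate is the observed number of species times (1 + odds). The function names are hypothetical.

```python
import math


def fit_truncated_poisson_rate(mean_count):
    """Solve lam / (1 - exp(-lam)) = mean_count for lam by bisection.

    This is the maximum likelihood equation for a single zero-truncated
    Poisson; the mean of the observed counts must exceed 1.
    """
    if mean_count <= 1.0:
        raise ValueError("mean of zero-truncated counts must exceed 1")
    # The left-hand side is increasing in lam and always larger than lam,
    # so the root lies strictly between 0 and mean_count.
    lo, hi = 0.0, mean_count
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < mean_count:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def odds_and_richness(counts):
    """counts: abundances of the observed species (each at least 1)."""
    s_obs = len(counts)
    lam = fit_truncated_poisson_rate(sum(counts) / s_obs)
    odds = 1.0 / math.expm1(lam)       # p0 / (1 - p0) with p0 = exp(-lam)
    return odds, s_obs * (1.0 + odds)  # Horvitz-Thompson-type estimate
```

With heterogeneous rates the single λ is replaced by an integral over the reweighted mixing distribution, which is where the discontinuity enters.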


To summarize, we have shown that the species problem is “singular” in the sense of Richard Liu and Lawrence Brown, similar to estimating the number of modes of a continuous density (David Donoho). Imagine a wiggly density dancing snugly around a slowly varying density: count their modes and calculate their Hellinger distance. The distance can be tiny while the mode counts differ greatly. While it is customary for statisticians to present their discoveries starting from something like “under the regularity conditions”, there should be a paradigm shift in the study of the species problem. Specifically, the singularity of the odds implies that one can neither blindly use ordinary tools nor dream of the existence of magic wands. Regrettably, to some colleagues working in this field, the paradigm shift is shocking and unacceptable.


What we should do is redefine our objective. To this end, we find that the odds functional is lower semi-continuous (i.e., it has a closed epigraph) and admits lower bounds. We propose two approaches to the construction of lower bounds: an algebraic approach, in which a moment sequence and the corresponding Hankel matrices are constructed from a mixture density and some deep results on the Stieltjes moment problem are applied to build quadratic forms; and a geometric approach, in which an extended partial density curve (EPD-curve, conceptually similar to the well-known moment curve) is introduced and an optimization problem is defined over the convex hull of the EPD-curve. The discretized version of the optimization problem can be easily solved by linear programming.
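The simplest instance of the algebraic idea can be sketched as follows, assuming the Poisson mixture setting. Write f(x) for the zero-truncated mixture density and θ for the odds. Under my reading of the setup, θ, f(1) and 2f(2) are the moments of order 0, 1 and 2 of one positive measure, so the 2×2 Hankel matrix they form must be positive semi-definite, which gives θ ≥ f(1)² / (2 f(2)). Plugging in observed frequencies yields the familiar lower bound usually attributed to Anne Chao. The general construction with larger Hankel matrices is not reproduced here, and the function name is hypothetical.

```python
from collections import Counter


def hankel2_lower_bound(counts):
    """Order-2 moment lower bound for the odds and the number of species.

    counts: abundances of the observed species (each at least 1).
    Returns (lower bound for the odds, lower bound for the richness).
    """
    freq = Counter(counts)
    s_obs = len(counts)
    f1, f2 = freq[1], freq[2]  # numbers of singletons and doubletons
    if f2 == 0:
        raise ValueError("the bound needs at least one doubleton")
    dens1 = f1 / s_obs  # empirical f(1)
    dens2 = f2 / s_obs  # empirical f(2)
    odds_lb = dens1 ** 2 / (2.0 * dens2)
    return odds_lb, s_obs * (1.0 + odds_lb)
```

The richness bound simplifies to the observed number of species plus f1² / (2 f2), so it only ever adds to what was seen.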


A sibling of the species problem is estimating the size of a population based on data arising from capture-recapture, removal sampling, catch-effort studies and other experiments. All can be understood as modified and upgraded versions of the textbook example “how many fish are in a pond”, tracing back to C. G. Johannes Petersen’s work in 1896. A population can refer to an animal population (e.g., fish in a pond, tigers in a forest), a human population (e.g., cancer patients, drug addicts), or a population of abstract objects (e.g., typos in an article, bugs in a software application). Various nonparametric mixture models can be used: Poisson, binomial, geometric, binned exponential, product binomial, and so on. The techniques that we have developed within the nonparametric Poisson mixture model can be applied and adjusted.
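For readers who have not seen the pond example, here is a minimal sketch of the classical two-sample estimator commonly associated with Petersen's name (often called the Lincoln-Petersen estimator). It assumes a closed population and equal catchability, which are exactly the assumptions that mixture models relax. The function and argument names are hypothetical.

```python
def petersen_estimate(marked_first, caught_second, recaptured):
    """Two-sample capture-recapture estimate of the population size.

    marked_first:  fish caught, marked and released on the first occasion
    caught_second: fish caught on the second occasion
    recaptured:    fish in the second catch that carry a mark
    """
    if recaptured <= 0:
        raise ValueError("at least one recapture is needed")
    # Marked fraction in the second catch is taken to equal the marked
    # fraction in the pond: recaptured / caught_second = marked_first / N.
    return marked_first * caught_second / recaptured


pond_size = petersen_estimate(marked_first=100, caught_second=80, recaptured=20)
```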


A zero-truncated Poisson density has infinitely many support points, which is necessary for the odds to be identifiable. In a nonparametric mixture model whose component densities have finitely many support points (e.g., upper-truncated geometric), the challenge arises from the fact that the odds is non-identifiable, in the sense that multiple values of the odds can be associated with the same mixture density. The non-identifiability results from a non-empty interior of the convex hull of an extended density curve, which means that it is infeasible to estimate the population size.
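A small numerical illustration of this non-identifiability, using a binomial mixture with two trials as a stand-in for a component with finitely many support points (my choice for the example, not one singled out above): two different mixing distributions produce exactly the same density for the observed counts, yet different odds. Names are hypothetical.

```python
from fractions import Fraction as F


def truncated_density_and_odds(mixing):
    """mixing: list of (p, weight) pairs for a Binomial(2, p) mixture.

    Returns the zero-truncated density (f(1), f(2)) and the odds of a unit
    being missed, computed exactly with rational arithmetic.
    """
    p0 = sum(w * (1 - p) ** 2 for p, w in mixing)
    p1 = sum(w * 2 * p * (1 - p) for p, w in mixing)
    p2 = sum(w * p ** 2 for p, w in mixing)
    seen = 1 - p0
    return (p1 / seen, p2 / seen), p0 / seen


# A single capture probability of 1/2 ...
mixing_a = [(F(1, 2), F(1))]
# ... versus a two-point mixture on 1/4 and 3/4.
mixing_b = [(F(1, 4), F(3, 4)), (F(3, 4), F(1, 4))]

density_a, odds_a = truncated_density_and_odds(mixing_a)
density_b, odds_b = truncated_density_and_odds(mixing_b)
```

Both mixtures give the observed-count density (2/3, 1/3), so no amount of data on detected units can tell them apart, while the odds is 1/3 under the first and 7/9 under the second.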


To cut the Gordian knot, we notice that both the algebraic and geometric approaches yield lower bound functionals that approximate the odds functional. In fact, all pre-existing estimators can be shown to have asymptotic limits that are continuous functionals of mixture densities and approximate the discontinuous or non-identifiable odds. Certainly, the quality of each approximating functional demands a thorough investigation. This can be trivial for some estimators and challenging for others. For example, we consider hierarchical log-linear models, which have become almost the required approach when epidemiologists deal with data from multiple lists and wish to estimate the number of individuals missed by all lists. After establishing the connection between a hierarchical log-linear model and a monotone Boolean function, we find that the bias of the estimator for the odds can be enormous, because it is an exponential function of the highest-order interaction, which is necessarily assumed to be zero.
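The exponential dependence on the assumed-zero interaction is easiest to see with only two lists, which is the special case sketched below; the statement above concerns the general multi-list model, and this sketch does not cover it. With two lists, setting the interaction to zero means assuming independence, and the missed cell is estimated by n10 · n01 / n11. If the true log odds ratio between the lists is γ rather than 0, that estimate is off by a factor exp(-γ). Function names and the numbers are hypothetical.

```python
import math


def missed_cell_under_independence(n11, n10, n01):
    """Count missed by both lists when the two-list interaction is set to zero."""
    return n10 * n01 / n11


def missed_cell_with_interaction(m11, m10, m01, log_odds_ratio):
    """Expected missed count when the lists have the given log odds ratio.

    Uses the definition log_odds_ratio = log(m11 * m00 / (m10 * m01)).
    """
    return math.exp(log_odds_ratio) * m10 * m01 / m11


gamma = math.log(2.0)  # positive dependence between the two lists
assumed = missed_cell_under_independence(50, 100, 80)
actual = missed_cell_with_interaction(50, 100, 80, gamma)
bias_factor = assumed / actual  # equals exp(-gamma)
```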


Furthermore, one can construct statistical functionals and estimate them for the purpose of addressing affiliated issues. For example, the coverage (i.e., the total probability of all observed species) is a rational function of density values and a functional of the Poisson mixture density. For another example, we have introduced a species accumulation curve for incidence-based data. The ordinate at each abscissa is a functional of the binomial mixture density. The estimated curve was named Mao’s Tau by Robert Colwell (ecologist) and became popular immediately, because ecologists had been waiting for such a curve since 1923.
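Both functionals can be sketched in a few lines. For the coverage, the sketch uses the classical Good-Turing plug-in, one minus the number of singletons divided by the number of individuals, which corresponds to 1 − f(1) / Σ x f(x) in terms of the zero-truncated density. For the accumulation curve, it computes the expected number of species in h samples drawn without replacement from the H samples collected, which is my understanding of the form the estimator takes; the original publications should be consulted for the definitive formula and its variance. Function names are hypothetical.

```python
from collections import Counter
from math import comb


def coverage_estimate(counts):
    """Good-Turing estimate of sample coverage from species abundances."""
    n = sum(counts)
    f1 = sum(1 for c in counts if c == 1)  # number of singletons
    return 1.0 - f1 / n


def accumulation_curve(incidence_counts, n_samples):
    """Expected species count in h of n_samples samples, h = 1..n_samples.

    incidence_counts[i] is the number of samples that contain species i.
    """
    s_obs = len(incidence_counts)
    s = Counter(incidence_counts)  # s[j] = species found in exactly j samples
    curve = []
    for h in range(1, n_samples + 1):
        # A species found in j samples is absent from a random subset of h
        # samples with probability C(H - j, h) / C(H, h).
        missed = sum(sj * comb(n_samples - j, h) for j, sj in s.items())
        curve.append(s_obs - missed / comb(n_samples, h))
    return curve
```

By construction the curve ends at the observed number of species when h equals the number of samples.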


Finally, although the species problem, the population size problem and their affiliated issues have consumed a large proportion of my time, I have also investigated other theoretical statistical problems, and collaborated with scientists and engineers in many disciplines, including ecology, agriculture, bioinformatics, public health, environmental science, education, social security, and so on. In addition, I worked at AT&T Labs Research as a Principal Member of Technical Staff, processing AT&T internal data and external data from Dun & Bradstreet. Such a period of industrial experience is precious to a scholar in academia.




https://blog.sciencenet.cn/blog-3282187-1035331.html

