http://oldspace.biovip.com/500/spacelist-blog-itemtypeid-650.html
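What is the difference between identity and similarity? I realized I was not very clear on these concepts myself, so I did some homework, summarized below.
My first reaction was to look them up in the BLAST glossary: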
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity
The extent to which nucleotide or protein sequences are related. The extent
of similarity between two sequences can be based on percent sequence identity
and/or conservation. In BLAST similarity refers to a positive matrix score.
Oddly, though, the BLAST output has no "similarity" field.
>sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (PAI-2)
(MONOCYTE ARG- SERPIN).
Length = 415
Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 38/89 (42%), Positives = 50/89 (56%)
Query: 1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ
+I +LL S D DT +VLVNA+YFKG WKT F + PF V
Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA
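Then I found the following sentence:
Identities correspond to exact matches and positives are similarities based
on the scoring matrix used. (from the BLAST tutorial)
So positives are evidently a kind of adjusted similarities. Put together, it becomes clear:
identities -> exact matches
positives -> similarities based on the matrix
When comparing nucleotide sequences, the four bases A, T, C and G are assumed to occur with equal probability. In the simplest scheme, any two identical bases score one point and any substitution scores zero (blastn actually applies a mismatch penalty, but either way only identical bases score above zero). This is a very simple substitution matrix, and with it identities and similarities (positives, in BLAST terms) are the same, because both are computed in the same way. When comparing protein sequences, the substitution matrix is BLOSUM: identical amino acids score high, similar amino acids score lower but still above zero, and dissimilar ones score zero or below. Here identities and positives are computed differently, so the two values differ.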
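To make the distinction concrete, here is a minimal sketch that counts identities and positives for the first six columns of the alignment shown above (QIKDLL against KIPNLL). The dictionary is only an excerpt of BLOSUM62 covering the pairs that occur in this example, and the function names are made up for illustration. Gaps are not handled.

```python
# Excerpt of BLOSUM62 covering only the residue pairs used below;
# a real run would load the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4,
    ("L", "L"): 4,
    ("K", "Q"): 1,
    ("D", "N"): 1,
    ("K", "P"): -1,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    key = tuple(sorted((a, b)))
    return BLOSUM62_EXCERPT[key]


def nucleotide_score(a, b):
    """The simple scheme from the text: match scores 1, mismatch scores 0."""
    return 1 if a == b else 0


def count_identities_and_positives(query, sbjct, score=pair_score):
    """Return (identities, positives, length) for two equal-length aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have equal length")
    # Identities: columns where both residues are the same letter.
    identities = sum(1 for a, b in zip(query, sbjct) if a == b)
    # Positives: columns whose substitution score is greater than zero.
    positives = sum(1 for a, b in zip(query, sbjct) if score(a, b) > 0)
    return identities, positives, len(query)


protein = count_identities_and_positives("QIKDLL", "KIPNLL")
dna = count_identities_and_positives("ATCGGA", "ATCGTA", score=nucleotide_score)
print(protein)  # (3, 5, 6)
print(dna)      # (5, 5, 6)
```

In the protein case the Q/K and D/N columns count as positives but not identities, matching the "+" marks in the BLAST midline. In the nucleotide case the two counts coincide, as argued above.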
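Statistical similarity and biological homology are different again. At this point I Googled homology and similarity, and there it was in big letters: Similarity is NOT equal to Homology. Someone made a dedicated web page just to stress that the two are not the same thing, so it is well worth keeping in mind.
(2010.10.1) I saw another comment and checked: the link to the "Similarity is NOT equal to Homology" page had gone dead, so I recovered it through the Wayback Machine and pasted it below.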
Similarity is NOT equal to Homology
IDENTITY - The extent to which two sequences are invariant.
SIMILARITY - The extent to which sequences are related. Similarity makes no statement about descent from a common ancestor. (Convergent versus Divergent evolution.)
HOMOLOGY - Sequence similarity that can be attributed to descent from a common ancestor.
There are Two Types of Homology
ORTHOLOGOUS - Homologous sequences in different species. These sequences usually retain the same function in the two species.
PARALOGOUS - Homologous sequences in the same species that arose by means of gene duplication. Divergence of function is more common between paralogues.
Why is this important?
Homology is a matter of opinion, not directly measurable or observable.
Similarity is a direct measurement and can be discussed in terms of percentages.
(See Reeck et al. Cell 50(5): 667 (1987))
In addition, on the difference between Score and bit score:
http://www.cs.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec05/node17.html
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST Score
BLAST scores rely on extensive theory. We start by making the following assumptions:
The BLAST score is scoring local ungapped alignments. The theory of scoring here is well understood.
The database sequences are assumed to be evolutionarily unrelated, i.e. independent of one another.
The alignment starts at specific positions along the query and the database record.
The score matrix must give, on average, a negative (a,b) score. Were this not the case, long alignments would tend to have a high score regardless of whether the aligned segments were related, and the statistical theory would break down.
When searching a query of length m in a database of total length n, one performs m*n random walk experiments, each with an exponentially decreasing probability of achieving a score S. Thus, the E-value for score s is $E = K m n e^{-\lambda s}$, where $\lambda$ and K are constants:
$\lambda$ - scaling factor
K - correction for dependency and bias of the scoring scheme.
Indeed, the E-value is normalized by the lengths of the query and the database: the same alignment would have a different E-value if these lengths were different. The E-value is also exponential in the score, so it is instructive to normalize it onto a logarithmic scale, called the bit score.
The bit score B is computed from the E-value E by $E = m n 2^{-B}$. Obviously, the bit score is linear in the raw score s: $B = (\lambda s - \ln K) / \ln 2$, where $\lambda$ and K are the constants defined above.
In contrast to raw scores, which have little meaning without K and $\lambda$, the bit score is measured in standard units (see e.g. [17]). Naturally, the meaning of the bit score depends on the sizes of the query and the database.
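The relation between raw score, bit score and E-value can be checked numerically. The sketch below assumes the standard Karlin-Altschul forms E = K m n e^(-λ s) and B = (λ s - ln K) / ln 2. The parameter values are hypothetical placeholders, not the ones BLAST used for the hit shown earlier, and the function names are made up for illustration.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """E-value from the raw score: E = K * m * n * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Bit score from the raw score: B = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E-value from the bit score: E = m * n * 2^(-B)."""
    return m * n * 2.0 ** (-bits)


# Hypothetical parameters, chosen only for illustration.
LAM = 0.3            # scaling factor lambda (assumed)
K = 0.1              # Karlin-Altschul K (assumed)
M, N = 250, 1_000_000  # query length and total database length (assumed)
S = 100              # raw alignment score (assumed)

bits = bit_score(S, LAM, K)
direct = e_value(S, M, N, LAM, K)
via_bits = e_value_from_bits(bits, M, N)
print(bits, direct, via_bits)  # the last two values agree up to rounding
```

Once the bit score is known, K and λ are no longer needed to get an E-value; only m and n are, which is the sense in which the bit score is in standard units.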
Again, as mentioned before, one can ask for the P-value (the probability of the observed number of records with a known E-value or lower).
Define the random variable $Y_E$ to be the observed number of pairs achieving E-value E or better (smaller).
$Y_E$ is Poisson distributed with mean E. The probability that $Y_E$ equals r is $P(Y_E = r) = e^{-E} E^r / r!$, and the probability that $Y_E$ equals 0 is the probability that no pair achieves E-value E or better, which is $e^{-E}$. Specifically, the chance of finding zero alignments with score >= S is $e^{-E}$, so the probability of finding at least one such alignment is $1 - e^{-E}$. This is the P-value associated with the score S (see e.g. [17]). Note that this model assumes an i.i.d. trial for each database position.
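The Poisson argument above is easy to reproduce. The following is a small sketch, with function names invented for illustration, that computes the Poisson probability of seeing r hits and the resulting P-value 1 - e^(-E). It uses `math.expm1` so that very small E-values do not round to zero.

```python
import math


def poisson_pmf(r, e):
    """Probability of exactly r hits when the expected number of hits is e."""
    return math.exp(-e) * e ** r / math.factorial(r)


def p_value(e):
    """Probability of at least one hit: 1 - exp(-e), computed stably."""
    return -math.expm1(-e)


for e in (1.8e-65, 0.01, 1.0, 10.0):
    print(e, p_value(e))
```

For tiny E the P-value is practically equal to E, which fits the sample hit earlier in this post, where Expect and Sum P(4) are both printed as 1.8e-65. For large E the P-value saturates near 1 while E keeps growing.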