语料库翻译研究+认知空间分享 http://blog.sciencenet.cn/u/carldy 探索翻译研究新途径,反思语言认知研究

博文

Dice's coefficient & Jaccard index

已有 10873 次阅读 2012-10-10 10:15 |个人分类:身边的科学 Science around you|系统分类:科研笔记| Index, coefficient, Jaccard

The introduction comes from Wikipedia, the free encyclopedia.

 

Dice's coefficient, named after Lee Raymond Dice and also known as the Dice coefficient, is a similarity measure over sets:

s = .frac{2 | X .cap Y |}{| X | + | Y |}

It is identical to the Sørensen similarity index, and is occasionally referred to as the Sørensen-Dice coefficient. It is not very different in form from the Jaccard index but has some different properties.

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

d = 1 -  .frac{2 | X .cap Y |}{| X | + | Y |}

is not a proper distance metric as it does not possess the property of triangle inequality. The simplest counterexample of this is given by the three sets {a}, {b}, and {a,b}, the distance between the first two being 1, and the difference between the third and each of the others being one-third.

Similarly to Jaccard, the set operations can be expressed in terms of vector operations over binary vectors A and B:

s_v = .frac{2 | A .cdot B |}{| A |^2 + | B |^2}

which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general terms.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities :

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y using bigrams as follows:

s = .frac{2 n_t}{n_x + n_y}

where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate the similarity between:

night nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht} {na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.

Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

 J(A,B) = {{|A .cap B|}.over{|A .cup B|}}.

The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

 J_{.delta}(A,B) = 1 - J(A,B) = { { |A .cup B| - |A .cap B| } .over |A .cup B| }.

This distance is a proper metric.

Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:

M_{11} represents the total number of attributes where A and B both have a value of 1. M_{01} represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1. M_{10} represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0. M_{00} represents the total number of attributes where A and B both have a value of 0.

Each attribute must fall into one of these four categories, meaning that

M_{11} + M_{01} + M_{10} + M_{00} = n.

The Jaccard similarity coefficient, J, is given as

J = {M_{11} .over M_{01} + M_{10} + M_{11}}.

The Jaccard distance, J', is given as

J' = {M_{01} + M_{10} .over M_{01} + M_{10} + M_{11}}.

Tanimoto Similarity and Distance

Various forms of functions described as Tanimoto Similarity and Tanimoto Distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard Similarity and Jaccard Distance, but some are mathematically different. Many sources cite an unavailable IBM Technical Report as the seminal reference.

In "A Computer Program for Classifying Plants", published in October 1960, a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto Similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard Distance.

Tanimoto's Definitions of Similarity and Distance

In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set in either sample.

Presented in mathematical terms, if samples X and Y are bitmaps, X_i is the ith bit of X, and  .land , .lor are bitwise and, or operators respectively, then the similarity ratio T_s is

 T_s(X,Y) =  .frac{.sum_i ( X_i .land Y_i)}{.sum_i ( X_i .lor Y_i)}

If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard Coefficient of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.

Tanimoto goes on to define a distance coefficient based on this ratio, defined over values with non-zero similarity:

T_d(X,Y) = -{.log} _2 ( T_s(X,Y) )

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different to each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

Other Definitions of Tanimoto Distance

Tanimoto Distance is often referred to, erroneously, as a synonym for Jaccard Distance ( 1 - T_s). This function is a proper distance metric. "Tanimoto Distance" is often stated as being a proper distance metric, probably because of its confusion with Jaccard Distance.

If Jaccard or Tanimoto Similarity is expressed over a bit vector, then it can be written as


f(A,B) =.frac{ A .cdot B}{{.vert A.vert}^2 +{ .vert B.vert}^2 -  A .cdot B }

where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then A .cdot B = .sum_i A_iB_i = .sum_i ( A_i .land B_i) and {.vert A.vert}^2 = .sum_i A_i^2 = .sum_i A_i .

This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of  T_s do not necessarily extend to f. In particular, the difference function ( 1 - f) does not preserve triangle inequality, and is not therefore a proper distance metric, whereas ( 1 - T_s) is.

There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function (1 - f) is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.

Lipkus uses a definition of Tanimoto similarity which is equivalent to f, and refers to Tanimoto distance as the function  (1 - f) . It is however made clear within the paper that the context is restricted by the use of a (positive) weighting vector W such that, for any vector A being considered,  A_i .in .{0,W_i.} . Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.

 



https://blog.sciencenet.cn/blog-331736-621006.html

上一篇:双节感言:事业、亲情一线牵
下一篇:伪科学=Pseudo-science,False science,Bad Science??
收藏 IP: 161.64.143.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

数据加载中...
扫一扫,分享此博文

全部作者的精选博文

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-5-22 02:17

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部