# A Guide to Statistics: t-score and mutual information

Here enclosed the instruction for Collins Wordbanks Online.

http://wordbanks.harpercollins.co.uk/Docs/Help/statistics.html

It'll remind me of the usefulness of t-score and MI score in corpus linguistics.

A Guide to Statistics: t-score and mutual information

There are two ways of measuring statistical significance in Collins' corpora: Mutual Information (MI) and t-score.

The Mutual Information score expresses the extent to which observed frequency of co-occurrence differs from what we would expect (statistically speaking). In statistically pure terms this is a measure of the strength of association between words x and y. In a given finite corpus MI is calculated on the basis of the number of times you observed the pair together versus the number of times you saw the pair separately.

MI does not work well with very low frequencies - the t-score provides a way of getting away from this problem as it also take frequencies into account. The t-score is a measure not of the strength of association but the confidence with which we can assert that there is an association. MI is more likely to give high scores to totally fixed phrases whereas t-score will yield significant collocates that occur relatively frequently. In most cases, t-score is the most reliable measurement.

Consider the example: "sour and puss". In the WordbanksOnline corpus, the word form "sour" occurs 4109 times and there are 254 hits for "puss". Here's a collocate list for the word form "puss":

As you can see "sour" only co-occur 3 times, this gives this particular collocation a very high MI score: i.e. these two words will be very strongly associated. However, the t-score says "maybe, but we haven't seen enough evidence to be sure that the MI is right!". As you can see, the t-score is relatively low: 1.73

Now consider a second example, "falling prices". In the WordbanksOnline corpus, each of these two words has a medium to high frequency: f("falling") = 23,209 and f("prices") = 66,352. We see the pair together 827 times as can be seen in the following screenshot.

The MI figure is not particularly high (8.415) because there is plenty of evidence of "falling" occurring without "prices" and vice versa. We are nowhere near 100% certainty that one will be accompanied by the other, so statistically the strength of association between "falling" and "prices" is much less than it was for "sour" and "puss". The t-score however is quite high at 28.673 compared to the previous example - it has taken into account the actual number of observations of the pairing to assess whether we can be confident in claiming an association.

In summary, MI will highlight any pair for which the frequency of cooccurrence is a high proportion of the overall frequency of either of the pair. Proper names, technical terms, idiosyncratic phrases and the like are well highlighted by MI (e.g. "post mortem", "post-menopausal", "post-grad"). "Yasser Arafat" will score very highly with MI because the chances of seeing either word without the other are very small. Similarly, locutions particular to some text, like, say, "beet residues", may come up with high MI scores. Why? Because neither "beet" nor "residues" will be very frequent overall in the corpus and it only needs one text on agriculture to use the phrase 4 times and bingo you have evidence for a strong association.

The t-score promotes pairings which have been well attested (i.e. there have been a decent number of cooccurrences). This is good for more grammatically conditioned pairs, like "depend on", or for stereotyped combinations which are not confined to particular subject fields or texts. Other t-score favourites would be "take stock" or "bad taste". In these cases the strength of association need not be large. But the confidence that there is *some* association is pretty high, because in a large corpus you may see "bad taste" 70 times. The t-score has a tendency to promote what you might consider to be uninteresting pairings on the basis of their high frequency of cooccurrence.

Raw frequency which corresponds to the first column in the images, often picks out the obvious collocates ("post office", "side effect") but you have no way of distinguishing these objectively from frequent non-collocates.

https://blog.sciencenet.cn/blog-331736-643223.html

## 全部精选博文导读

GMT+8, 2023-3-29 04:34