Big Brother is watching you (Pattern Recognition in everyday life)!
In this electronics-centric and globalized world, very little goes on in our lives that is not known to someone else or collected in some database. I wrote about this earlier in a blog article, “On Privacy and the Digital Revolution” (http://www.sciencenet.cn/m/user_content.aspx?id=35816). But many of us are still comforted by the thought that no one will take the trouble to analyze the zillions of data items collected every day. So long as we live a normal life and are not engaged in crime or terrorism, no one will care. And even if we are, the chances of detection are small. But the science of pattern recognition makes this belief a somewhat chancy proposition.
So what is Pattern Recognition (PR)? And how does it work?
Intuitively and informally, we all have an idea of how PR works. Scientific discovery (e.g. Darwin’s theory of evolution and natural selection) happens when someone observes some phenomena, thinks about them, and derives a concise explanation of them. We form opinions about human behavior by seeing what people do. Military intelligence depends on piecing together seemingly disparate data and observations. The theory of PR is simply a codification of these intuitive and informal ideas into precisely defined rules and algorithms, so that the vast power of computers can be used in place of human effort. I shall briefly explain below, as popular science, what I know about this codification process. In the process, perhaps we will also gain a glimpse of how science and technology are increasingly taking over human endeavors.
Roughly speaking, there are four ingredients and steps in PR – feature selection, training for classification, generalization, and syntax analysis.
1. Feature Selection – In principle, once you have digitized whatever you want to pattern-recognize, you can simply process the resultant bits. But this is highly inefficient or impossible. For example, digital data about a human face can involve many millions of bits. What you need to do first is to abstract from these raw bits some “features” that can be used to characterize the data, such as the color of the skin and the shape of the eyes and eyelids. These two features might be useful in recognizing whether a face belongs to an Asian or a non-Asian person. How to select the correct and minimal set of features is an important problem of PR.
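To make the idea of a “feature” concrete, here is a minimal sketch in Python. The two features it computes (average brightness and the fraction of dark pixels) and the function name `extract_features` are my own illustrative assumptions, not features a real face recognizer would necessarily use. The point is only that a whole grid of raw pixel values is boiled down to a couple of numbers.

```python
def extract_features(image, dark_threshold=128):
    """Reduce a grayscale image (rows of 0-255 values) to two numbers.

    Both features are hypothetical stand-ins chosen for illustration.
    """
    pixels = [value for row in image for value in row]
    mean_brightness = sum(pixels) / len(pixels)
    dark_fraction = sum(1 for v in pixels if v < dark_threshold) / len(pixels)
    return mean_brightness, dark_fraction


# A toy 2x2 "image": four raw values go in, two feature values come out.
tiny_image = [[0, 100],
              [100, 200]]
print(extract_features(tiny_image))  # (100.0, 0.75)
```

With a real photograph the input would be millions of values, while the output would still be just two numbers that can be plotted in two dimensions.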
2. Training for classification – Again in principle, any PR problem can be reduced to a series of yes-no classification problems (remember the “20 questions” game we often played as children?). Let us then consider the classification problem of differentiating between an Asian face and a non-Asian face. Assume we use the two features – color of the skin and shape of the eyes (a simplification for explanation purposes only) – to represent a face, and that we have a large collection of samples of Asian and non-Asian faces. Suppose further that these two features can be converted to numerical scales. We then randomly choose half of this set of sample faces and plot the data in two dimensions. A typical situation like that of Fig. 1 (see below) may result, where squares denote samples of non-Asian faces and circles denote samples of Asian faces.
To implement a classifier means to devise a two-dimensional curve that will separate the squares from the circles. The simplest curve is a straight line, which can be characterized by two parameters. Using the samples, we can devise a successive approximation scheme (called training) that gradually adjusts the values of the parameters of the line so that it separates the two groups of data with minimal errors, as shown in Fig. 2 (see below).
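For readers who like to see such a scheme spelled out, here is a minimal sketch in Python. I have picked one classical successive approximation scheme, the perceptron rule, purely as an illustration; it is not the only way to train a line. The data are synthetic, the function names are my own, and the two groups are deliberately kept well apart so that a perfect separating line exists. The line is written as w1*x + w2*y + b = 0, which uses three numbers, but only their ratios matter, so it is still the two-parameter line described above.

```python
import random


def make_samples(n, seed=0):
    """Synthetic samples with two numerical features.

    Label +1 points fall in the square [1, 3] x [1, 3] and label -1 points
    in [-3, -1] x [-3, -1], so the two groups are linearly separable.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        x = 2.0 * label + rng.uniform(-1.0, 1.0)
        y = 2.0 * label + rng.uniform(-1.0, 1.0)
        samples.append(((x, y), label))
    return samples


def train_line(samples, epochs=100, rate=0.1):
    """Perceptron rule: nudge the line whenever a sample is on the wrong side."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in samples:
            if label * (w1 * x + w2 * y + b) <= 0:  # misclassified (or on the line)
                w1 += rate * label * x
                w2 += rate * label * y
                b += rate * label
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return w1, w2, b


def classify(params, point):
    w1, w2, b = params
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1


def error_count(params, samples):
    return sum(1 for point, label in samples if classify(params, point) != label)


samples = make_samples(200)
training_half = samples[:100]  # samples are already in random order
line = train_line(training_half)
print("errors on the training half:", error_count(line, training_half))
```

Real data are rarely this tidy. When the two groups overlap, no line separates them perfectly, and training can only aim for few errors rather than none.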
Of course, at this point astute readers may ask why we do not use a more complex curve so that an even better separation can result, as in Fig. 3 (see below).
Good question! The answer lies in the issue of
3. Generalization – The quality or goodness of a classifier must be tested on new samples not previously used for training. Thus, once we have designed a classifier (i.e., in our case, fixed the values of the two parameters characterizing the separating line in Fig. 2), we need to test it on the remaining half of the sample data to see whether it performs as well as it did on the training data, i.e., whether it can “generalize” its classifying ability to new data. This is a well-known problem in the statistics literature: how well does a sample property predict the true property? For example, if the sample used to calculate a property is small, then the predictive quality of the property thus calculated will be poor, i.e., it cannot generalize. Thus, when we use a more complex curve to classify the samples, the number of parameters characterizing the curve must be “small” relative to the sample size. If we have to add three more parameters to the curve in order to eliminate classification errors on, say, two samples, this is probably not a good idea. But if doing so eliminates three hundred errors, then we should do it.
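Here is a minimal Python sketch of this test, under assumptions that are entirely mine. The data are synthetic, with about 20% of the labels deliberately flipped to imitate noise. A fixed straight line stands in for a trained two-parameter classifier, and a “memorizer” that simply copies the label of the nearest training sample stands in for an extremely complex curve that bends around every training point.

```python
import random


def make_noisy_samples(n, flip=0.2, seed=1):
    """True label is the sign of x + y; a fraction `flip` of labels is wrong."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1 if x + y > 0 else -1
        if rng.random() < flip:
            label = -label  # simulated noise
        samples.append(((x, y), label))
    return samples


def split_half(samples, seed=2):
    """Randomly split the samples into a training half and a held-out half."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def simple_rule(point):
    """Stand-in for a simple classifier: the straight line x + y = 0."""
    return 1 if point[0] + point[1] > 0 else -1


def make_memorizer(train):
    """Stand-in for an overly complex curve: copy the nearest training label."""
    def rule(point):
        nearest = min(
            train,
            key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
        )
        return nearest[1]
    return rule


def error_rate(rule, samples):
    return sum(1 for point, label in samples if rule(point) != label) / len(samples)


train, test = split_half(make_noisy_samples(400))
memorizer = make_memorizer(train)
print("memorizer: train", error_rate(memorizer, train), "test", error_rate(memorizer, test))
print("line:      train", error_rate(simple_rule, train), "test", error_rate(simple_rule, test))
```

The memorizer makes no errors on the training half, because every training sample is its own nearest neighbor. On noisy data of this kind one would typically expect its error on the held-out half to be much higher, while the simple line performs about the same on both halves. That gap between training and held-out performance is what a failure to generalize looks like.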
4. Syntax Analysis – The phrase “the context makes it clear” denotes the fact that the meaning of words often depends on how they are used, e.g., on neighboring words in a sentence. Thus, in classifying “faces” we may want to look at the surrounding scenes in which we find the “face”. If we find many Asian “features” such as Chinese words and signs, then the probability of classifying the face as Asian should probably be increased. Put another way, we can escalate PR to a higher level and practice the art of “feature selection-training-generalization” on increasing amounts of data. This is of course a never-ending problem, which is why general language translation and artificial intelligence problems remain unsolved. However, in the more limited and prescribed situation of PR for a particular purpose, much can be and has been done.
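One standard way to express “the context should raise the probability” is Bayes’ rule. The sketch below is only an illustration of that idea; the function name and all the numbers are hypothetical. It takes the classifier’s own estimate that a sample belongs to a class, plus how often a given context clue appears with and without that class, and returns the updated estimate.

```python
def update_with_context(prior, p_clue_if_in_class, p_clue_if_not):
    """Bayes' rule: revise a class probability after seeing a context clue.

    prior              -- probability of the class before looking at context
    p_clue_if_in_class -- chance of seeing the clue when the class is correct
    p_clue_if_not      -- chance of seeing the clue when it is not
    """
    evidence = prior * p_clue_if_in_class + (1 - prior) * p_clue_if_not
    return prior * p_clue_if_in_class / evidence


# Hypothetical numbers: the classifier alone is undecided (0.5), and the clue
# is four times as likely to appear when the class is correct (0.8 vs 0.2).
posterior = update_with_context(0.5, 0.8, 0.2)
print(round(posterior, 3))  # 0.8
```

A clue that is equally likely either way leaves the probability unchanged, which matches the intuition that only informative context should change our mind.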
Examples of PR being used in our daily lives are numerous in the US. From our supermarket shopping, companies know a great deal about the foods we eat and can target advertisements to families individually. Amazon.com does the same with the books and other things you purchase. Magazines can print individualized ads in the copy we receive, since they know our likes and dislikes from similar databases. I have no doubt that all e-mail traffic between China and the US is monitored by both sides for intelligence purposes. Moreover, with ever-increasing computing power and ever more electronic data being collected, who is to say that spare computing capacity won’t be put to use to discover whatever patterns Big Brother wants to know? Your Google searches and cell phone records are all available to the government. There is no privacy anymore.
Note added 7/15/09: An item in today’s Campus Technology newsletter reported the following example of PR:
Carnegie Mellon Researchers Find Social Security Numbers Can Be Predicted
Carnegie Mellon University researchers have shown that public information gleaned from governmental sources, commercial databases, and online social networks can be used to routinely predict most--and sometimes all--of an individual's nine-digit Social Security number. http://www.1105newsletters.com/t.do?id=2951907:780897