Personal blog of xwu510: http://blog.sciencenet.cn/u/xwu510


Stop Name-calling and Distortions in AI Ethics Discussions

Posted 2017-5-13 16:27 | Category: Opinion

In November 2016 we submitted to arXiv our paper “Automated Inference on Criminality Using Face Images”. It generated a great deal of discussion on the Internet and in the press. Recently, Arcas et al. published their article “Physiognomy’s New Clothes” on Medium. Although we agree with the authors on the importance of policing AI research for the general good of society, we find the way they misrepresented our work, in particular the motive and objective of our research, deeply disturbing.


Name-calling


In their article, Arcas et al. insinuate that we had an ill, racist motive. This insinuation is so unmistakable that we were immediately subjected to brutal verbal abuse on the Internet, and particularly in China. Nowhere in our paper did we advocate the use of our method as a tool of law enforcement, nor did our discussion move from correlation to causality. It should be abundantly clear, to anyone who reads our paper with a neutral mindset, that our only motive is to find out whether machine learning has the potential to acquire humanlike social perceptions of faces, despite the complexity and subtlety of such perceptions, which depend on both the observed and the observer. Our inquiry is to push the envelope and extend research on automated face recognition from the biometric dimension (e.g., determining race, gender, age, facial expression, etc.) to the sociopsychological dimension. We are merely interested in the distinct possibility of teaching machines to pass the Turing test on the task of duplicating human first impressions (e.g., personality traits, mannerism, demeanor, etc.) of a stranger. The face perception of criminality was chosen expediently (unfortunately for us, in hindsight) as an easy (by our intuition) test case, as we explained in our paper:


“For validating the hypothesis on the correlations between the innate traits and social behaviors of a person and the physical characteristics of that person’s face, it would be hard pushed to find a more convincing experiment than examining the success rates of discriminating between criminals and non-criminals with modern automatic classifiers. These two populations should be among the easiest to differentiate, if social attributes and facial features are correlated, because being a criminal requires a host of abnormal (outlier) personal traits. If the classification rate turns out low, then the validity of face-induced social inference can be safely negated.”


But shockingly, the Google authors interwove the above passage with some of our honest observations and forced the following fancifully distorted meaning upon us:


"Those with more curved upper lips and eyes closer together are of a lower social order, prone to (as Wu and Zhang put it) “a host of abnormal (outlier) personal traits” ultimately leading to a legal diagnosis of “criminality” with high probability."


We agree that the pungent word criminality should be put in quotation marks. We should have issued a caveat about possible biases in the input data; taking a court conviction at face value, i.e., as the “ground truth” for machine learning, was indeed a serious oversight on our part. However, throughout our paper we maintain a sober neutrality about whatever we might find; in the introduction, we declare:


"In this paper we intend not to nor are we qualified to discuss or debate on societal stereotypes, rather we want to satisfy our curiosity in the accuracy of fully automated inference on criminality. At the onset of this study our gut feeling is that modern tools of machine learning and computer vision will refute the validity of physiognomy, although the outcomes turn out otherwise."


We said, loud and clear, that we do not and are not qualified to interpret, yet the Google authors still interpreted our work copiously for us. This is not the kind of academic exchange we are used to. We have now come to regret our choice of the term “physiognomy”, the closest English translation of the Chinese term “面相学”. We were not sensitive enough to the inherently tainted connotation of the word in the English-speaking world; does merely using the term deserve the label of scientific racism?


Base Rate Fallacy


While the Google authors “are writing for a wide audience: not only for researchers …”, they conveniently overlooked the clear symptom of the base rate fallacy exhibited by nontechnical commentators in Internet blogs and some media coverage. Many reports and comments on our research overemphasize the high success rates (still in need of more rigorous validation) of our classifiers; they leap from these numbers to the grave danger of AI in general, and of our methods in particular. The base rate fallacy refers to the following pattern of invalid reasoning: the mind tends to focus on a high specific probability (the 89% classification rate in our case) out of the context of a very low background probability (the roughly 0.3% crime rate in China).


To illustrate the point to the general public, we feel compelled to show how Bayesian statistical inference works, although this is rudimentary knowledge in the research community. If Xiaolin Wu is tested positive (just for fun) by our “criminality” classifier, his probability of committing a crime is


P(C|+) = P(+|C)*P(C)/[ P(+|C)*P(C)+P(+|N)*(1-P(C))]


where P(+|C) = 0.89 is the probability that a convicted Chinese adult male is tested positive by our CNN face classifier, P(C) = 0.0036 is the crime rate in China, and P(+|N) = 0.07 is the probability that a non-criminal Chinese adult male is tested positive. Plugging these numbers into the Bayes formula, Wu is found to have a probability of only 3.68% of breaking the law, despite testing positive. Hopefully, this mathematical journey from 89% to 3.68% will put many of our critics at ease. Here we stress again our strong opposition to any practical use of our methods, not least because their accuracy falls far below any meaningful standard.
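
For readers who wish to check the arithmetic, here is a minimal Python sketch of the Bayes computation above, plugging in the figures quoted in this post (an 89% true positive rate, a 7% false positive rate, and the roughly 0.3% crime rate cited earlier as the base rate); it reproduces a posterior of about 3.68%.

```python
# Minimal sketch of the Bayes computation described above.
# Figures are those quoted in this post; the base rate is the ~0.3%
# crime rate cited earlier (used here for illustration only).

p_pos_given_c = 0.89   # P(+|C): convicted adult male tests positive
p_pos_given_n = 0.07   # P(+|N): non-criminal adult male tests positive
p_c = 0.003            # P(C): background crime rate (base rate)

# Bayes' rule: P(C|+) = P(+|C)P(C) / [P(+|C)P(C) + P(+|N)(1 - P(C))]
posterior = (p_pos_given_c * p_c) / (
    p_pos_given_c * p_c + p_pos_given_n * (1.0 - p_c)
)

print(f"P(C|+) = {posterior:.4f}")  # about 0.0368, i.e. roughly 3.68%
```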


The base rate fallacy is an old trick used by the media to sensationalize or exaggerate either the virtue or the vice of new technological and scientific advances (unfamiliar or mysterious to the general public). It can easily be manipulated to instill irrational fears about AI research in ordinary people.



Garbage in?


As offended as we are by the intellectually elitist tone of the Google authors, we agree with them on their progressive social values. There is really no need to parade infamous racists in chronological order with us inserted at the terminal node. But objectivity does exist, at least in theory, independent of whatever social norms prevail.


One of us has a Ph.D. in computer science; we know all too well about “garbage in, garbage out”. However, the Google authors seem to suggest that machine learning tools cannot be used in social computing simply because no one can prevent the garbage of human biases from creeping in. We do not share their pessimism. Like most technologies, machine learning is a neutral tool. If it can be used to reinforce human biases in social computing problems, as the Google authors argued, then it can also be used to detect and correct human biases (prejudice). They worry about the feedback loop but conveniently do not see that the feedback can be either positive or negative. Granted, criminality is a highly delicate and complex matter; however, well-trained human experts can strive to ensure the objectivity of the training data, i.e., render correct legal decisions independent of the facial appearance of the accused. If the labeling of training face images is free of human biases, then the advantages of machine learning over human judgment in objectivity cannot be denied.


Even in the presence of label noise, regardless of whether it is random or systematic, scientific methods do exist to launder the data and restore or enhance credence in the results of statistical inference. We should not short-change scientific knowledge for any shade of populism.
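
As one concrete, admittedly simplified illustration of such laundering, the sketch below assumes symmetric random label noise with a known flip rate and inverts its effect on an observed accuracy; the flip rate and accuracy values are hypothetical, not figures from our paper.

```python
# Illustration only: if binary labels are flipped at random with a known
# rate rho, the accuracy measured against the noisy labels (acc_obs)
# relates to the accuracy against the clean labels (acc_true) by
#     acc_obs = acc_true * (1 - rho) + (1 - acc_true) * rho,
# which can be inverted as long as rho < 0.5.

def denoise_accuracy(acc_obs: float, rho: float) -> float:
    """Estimate clean-label accuracy from noisy-label accuracy."""
    if not 0.0 <= rho < 0.5:
        raise ValueError("flip rate must be in [0, 0.5)")
    return (acc_obs - rho) / (1.0 - 2.0 * rho)

# Hypothetical numbers: 85% accuracy measured on labels with a 5% flip rate.
print(denoise_accuracy(0.85, 0.05))  # about 0.889
```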


Risk of Overfitting


Our critics are quick to point out the relatively small sample set used in our experiments and the attendant risk of overfitting. We are sorely aware of this weakness but cannot obtain more ID images of convicted Chinese males for obvious reasons (this Google article may well have dashed all our hopes of enriching our data set). However, we did make our best effort to validate our findings in Section 3.3 of our paper, which opens as follows but was completely ignored by the Google authors.


“Given the high social sensitivities and repercussions of our topic and skeptics on physiognomy [19], we try to excise maximum caution before publishing our results. In playing devil’s advocate, we design and conduct the following experiments to challenge the validity of the tested classifiers …”


We randomly label the faces of our training set as negative and positive instances with equal probability, and run all four classifiers to test whether any of them can separate the randomly labeled face images better than flipping a coin. All four face classifiers fail this test and other, similar but more challenging tests (refer to our paper for details). These empirical findings suggest that the good classification performance reported in our paper is not due to overfitting; otherwise, given a sample set of the same size and type, the classifiers would also be able to separate randomly labeled data.
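
For illustration, the following sketch runs this kind of random-label sanity check on synthetic data with a generic scikit-learn classifier, rather than the CNN and other models of the paper; the feature dimension and sample size are made up.

```python
# Sketch of a random-label sanity check (not the paper's exact protocol):
# if a classifier can "learn" labels assigned purely at random, its apparent
# accuracy on genuine labels is suspect (overfitting / too little data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 128              # hypothetical sizes
X = rng.normal(size=(n_samples, n_features))   # stand-in for face features
y_random = rng.integers(0, 2, size=n_samples)  # labels assigned by coin flip

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y_random, cv=5)
print("accuracy on randomly labeled data:", scores.mean())  # ~0.5 expected
```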


White Collar


Regarding the question about the white-collared shirts that appear in some ID photos and not in others, we forgot to clarify that in our machine learning experiments we segment the face portion out of all ID images; only these face-only images are used in training and testing.
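
A minimal sketch of such face-only cropping is given below; it uses OpenCV's stock Haar cascade detector purely for illustration and is not the segmentation pipeline actually used in the paper.

```python
# Illustration only: crop the face region out of an ID photo so that
# clothing (e.g., a white collar) never enters the classifier's input.
# This uses OpenCV's bundled Haar cascade, not the paper's actual pipeline.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def crop_face(image_path: str, out_path: str) -> bool:
    """Save the largest detected face region; return False if none found."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return False
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detection
    cv2.imwrite(out_path, img[y:y + h, x:x + w])
    return True
```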


Nevertheless, the cue of the white collar exposes an important detail for which we owe the readers an apology: we could not control for the socioeconomic status of the gentlemen whose ID photos were used in our experiments. Not because we did not want to, but because we did not have access to the metadata owing to confidentiality issues. Reflecting on this nuance now, we speculate that the performance of our face classifiers would drop if the image data were controlled for socioeconomic status. A corollary of social injustice might immediately follow, we suppose. In fact, this is precisely why we think our results have significance for the social sciences.


In our paper, we also took steps to prevent the machine learning methods, the CNN in particular, from picking up superficial differences between images, such as compression noise and differences between cameras (Section 3.3).
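
One simple way to neutralize compression artifacts (not necessarily the exact procedure of Section 3.3) is to re-encode every image with identical JPEG settings before training, as in the hypothetical sketch below.

```python
# Illustration only: re-encode all images with the same JPEG quality so that
# compression artifacts cannot serve as a spurious class cue. This is a
# generic normalization step, not the paper's exact procedure.
from pathlib import Path
from PIL import Image

def normalize_jpeg(src_dir: str, dst_dir: str, quality: int = 75) -> None:
    """Re-save every image in src_dir as a JPEG with a fixed quality."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*"):
        if not path.is_file():
            continue
        img = Image.open(path).convert("RGB")
        img.save(out / (path.stem + ".jpg"), "JPEG", quality=quality)
```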


In conclusion, we appreciate all questions about and discussions of our paper, but we categorically reject distortions of our intention, such as “What Wu and Zhang’s paper purports to do is precisely that” (referring to James Weidmann), which are unprofessional and arrogant.




https://blog.sciencenet.cn/blog-3270119-1054734.html

