Overview of Natural Language Processing (NLP) Core Engine
【This document provides a text version of Dr. Wei Li's overview of NLP, presented on August 8, 2012.】
At a high level, our NLP core engine reads sentences and extracts insights to support our products. The link between the products and the core engine is the storage system. Today’s topic is the workings of the NLP core engine.
System Overview
Our NLP core engine is a two-component system.
The first component is a parser, whose output is a dependency tree structure representing the system’s understanding of each sentence. This component outputs a system-internal, linguistic representation, much like the sentence diagramming taught in grammar school. This part of the system takes a sentence and “draws a tree of it.” The system parses language in a number of passes (modules), starting from a shallow level and moving on to a deep level.
The second component is an extractor, which sits on top of the parser and outputs a table (or frame) that directly meets the needs of products. This is where the extraction rules, based on sub-tree matching, do their work, including our sentiment extraction component for social media customer insights.
Dependency Tree Structure and Frames
An insight extractor of our system is defined by frames. A frame is a table or template that defines the name of each column (often called an event role) for the target information (or insights). The purpose of the extraction component is to fill in the blanks of the frame and use the extracted information to support a product.
Each product is supported by different insight types, which are defined in the frame. To build a frame, Product Management determines what customers need and what output they want from processing sentences, and uses that information to formulate frame definitions.
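To make the two-component design more concrete, here is a minimal Python sketch of how a sub-tree matching rule might fill a frame from a dependency tree. It is an illustration only, not the engine's actual rule formalism: the `Node` class, the relation labels (`subj`, `obj`, `adv`), and the `match_product_launch` function are all hypothetical names, and the tree is built by hand instead of coming from a parser.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a toy dependency tree: a phrase, its lemma, and labelled children."""
    text: str
    lemma: str
    children: Dict[str, "Node"] = field(default_factory=dict)


def match_product_launch(root: Node) -> Optional[Dict[str, str]]:
    """Fill a Product-Launch frame if the sub-tree pattern matches, else return None."""
    # Pattern (an assumption): head verb "launch" with a logical subject and object.
    if root.lemma != "launch":
        return None
    subj = root.children.get("subj")
    obj = root.children.get("obj")
    if subj is None or obj is None:
        return None
    frame = {"Company": subj.text, "Pred": root.text, "Product": obj.text}
    # The time adverbial is an optional role; fill it only when present.
    adv = root.children.get("adv")
    if adv is not None:
        frame["When"] = adv.text
    return frame


# Hand-built tree standing in for parser output on
# "Apple launched iPhone 4s last month".
tree = Node("launched", "launch", {
    "subj": Node("Apple", "Apple"),
    "obj": Node("iPhone 4s", "iPhone 4s"),
    "adv": Node("last month", "last month"),
})
frame = match_product_launch(tree)
print(frame)
```

The point of the sketch is the division of labor: the rule never looks at the raw sentence, only at the tree, so any sentence that parses into the same sub-tree shape fills the same frame columns.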
The NLP team takes the product-side requirements, does a feasibility study, and starts system development based on a development corpus. The development covers rules (in a formalism equivalent to an extended version of a cascaded finite-state mechanism), lexicons, and procedures (including machine learning for classification and clustering).
The frames for objective events define things like who did what, when, and where, with a specific domain or use scenario in mind. The frames for sentiments or subjective evaluations first contain information on whether a comment is positive, negative, or neutral (determined in a process called sentiment classification). They also define additional, more detailed columns on who made the comment, on what, to what degree (passion intensity), in which aspects (details), and why. They distinguish an objective insight (for example, “cost-effective” or “expensive”) from a subjective one (for example, “terrific,” “ugly,” or “awful”).
Insight extraction builds on the first component, linguistic processing (parsing). More specifically, it is realized by sub-tree matching rules in extraction grammars. Take this example:
Apple launched iPhone 4s last month
The parser first decodes the linguistic tree structure, determining that the logical subject (actor) is “Apple,” the action is “launch,” the logical object (undergoer) is “iPhone 4s,” and “last month” is an adverbial. The system uses these phrases to fill in the linguistic tree structure.
Based on the above linguistic analysis, the second component extracts a product launch event as shown below:
<Product-Launch Frame>
    <Company="Apple">
    <Pred="launched">
    <Product="iPhone 4s">
    <When="last month">
How Systems Answer Questions
We can also look at our system from the perspective of how it addresses users’ information needs, in particular, how it answers the questions we have in mind.
There are two major systems for getting feedback to satisfy users’ information needs.
Traditional systems, like search engines. A user enters a query into a search engine and gets documents or URLs related to query keywords. This system satisfies some needs, but there is too much information and what you want to know might be buried deep in the data.
NLP-based systems, which can answer users’ questions. All our products can be regarded as special types of “question-answering systems.” The system reads everything, sentence by sentence. If it has a target hit, it can pull out answers from the index to the specified types of questions.
Technology for answering factoid questions, such as when (time), where (location), and who (person), is fairly mature. The when-question, for example, is easy to answer because time is almost always expressed in standard formats. The most challenging questions to answer are “how” and “why.” There is consensus in the question answering community on this. To answer “how” questions, you might need a recipe, a procedure, or a long list of drug names. To answer “why,” the system needs to find the motivation behind a sentiment or the motive behind a behavior.
Our products are high-end systems designed to answer “how” and “why” questions in addition to extracting sentiments. For example, if you enter “heart attack” into our system, you get a full solution package organized into sections, including a list of procedures, a list of drugs, a list of operations, the names of doctors and professionals, etc. Our consumer insight product classifies sentiments, otherwise known as “thumbs-up” and “thumbs-down” classification, just as our competitors do. But our analysis is much more fine-grained and much deeper, and it still scales up. Not only can it tell you what percentage, in what ratio, and how intensely people like or dislike a product, it also provides answers for why people like or dislike a product or a feature of a product. This is important: knowing how popular a brand is gives only a global view of customer sentiments, and such coarse-grained sentiments are not insightful by themselves. Actionable insights in the sentiment world need to answer why-questions. Why do customers like or dislike a product feature? Systems that can answer such questions provide invaluable actionable insights to businesses. For example, it is much more insightful to know that consumers love the online speed of iPhone 4s but are very annoyed by the lack of support for Flash. This is an actionable insight, one that a company could use to redirect resources to address issues or drive a product’s development. Extraction of such insights is enabled by our deep NLP, a competitive advantage over the traditional classification and clustering algorithms practiced by almost all the competitors who claim to do sentiment analysis.
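The difference between a coarse-grained sentiment ratio and a "why" answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame fields (`target`, `polarity`, `reason`), the `summarize` function, and the sample records are all made up for the example and do not reflect the product's actual schema or data.

```python
from collections import Counter

# Made-up sentiment frames, standing in for what an extractor might produce.
frames = [
    {"target": "iPhone 4s", "polarity": "positive", "reason": "online speed"},
    {"target": "iPhone 4s", "polarity": "positive", "reason": "online speed"},
    {"target": "iPhone 4s", "polarity": "negative", "reason": "no Flash support"},
    {"target": "iPhone 4s", "polarity": "positive", "reason": "screen"},
]


def summarize(frames, target):
    """Return the share of positive comments and the reasons ranked per polarity."""
    relevant = [f for f in frames if f["target"] == target]
    total = len(relevant)
    polarity_counts = Counter(f["polarity"] for f in relevant)
    # Coarse-grained view: what fraction of comments is positive.
    positive_share = polarity_counts["positive"] / total if total else 0.0
    # Fine-grained view: why people like or dislike the target.
    reasons = {
        polarity: Counter(
            f["reason"] for f in relevant if f["polarity"] == polarity
        ).most_common()
        for polarity in ("positive", "negative")
    }
    return positive_share, reasons


positive_share, reasons = summarize(frames, "iPhone 4s")
print(positive_share)
print(reasons)
```

A classifier alone can only produce the first number. The ranked reasons require that each comment has already been turned into a structured frame with a reason column.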
Q&A
Q: How do you handle sarcasm?
A: Sarcasm is tough. It is a challenge to all the systems, us included. We have made some tangible progress and implemented some patterns of sarcasm in our system. But overall, it is a really difficult phenomenon of natural language. So far in the community, there is only limited research in the lab, far from being practical. People might say “no” when they mean “yes,” using a “zig-zag” way to express their emotions. It’s difficult enough for humans to understand these things and much more difficult for a machine.
The good news is that sarcasm is not that common overall, assuming that we are considering a large amount of real-life data. There are benchmarks in the literature on what percentage of real-life language corpora is sarcastic. Fortunately, only a small fraction of the data is likely to involve sarcasm, so it often has no statistical impact on data quality, whether or not it is captured.
Not all types of sarcasm are intractable. Our products can capture common patterns of sarcasm fairly well. Our first target is sarcasm with fairly clear linguistic patterns, such as when people combine “thank you” (a positive emotion) with a negative behavior: “Thank you for hurting my feelings.” Our system recognizes and captures this contradictory pattern as sarcasm. “Thank you,” in this context, would not be presented as a positive insight.
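As a rough illustration of the "thank you plus negative behavior" pattern, here is a minimal Python sketch. It works on the surface string with a regular expression, whereas the system described above matches patterns on parse trees. The lexicon `NEGATIVE_BEHAVIORS` and the function `is_sarcastic_thanks` are hypothetical and far smaller than anything a real system would need.

```python
import re

# Hypothetical, tiny lexicon of negative behaviors (an assumption for this sketch).
NEGATIVE_BEHAVIORS = ("hurting", "ignoring", "breaking", "ruining", "wasting")

# Matches "thank you for X" or "thanks for X" and captures the word X.
THANKS_FOR = re.compile(r"\bthank(?:s| you)\s+for\s+(\w+)", re.IGNORECASE)


def is_sarcastic_thanks(sentence: str) -> bool:
    """Return True when a thank-you phrase is followed by a negative behavior."""
    match = THANKS_FOR.search(sentence)
    return bool(match) and match.group(1).lower() in NEGATIVE_BEHAVIORS


print(is_sarcastic_thanks("Thank you for hurting my feelings."))
print(is_sarcastic_thanks("Thank you for helping me."))
```

When the check fires, a downstream step would suppress the positive reading of "thank you" instead of reporting it as a positive insight.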
Q: Do you take things only in context (within a sentence, phrase, or word) or consider a larger context?
A: Do we do anything beyond the sentence boundary to make our insights more coherent to users? Yes, to some extent, and more work is in progress. The index contains all local insights, broken down into “local” pieces. If we don’t put data into the index piece by piece, users can’t “drill down.” Drill-down is a necessary feature in products so the users can verify the insight sources (where exactly the insight is extracted from) and may choose to dive into a particular source.
After our application retrieves data from the index, it performs a “massaging” phase between retrieving the data from storage and displaying it. This massaging phase introduces context beyond sentence and document boundaries. For example, “acronym association” identifies all of the different names used to refer to an entity (such as “IBM” versus “International Business Machines Corp”). This context-based acronym association capability is used as an anchoring point for merging the related insights. We have also developed a co-reference capability to associate, for example, the pronoun “it” with the entity (such as iPhone) that it refers to.
This phase also includes merging of phrases from local insights. For example, “cost-ineffective” is a synonym of “expensive.” The app merges these local insights before presenting them to the users.
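The merging step can be pictured with a small Python sketch. This is an assumption-laden illustration, not the application's actual logic: the lookup tables `ENTITY_ALIASES` and `PHRASE_SYNONYMS`, the `merge_insights` function, and the source IDs are all hypothetical. Each local insight is mapped to a canonical entity and phrase, and the source IDs are kept so that drill-down to the original sentences stays possible.

```python
from collections import defaultdict

# Hypothetical lookup tables; a real system would derive these from context.
ENTITY_ALIASES = {
    "ibm": "IBM",
    "international business machines corp": "IBM",
}
PHRASE_SYNONYMS = {"cost-ineffective": "expensive"}


def merge_insights(local_insights):
    """Group (entity, phrase, source_id) triples under canonical names."""
    merged = defaultdict(list)
    for entity, phrase, source_id in local_insights:
        canonical_entity = ENTITY_ALIASES.get(entity.lower(), entity)
        canonical_phrase = PHRASE_SYNONYMS.get(phrase.lower(), phrase.lower())
        # Keep every source ID so users can still drill down to each origin.
        merged[(canonical_entity, canonical_phrase)].append(source_id)
    return dict(merged)


insights = [
    ("IBM", "expensive", "doc1"),
    ("International Business Machines Corp", "cost-ineffective", "doc2"),
    ("iPhone", "fast", "doc3"),
]
merged = merge_insights(insights)
print(merged)
```

Merging at display time, and not at indexing time, keeps the index itself local and piece-by-piece, which is what makes drill-down work.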
Concluding Remarks on Language Technology and its Applications
NLP was confined to labs for decades, from the beginning of machine translation research in the early 1950s up until the last decade. Until only a few years ago, NLP had experienced only limited success in applications. While it is moving very fast, NLP has not yet reached its prime time in the industry.
However, this technology is maturing and starting to show clear signs of serving as an enabling technology that can revolutionize how humans access information. We are already beyond the proof-of-concept stage, where its value had to be proven. It just works, and we want to make it work better and more effectively. In the IT sector, more and more applications using NLP are expected to go live, ranging from social media and big data processing to intelligent assistants (e.g., Siri-like features) on mobile platforms. We are in an exciting race towards making language technology work in large-scale, real-life systems.