By Phillip Green, Founder and CEO of Informatics4AI
Building a machine learning model on unstructured full text, such as doctors’ notes or customer conversations, can be frustrating and often ends in failure. As Michael I. Jordan, a professor at the University of California, Berkeley and a leading figure in machine learning, said, “There is no real intelligence there” (NY Times, June 20, 2018). Indrek Vainu, CEO of AlphaBlues, an enterprise chatbot company, concurs, stating that machines are “largely clueless when fed unstructured data, such as free text.” He goes on to say that “extraction of meaning — or more specifically, semantic relations between words in free text — is a complex task. The complexity is mostly due to the rich web of relations between the conceptual entities the words represent.”
At Informatics4AI, we share these views. Most of our customers working with unstructured full-text datasets have found that their initial machine learning efforts do not produce predictions with the needed accuracy. They often try standard NLP techniques (tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, and so on), but these almost always prove inadequate for building a high-quality model.
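For reference, here is a minimal sketch of those standard steps using the open-source spaCy library (assuming its small English model is installed); notice that each step describes the surface of the text, not its meaning:

```python
# A minimal sketch of the standard NLP steps, using spaCy
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Jones treated the patient for yellow fever in Nairobi.")

for token in doc:
    # Tokenization, lemmatization (a form of stemming), and POS tagging.
    print(token.text, token.lemma_, token.pos_)

for ent in doc.ents:
    # Named entity recognition: labels surface types, not domain meaning.
    print(ent.text, ent.label_)
```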
Label the text for meaning
However, all is not lost. The key is adding structure and meaning to the raw data, thereby enabling the machine to understand the text and begin to learn. By labeling the text for meaning, you provide a map the machine can use to navigate the text.
We recommend several advanced NLP techniques, including:

- Labeling nouns and noun phrases for meaning (the focus of this post).
- Labeling words (most often adverbs and adjectives) for sentiment.
- Labeling verbs for intent.
Labeling nouns and noun phrases
Much of the meaning in text is carried by nouns and noun phrases. Unfortunately, machines don’t know even the simplest things, such as that lions and zebras are animals living in Africa and that one eats the other. However, a machine can learn these things via a semantic ontology of entities.
The purpose of the semantic ontology is to: a) encode commonalities between concepts in a specific domain (e.g., both “yellow fever” and “malaria” are “diseases spread by mosquitoes”), and b) encode how words relate to concepts, which may vary with context (e.g., “hand” is sometimes a worker (hired hand), sometimes applause (give them a big hand), sometimes busy (hands full), and sometimes responsible for (in the hands of the judge)). The entity ontology is a semantically unambiguous dictionary that allows the text to be labeled for meaning and thus enables the machine to learn with more accuracy than it could from the ambiguous raw text.
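As an illustration only (the concept labels below are invented, not any particular product’s format), a semantic ontology can be thought of as a mapping from surface phrases to unambiguous concepts, with ambiguous words mapping to several candidate senses:

```python
# A toy, hand-written semantic ontology: phrases map to concepts, and
# ambiguous words map to different concepts depending on context.
# All concept labels here are invented for illustration.
ONTOLOGY = {
    "yellow fever": ["disease/mosquito_borne"],
    "malaria": ["disease/mosquito_borne"],
    "hand": [
        "person/worker",      # "hired hand"
        "event/applause",     # "give them a big hand"
        "state/busy",         # "hands full"
        "relation/custody",   # "in the hands of the judge"
    ],
}

def candidate_concepts(phrase: str) -> list[str]:
    """Return every concept a phrase could denote; a real system
    would use the surrounding context to select a single sense."""
    return ONTOLOGY.get(phrase.lower(), [])

print(candidate_concepts("malaria"))   # ['disease/mosquito_borne']
print(candidate_concepts("hand"))      # four senses; context must decide
```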
A key problem with ontologies is that, to date, most have been created by humans, which is both costly and time-consuming. That is no longer necessary: advanced NLP systems can now create a semantic entity ontology automatically, simplifying and speeding up this critical step while reducing its cost.
A few notes about labeling nouns and noun phrases:
Overfitting
Consider full-text medical records involving heart disease at a specific hospital. Across thousands of records, Dr. Amanda Jones appears frequently in those about heart attacks, because she is the Chief of Cardiology. When processing the raw records, an AI model will often “overfit” and may conclude that Dr. Jones is a cause of heart attacks rather than the attending physician. By labeling the data with an entity ontology, we can prevent this overfitting and help the machine understand the difference between doctors and the factors that can cause heart attacks.
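One simple way to apply this idea, sketched here with invented entity names and type labels, is to replace specific entities with their ontology types before training, so the model learns the role rather than the individual:

```python
import re

# Hypothetical ontology fragment: specific entities -> semantic types.
ENTITY_TYPES = {
    "Dr. Amanda Jones": "<ATTENDING_PHYSICIAN>",
    "myocardial infarction": "<CARDIAC_EVENT>",
}

def label_entities(text: str) -> str:
    """Replace known entities with type labels so the model learns
    'physicians attend cardiac events', not 'Jones causes them'."""
    for surface, type_label in ENTITY_TYPES.items():
        text = re.sub(re.escape(surface), type_label, text)
    return text

record = "Dr. Amanda Jones admitted the patient after a myocardial infarction."
print(label_entities(record))
# <ATTENDING_PHYSICIAN> admitted the patient after a <CARDIAC_EVENT>.
```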
Ontologies
Some systems create what is known as an orthogonal ontology, in which each noun or noun phrase is placed in only one concept. While this may be acceptable in some applications, in others it is highly problematic. (For example, “hand,” as noted above, has multiple meanings, and the machine needs to know which one applies.) Carefully consider the type of ontology being created and how it will be used before you invest heavily in its creation.
Labeling nouns and noun phrases for meaning is probably the most important step for improving your model, but here are two other techniques to consider.
Labeling adverbs and adjectives for sentiment
When using sentiment analysis for AI, document-level sentiment is essentially useless; AI requires sentence-level and often phrase-level sentiment. For example: “The x-ray revealed good news regarding the original tumor, but it unfortunately revealed an advanced second mass.” We want the model to see the first part of the sentence as positive and the second part as negative, and to recognize that “advanced” is an intensifier that makes the second part strongly negative. Lastly, we do not want the sentence classified as neutral (half good plus half bad); we want it treated as two distinct thoughts, one positive and one negative.
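Here is a rough sketch of that kind of clause-level scoring with an intensifier list; the lexicon, weights, and clause-splitting rule are all invented for illustration:

```python
# Toy clause-level sentiment: split on clause boundaries, score each
# clause separately, and let intensifiers amplify the score.
# Lexicon values and the intensifier weight are invented for illustration.
LEXICON = {"good": 1.0, "unfortunately": -1.0}
INTENSIFIERS = {"advanced": 1.5}   # "advanced second mass" is worse

def clause_sentiment(clause: str) -> float:
    words = clause.lower().replace(".", "").split()
    score = sum(LEXICON.get(w, 0.0) for w in words)
    for w in words:
        if w in INTENSIFIERS:
            score *= INTENSIFIERS[w]
    return score

sentence = ("The x-ray revealed good news regarding the original tumor, "
            "but it unfortunately revealed an advanced second mass.")
for clause in sentence.split(","):
    # Two distinct judgments, not one neutral average.
    print(clause.strip(), "->", clause_sentiment(clause))
```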
Labeling verbs for intent
The following capabilities are important to consider when evaluating intent detection (a brief sketch follows the list):

- Can you train your NLP system to understand a domain-specific set of intents? (By this I mean not just a hand-coded set of intents, but an ML-assisted understanding of intents.)
- Can you build rules with whitelisted or blacklisted words to supplement the learned intents?
- Does your system have a way to disambiguate intents between the intender and the intendee?
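As a sketch of the second capability, a rule layer with whitelisted and blacklisted words can supplement, and where needed override, a learned classifier (the classifier below is a stub, and the rule format is invented):

```python
# Sketch: rules with whitelisted/blacklisted words supplementing a
# learned intent classifier. The classifier here is a stub; in practice
# it would be a trained model.
def learned_intent(utterance: str) -> str:
    return "book_room"   # stand-in for a real ML prediction

RULES = [
    # (intent, must contain one of, must contain none of)
    ("cancel_booking", {"cancel", "refund"}, {"policy"}),
]

def detect_intent(utterance: str) -> str:
    words = set(utterance.lower().split())
    for intent, whitelist, blacklist in RULES:
        if words & whitelist and not words & blacklist:
            return intent          # rule overrides the learned intent
    return learned_intent(utterance)

print(detect_intent("cancel my reservation please"))  # cancel_booking
print(detect_intent("I need a room for Friday"))      # book_room
```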
An example
Let’s look at a simple example of how data labeling can provide predictive lift. Take the following two sentences from the travel industry:
- The room was dope.
- The check-in clerk was a dope.
By labeling these sentences for meaning, the model can understand that the hotel has great rooms but also has a service problem. Without labeling for meaning, the machine is very likely to misunderstand these sentences, and many others like them, resulting in a poor model.
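To make that concrete, here is a toy sense labeler for these two sentences; the disambiguation rule and the sense labels are invented for illustration:

```python
# Toy word-sense labeling for the two "dope" sentences. The rule is
# invented for illustration: as a predicate adjective, "dope" is slang
# praise; as a noun ("a dope"), it is an insult.
def label_dope(sentence: str) -> str:
    if " a dope" in sentence.lower():
        return sentence.replace("dope", "dope<PERSON:FOOL>")      # negative
    return sentence.replace("dope", "dope<SLANG:EXCELLENT>")      # positive

print(label_dope("The room was dope."))
print(label_dope("The check-in clerk was a dope."))
# With sense labels attached, a downstream model sees praise for the
# room and a complaint about service, instead of one ambiguous token.
```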
NLP tools that enable full-text to be labeled for meaning will improve your model and provide you with predictive lift. See the whitepaper on “Data Labeling Full-Text Datasets for Predictive Lift” for a more in-depth look at this topic.
Have you been struggling with building models using unstructured full-text? I would love to hear your experiences. Thanks for reading this post, and your feedback is always welcome.
Bio: Phillip Green is the founder and CEO of Informatics4AI. He has decades of experience in knowledge management, search, and artificial intelligence. Informatics4AI helps customers improve the performance of search systems and AI models built on text datasets. He can be reached at pgreen@informatics4ai.com.