Guide to Natural Language Processing
We’ve been able to talk to machines in science fiction films as far back as 1927’s Metropolis. Though we can now “talk” to Alexa and Siri, the human-machine relationships played out on the screen are still very much in the realm of sci-fi. In the last 20 years, however, technology advances and a massive wealth of structured and unstructured data are helping us get closer to realizing these AI fantasies.
One of the cornerstones of this progress is the continuing evolution of Natural Language Processing (NLP). Not only does NLP play a major role in a future where we have droid friends, but it also powers many technologies that we use today, from enterprise search to chatbots. But just like linguistics, NLP is complicated and can be confusing. If you’re new to the concept or looking for an overview of what it is and how it’s used, then this guide is for you.
Natural Language Processing Overview
Natural language processing is the science behind machine comprehension. It’s the study of how to translate the spoken word into something a machine programmed with ones and zeros can understand. But it’s far more than just knowing words.
Language is a complex mix that can include humor, irony, expressions, or local sayings that even other humans sometimes find difficult to interpret. Even a small thing like using “it” to stand in for something previously mentioned can trip up a machine, as in the example below:
So how does NLP begin to tackle this challenge? Let’s start with a definition and some history first.
What is Natural Language Processing?
Simply speaking, natural language processing refers to any human-to-machine interaction where the computer is able to understand or generate human-like language. This is in contrast to what computers were previously limited to, which was some form of machine language.
Simple as the end result may appear, the actual process of getting a computer to perform NLP represents an extremely complex synergy of different scientific and technical disciplines.
NLP brings together computer science, linguistics, machine learning and artificial intelligence (AI). Depending on how the NLP is designed, the process may also involve statistics and data analysis. These elements, working in concert, make NLP capable of “hearing” or reading natural human speech and accurately parsing the words so the computer can take the action expected by the human user.
NLP’s roots are often traced back to the Georgetown experiment in 1954, which translated several Russian sentences into English. This was done using manually coded language rules and dictionary lookups. On the heels of this project, researchers initially thought they would have machine translation “solved” in a matter of years, but soon found it to be a far more difficult task because of the complexities and nuances of language.
Throughout the sixties and seventies, research continued into how to represent linguistic meaning computationally. Notable efforts included Joseph Weizenbaum’s ELIZA program, which simulated a conversation between patient and therapist, and MIT’s SHRDLU, which enabled a user to move blocks around a world using natural language. Here’s an example of a SHRDLU conversation:
Things quickly picked up when statistical models and machine learning algorithms started replacing the need for hand-written, hard-coded rules and grammar. As computers became more powerful and unstructured data became ubiquitous, deep learning and neural networks further advanced the field, paving the way for virtual assistants like Siri and Alexa. But as anyone with experience using Alexa today can attest, there is still work to be done (Alexa: “I’m sorry, I didn’t understand that.”).
NLP Methods and Approach
So how do we make computers understand us? The short answer is that it’s complicated–far more complex than this guide will dive into. That said, some basic steps have to happen to translate the spoken word into something machines can understand and respond to. Let’s take a look at what those are.
How does NLP Work?
There is no single way that NLP software functions. However, in general, almost all NLP tools have capabilities that enable them to distinguish syntactic and semantic rules and recognize many different words. This is not an easy thing to do.
Consider the following written enterprise search query: An employee wants to know if the company has declared December 26 a holiday. She types, “Do we get Boxing Day off?” That’s a question that most human beings can easily understand, even if they have to figure out that December 26 is sometimes called “Boxing Day.”
An NLP program will have to deconstruct this question on syntactic and semantic levels to enable the enterprise search engine to return a result. First, it has to recognize that “Do we” is a question reflecting “we,” meaning company employees. Then, it has to parse the words “get” and “off” as referring to getting time off from work versus, say, getting off a plane. It needs to understand what Boxing Day is… to the point where the search engine can respond to the actual query, “Is December 26 a holiday?”
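The deconstruction described above can be sketched in a few lines. This is only a toy illustration, not how any particular enterprise search engine works: the lookup table `HOLIDAY_ALIASES`, the pattern list `PHRASE_PATTERNS`, and the `interpret` function are all hypothetical names invented for this example.

```python
import re

# Hypothetical lookup table (assumption): maps a holiday name to a calendar date.
HOLIDAY_ALIASES = {"boxing day": "December 26"}

# Hypothetical phrase patterns (assumption): "get <something> off" is read as
# a request about time off from work, not "get off <something>".
PHRASE_PATTERNS = [
    (re.compile(r"\bget (?P<what>.+?) off\b"), "time_off"),
]


def interpret(query: str) -> dict:
    """Turn a free-text query into a small structured interpretation."""
    text = query.lower().rstrip("?!. ")
    result = {
        "is_question": query.strip().endswith("?"),
        "intent": None,
        "date": None,
    }
    for pattern, intent in PHRASE_PATTERNS:
        match = pattern.search(text)
        if match:
            result["intent"] = intent
            what = match.group("what")
            # Resolve "boxing day" to the date it refers to, if known.
            result["date"] = HOLIDAY_ALIASES.get(what, what)
    return result


print(interpret("Do we get Boxing Day off?"))
print(interpret("When do we get off the plane?"))
```

With these toy rules, the first query resolves to a time-off question about December 26, while the plane query is left without an intent because “get off” is not followed by the “get … off” shape the pattern expects.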
Performing these tasks is an enormous challenge. Human language is highly complex, with English being arguably one of the most difficult.
NLP transforms data into something that a computer can interpret by starting with what is known as “data pre-processing.” In this stage, the NLP tool analyzes syntax and semantics to understand the grammatical structure of the text. It identifies how the words relate to one another in the specific context.
Data pre-processing may utilize tokenization, which breaks text down into smaller units (tokens), such as words, for analysis. The process then tags different parts of speech, e.g., “we” is a pronoun, “do” is a verb, etc. It could then perform techniques called “stemming” and “lemmatization,” which reduce words to their root forms. The NLP tool might also filter out words like “a” and “the” that don’t convey any unique information.
The data pre-processing step generates a clean dataset for precise linguistic analysis. The NLP tool has an algorithm that then interprets the dataset. The algorithm can take one of several forms. With a rule-based approach, the NLP tool uses grammatical rules created by expert linguists.
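A minimal sketch of such a pre-processing pipeline is shown here, using only the Python standard library. The stop-word list and suffix list are tiny, made-up stand-ins (assumptions) for the much larger resources a real NLP tool would use, and part-of-speech tagging is left out for brevity.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are far longer.
STOP_WORDS = {"a", "an", "the", "are", "about"}

# Crude suffix list for stemming (assumption); tried in this order.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens):
    """Filter out words that carry little information on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The employees are asking about the holidays."))
# ['employee', 'ask', 'holiday']
```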
Alternatively, and this is increasingly common, NLP uses machine learning algorithms. These models are based on statistical methods that “train” the NLP to understand human language better. Furthermore, the NLP tool might take advantage of deep learning, sometimes called deep structured learning, based on artificial neural networks.
The techniques involved in NLP include both syntax analysis and semantic analysis.
Syntax is defined as “the arrangement of words and phrases to create well-formed sentences in a language.” Essentially, syntax is about how sentence structure creates grammatical sense. In terms of NLP, syntactical analysis uses machine learning algorithms to apply grammatical rules to a group of words to derive meaning. It often includes the following processes:
- Part of speech tagging: identifying the part of speech for every word, i.e., the nouns, verbs, adjectives, etc.
- Morpheme segmentation: breaking words down into morphemes, defined as “a meaningful morphological unit of a language that cannot be further divided (e.g., in, come, -ing, forming incoming).”
- Stemming: removing the ends of words to arrive at their root or base form. For example, a simple suffix-stripping stemmer might reduce driver, driving, and drives to driv.
- Lemmatization: grouping together the inflected forms of a word so they can be analyzed by the word’s lemma or dictionary form. This is more advanced than stemming, attempting to use tagging, morphology, and context to understand the intended meaning rather than just cutting off the end of the word.
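The difference between stemming and lemmatization in the list above can be made concrete with a small sketch. Both functions are deliberately naive: the suffix list and the `LEMMAS` dictionary are invented for this illustration, whereas real lemmatizers rely on large lexicons and part-of-speech information.

```python
# Hypothetical lemma dictionary (assumption); a real one covers a whole language.
LEMMAS = {
    "drives": "drive",
    "driving": "drive",
    "drove": "drive",
    "driven": "drive",
}


def stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up and return its dictionary form, if known."""
    return LEMMAS.get(word, word)


for word in ("driver", "driving", "drives", "drove"):
    print(word, "->", stem(word), "/", lemmatize(word))
```

The stemmer maps driver, driving, and drives to driv, but turns the irregular form drove into drov. The dictionary lookup maps drove to drive, which is the kind of case where lemmatization does better than cutting off word endings.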
Semantics is defined as the “meaning of a word, phrase, sentence, or text.” This is the most challenging task for NLP and is still being developed. Again, we’ll revisit the “Do we get Boxing Day off?” example. Semantics is the art of understanding that this question is about time off from work for a holiday. A human grasps this easily, but a computer still struggles with the colloquialisms and shorthand manner of speaking that make up this sentence.
Semantic analysis often involves:
- Word sense disambiguation: identifying in which sense a word is being used according to its context. For example, knowing whether “date” refers to a day on the calendar, the fruit, or an outing.
- Named entity recognition: identifying and categorizing parts of a text into groups. For example, names of people or places.
- Tokenization: the process of segmenting running text into sentences and words, or pieces called tokens, and at the same time throwing away certain characters, such as punctuation.
- Relationship extraction: analyzing the semantic relationships in a text, often between two or more entities. For example, the phrase “Steve Jobs is one of the founders of Apple, which is headquartered in California” contains two different relationships: Steve Jobs is a founder of Apple, and Apple is headquartered in California.
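A pattern-based sketch of relationship extraction on the Steve Jobs sentence might look like the following. The patterns and relation labels (`founder_of`, `headquartered_in`) are hand-written assumptions for this one sentence; production systems generally learn such relations from annotated data instead.

```python
import re

# An entity is approximated as a run of capitalized words (assumption).
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hand-written surface patterns, each paired with a made-up relation label.
PATTERNS = [
    ("founder_of", re.compile(rf"({ENTITY}) is one of the founders of ({ENTITY})")),
    ("headquartered_in", re.compile(rf"({ENTITY}), which is headquartered in ({ENTITY})")),
]


def extract_relations(sentence):
    """Return (subject, relation, object) triples found in the sentence."""
    triples = []
    for label, pattern in PATTERNS:
        for subject, obj in pattern.findall(sentence):
            triples.append((subject, label, obj))
    return triples


sentence = (
    "Steve Jobs is one of the founders of Apple, "
    "which is headquartered in California"
)
print(extract_relations(sentence))
# [('Steve Jobs', 'founder_of', 'Apple'), ('Apple', 'headquartered_in', 'California')]
```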
Different semantic classification models can also be used, depending on the end goal:
- By topic: sorting text into predefined categories, like “billing questions,” “account information,” etc.
- By sentiment: understanding the emotion of content – positive, negative, neutral – to identify how happy or upset a customer is or how people feel about your brand.
- By intent: tagging content based on level of interest. Is someone just browsing or ready to buy?
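As a rough illustration of classifying by topic and by sentiment, here is a keyword-counting sketch. The keyword sets are invented for this example, and counting keywords is a stand-in for the trained models a real tool would use. Intent classification is omitted, but it could follow the same shape.

```python
import re

# Invented keyword lists (assumption); real systems learn these from data.
TOPIC_KEYWORDS = {
    "billing questions": {"invoice", "charge", "refund", "bill"},
    "account information": {"password", "login", "email", "account"},
}
POSITIVE = {"love", "great", "happy", "excellent"}
NEGATIVE = {"hate", "terrible", "upset", "broken"}


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify_topic(text):
    """Pick the topic whose keywords appear most often, or 'other' if none do."""
    tokens = words(text)
    hits = {
        topic: sum(t in keywords for t in tokens)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "other"


def classify_sentiment(text):
    """Compare counts of positive and negative words."""
    tokens = words(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_topic("Where is my invoice?"))                       # billing questions
print(classify_sentiment("My order arrived broken and I am upset."))  # negative
```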
NLP and Machine Learning
As we mentioned earlier in this guide, the NLP field took off when machine learning was added. Machine learning accelerates and automates the text analysis and application of grammar rules. Machine learning algorithms study millions of texts to “learn” about human language, including syntax, semantics, and sentiment.
The more data the algorithm takes in, the more it learns and the better it becomes at analyzing and understanding content. It can apply what it has learned to future NLP analyses without being told what to do or how to do it.
There are two main types of machine learning used in NLP: supervised and unsupervised. With supervised machine learning, the algorithm is “trained” on a set of texts that have been marked up for what it should look for (e.g., parts of speech, named entities, or relationships). Then, it’s given unmarked text to analyze on its own. Common supervised NLP machine learning algorithms include:
- Support Vector Machines
- Bayesian Networks
- Maximum Entropy
- Conditional Random Fields
- Neural Networks/Deep Learning
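To make the supervised workflow more concrete, here is a small sketch that trains on labeled text and then labels new text. It uses multinomial Naive Bayes, a simple probabilistic classifier chosen here only because it fits in a few lines; it is not taken from the list above. The four training sentences and their labels are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add a log likelihood per word.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up, hand-labeled training data (assumption).
texts = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("free prize money"))           # spam
print(model.predict("project meeting on monday"))  # ham
```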
Unsupervised machine learning is when you train an algorithm with text that hasn’t been marked up. It uses frameworks like Latent Semantic Indexing (LSI) or Matrix Factorization to guide the learning.
NLP Use Cases
The potential for NLP is ever-expanding, especially as we become more enmeshed with the technology around us. As computers and machines expand their roles in our lives, our need to communicate with them grows. Many are surprised to discover just how many of our everyday interactions are already made possible by NLP. Let’s look at some examples.
NLP applications put NLP to work in specific use cases, such as intelligent search. The technology has many uses, especially in the business world, where people need help from computers in dealing with large volumes of unstructured text data.
For example, a company might benefit from understanding its customers’ opinions of the brand. However, those insights are hidden within millions of social media messages, and no human being is going to read them all. An NLP tool tuned for “sentiment analysis” could get the job done.
Other notable NLP applications include:
- Virtual Assistants and Chatbots – these familiar bits of software use NLP to answer questions and provide online help, among many other use cases. They are usually configured to learn from every “conversation” they have.
- Market Research – NLP can help marketers learn about their customers by analyzing human language in unstructured data such as chat threads and online comments. This process uses text classification, another NLP application. Market research could also require text extraction, wherein the NLP tool looks for specific words, such as a product name, and extracts the relevant text for customer analysis. The software may be able to infer purchase intent, among other capabilities.
- Speech Recognition – NLP tools can recognize spoken language, as virtual assistants like Amazon’s Alexa and Apple’s Siri do. The technology can also be put to work transcribing recordings and voice messages.
- Urgency Detection – An NLP tool can be trained to spot urgent issues in a stream of natural language. For example, a company that receives 100,000 support emails a day can use NLP urgency detection to find customers who need help right away by spotting phrases like “I’m locked out of my car” or “I am about to go into the hospital.”
- Spam Filtering – NLP enables the filtering and classification of emails to identify and block spam before it even gets to your inbox.
- Fake News Recognition – NLP is being used to analyze news sources for accuracy and bias to help flag questionable sources and prevent the proliferation of fake news.
- Recognition and prediction of diseases – using electronic health records, notes, and clinical trial reports, NLP can help diagnose and predict the likelihood of diseases and other health conditions.
- Talent Recruitment – Some companies use NLP to help find recruits with skills that match the job description and to identify prospects even before they enter the job market.
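The urgency detection item above can be sketched as a simple phrase matcher. Note the swap: the text describes a trained tool, while this sketch uses a hand-written list of patterns (an assumption made for brevity) to show the triage idea.

```python
import re

# Hand-picked urgency cues (assumption); a trained model would learn these.
URGENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\blocked out\b",
        r"\bhospital\b",
        r"\bright away\b",
        r"\burgent(ly)?\b",
        r"\bemergency\b",
    )
]


def urgency_score(message):
    """Count how many distinct urgency cues appear in the message."""
    return sum(1 for pattern in URGENT_PATTERNS if pattern.search(message))


def triage(messages):
    """Return only the urgent messages, most urgent first."""
    scored = [(urgency_score(m), m) for m in messages]
    urgent = [(score, m) for score, m in scored if score > 0]
    return [m for score, m in sorted(urgent, key=lambda pair: -pair[0])]


inbox = [
    "I'm locked out of my car",
    "How do I change my avatar?",
    "I am about to go into the hospital, please help right away",
]
for message in triage(inbox):
    print(message)
```

With this inbox, the hospital message comes first because it matches two cues, the locked-out message second, and the avatar question is dropped.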
Challenges of NLP
NLP is an impressive technology, but it’s still relatively early in its lifecycle. In 2030, people will probably be amazed at how primitive 2020’s state-of-the-art looks. Many challenges exist. These include making speech recognition better and achieving a more consistent and accurate understanding of language.
This problem is partly due to some current limitations of AI. There is more to intelligence than just language, after all.
For instance, a computer may not understand the meaning behind a statement like, “My wife is angry at me because I didn’t eat her mother’s dessert.” There are a lot of cultural distinctions embedded in the human language.
Things like sarcasm and irony are lost even on some humans, so imagine the difficulty in training a machine to detect them. Add colorful expressions and regional variations, and the task becomes even more difficult. Technology that reliably identifies such nuances does not yet exist.
What’s Next for NLP?
The next stage in the evolution of NLP will come from applying deep learning models to enable machines to understand language intent and present relevant information to employees, known as Natural Language Understanding (NLU). With NLU, the shift from employees finding information to information finding employees will accelerate, further unlocking productivity and innovation.
This is accomplished through the development of neural search. Neural Search is a new approach to retrieving information using neural networks. Neural Search is a next-generation NLP capability that will bring a step-change improvement to relevance, including:
- Better relevance for traditionally difficult queries thanks to language understanding
- Better relevance to all queries through a hybrid retrieval model and re-ranking
- Faster access to answers without having to read through lots of text
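To give a feel for what a hybrid retrieval model could mean, the sketch below blends a keyword-overlap score with a vector-similarity score and ranks documents on the blend. Everything in it is an assumption made for illustration: the three documents, the hand-written three-dimensional vectors that stand in for a neural encoder’s output, and the `alpha` weight. It does not describe any vendor’s implementation, and a real system would usually add a separate re-ranking stage with a heavier model, which is omitted here.

```python
import math
import re

# Toy corpus: doc id -> (text, hand-written "embedding"). All values are made up.
DOCS = {
    "holiday-policy": ("Company holidays include December 26", [0.9, 0.1, 0.0]),
    "travel-policy": ("How to book flights and hotels", [0.1, 0.9, 0.1]),
    "boxing-club": ("Boxing club meets every day after work", [0.2, 0.2, 0.9]),
}


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query, text):
    """Fraction of query words that also appear in the document."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(text)) / len(query_tokens)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_vector, alpha=0.5, top_k=2):
    """Rank documents by alpha * keyword score + (1 - alpha) * vector score."""
    scored = []
    for doc_id, (text, vector) in DOCS.items():
        blended = (
            alpha * keyword_score(query, text)
            + (1 - alpha) * cosine(query_vector, vector)
        )
        scored.append((blended, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


query = "Do we get Boxing Day off?"
query_vector = [0.95, 0.05, 0.1]  # pretend encoder output (assumption)

print(hybrid_search(query, query_vector, alpha=1.0))  # keyword score only
print(hybrid_search(query, query_vector, alpha=0.5))  # blended score
```

With keyword matching alone, the boxing club page ranks first because it shares the words “boxing” and “day” with the query. Blending in the vector score moves the holiday policy to the top, even though it shares no words with the query.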
Neural Search will be natively embedded in Sinequa and will be a significant focus of our roadmap to further refine and develop these capabilities as customers apply them to their businesses.
NLP Glossary of Terms
- Natural language processing (NLP): A human-to-machine interaction where the computer is able to understand or generate human-like language.
- Syntax: The arrangement of words and phrases to create well-formed sentences in a language.
- Semantics: The meaning of a word, phrase, sentence, or text.
- Parsing: The process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
- Part of speech tagging: Identifying the part of speech for every word, i.e., the nouns, verbs, adjectives, etc.
- Morpheme segmentation: Breaking words down into morphemes, which are defined as a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming).
- Lemmatization: Grouping together the inflected forms of a word so they can be analyzed by the word’s lemma or dictionary form.
- Stemming: Removing the ends of words to arrive at their root or base form. For example, a simple suffix-stripping stemmer might reduce driver, driving, and drives to driv.
- Word sense disambiguation: Identifying in which sense a word is being used according to its context. For example, knowing whether “date” refers to a day on the calendar, the fruit, or an outing.
- Named entity recognition: Identifying and categorizing parts of a text into groups. For example, names of people or names of places.
- Tokenization: The process of segmenting running text into sentences and words, or pieces called tokens, and at the same time throwing away certain characters, such as punctuation.
- Relationship extraction: Analyzing the semantic relationships in a text, often between two or more entities.