An Introduction to Natural Language Processing
Welcome to the first chapter of the NLP Grimoire, a series that will enable its diligent readers to harness the power of NLP and build cosmically powerful applications. All you need is some Python programming. This chapter is a little more theoretical, but I suggest you code along with the given examples. So, let’s get cooking.
What exactly is NLP?
Imagine 5, 10 or 15 years from now. How do you think you would interact with the tech of that time? Now some tech-savvy people might say we’d just think and it’ll happen and that’s not entirely incorrect. But most of you would say that you’d be able to talk to the machines and they’ll be able to understand you.
That is what Natural Language Processing aims to achieve. It is a collection of techniques and methods that make human language accessible to computers. But what is human language? We humans use a medium to express our feelings, thoughts and ideas, and that medium is called language. Whatever we speak, read, write and listen to is in the form of human language, otherwise known as natural language. So the goal of NLP is to process language in a way that enables computers to understand what is being communicated and respond appropriately. In some fancy words,
“Natural Language Processing is an interdisciplinary field of Artificial Intelligence, Computational Linguistics, Computer Science and many other disciplines and it focuses on the interactions between computers and human language”
Now that we have an idea of what NLP means, let’s talk about how it is interdisciplinary and how different domains of knowledge lend their methods to form something as powerful as NLP. Although there are many disciplines that influence NLP, we will discuss some major ones. Have a look at the diagram below.
Let’s talk about these disciplines for a moment and try to understand how NLP draws knowledge from them.
Computational linguistics and natural language processing are often used synonymously, but there is a difference. Computational linguistics is the study of language with the help of computational modelling; language itself is the focus. In NLP, by contrast, the focus is on developing techniques and algorithms that represent human language in a way computers can understand, and on enabling computers to make practical use of that understanding. Simply put, computational linguistics is the theoretical study of language, while NLP is a more practical, engineering-focused discipline.
Computer science has been relevant to NLP for a long time. From the times of classical NLP to the modern age of big data, computer science has been providing NLP with powerful, precise and efficient tools. Be it formal language theory for modelling language, algorithm analysis for studying the computational complexity of algorithms, or parallelization for processing huge amounts of text data, computer science has made the lives of NLP practitioners easier.
Artificial Intelligence is the branch of computer science which deals with the development of machines that can perform various tasks with human-level performance. One such task is interaction with humans. Achieving the level of human-like conversation is one of the indicators of achieving intelligence (see Turing Test). AI guides NLP to reason and make intelligent decisions concerning language.
Machine Learning is a branch of AI that focuses on developing systems that can automatically learn and improve with experience without being explicitly programmed. A lot of modern NLP can also be classified under ML research with a few distinct differences here and there like the discrete nature of text data or the compositional nature of language. We can use ML in conjunction with NLP to develop systems that can learn and get better with experience.
Neurolinguistics is the study of the brain’s capacity to comprehend, produce and learn human language. The ultimate goal is to understand how exactly our brain functions to deal with various aspects of language. Once we understand how it is done, we can try to model it mathematically and then translate that model into an intelligent system. Unlike vision, however, how the brain handles language is still not understood at a deeper level.
Now that we’ve talked about how NLP draws its power from various disciplines, let’s look at some of the common problems we have to solve in NLP.
Tasks in NLP
Analysis and processing of language are often categorized into three stages or phases: syntactic, semantic and pragmatic. Another initial stage that is often not mentioned is the lexical analysis stage, which logically precedes syntactic analysis. Each of these phases has many constituents, but we don’t use all of them every time, and we are not even required to follow all these phases. It all depends upon the problem at hand and the kind of data we have; this categorization just helps in understanding the concepts. Have a look at this picture.
Let’s go through these phases one by one.
Lexical analysis is the process of analyzing and trying to understand what words mean. We study lexemes in lexical analysis. A lexeme is just a sequence of characters often related to the morpheme, which is the smallest linguistic unit with some meaning. Lexical analysis is most often the first step in processing language. Many tasks fall under the umbrella of lexical analysis but we will have a look at the most common ones.
Tokenization is the process of splitting text into smaller units called tokens. Tokens can be characters, subwords, words, phrases or sentences. It is one of the most common tasks in NLP, and you will use it for almost every problem. Tokenization isn’t as straightforward as one might think. For example, languages like Chinese, Japanese and Thai do not separate words with white space, so we cannot tokenize on spaces as we do in English. Different languages call for different tokenizers, and the choice of tokenizer depends upon your requirements. Basic tokenization can be done easily in Python using the NLTK library.
Morphological analysis is the process of determining the morphemes of given words. Morphemes are the smallest units of words that carry some meaning. Unlike tokenization, it operates at the subword and word level. For example, “mangoes” can be decomposed into “mango” (the stem) and “es” (a suffix marking plurality). One Python library that can extract morphemes from words is Polyglot.
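To illustrate the idea without depending on Polyglot’s downloadable models, here is a deliberately crude, hypothetical suffix-stripper; the suffix list is made up for this example, and a real morphological analyzer would use learned models instead:

```python
# A toy morpheme splitter: strips a few common English suffixes.
# This is only an illustration of the concept; real analyzers (e.g. Polyglot)
# rely on trained morphology models, not a hard-coded suffix list.
SUFFIXES = ["es", "s", "ing", "ed", "ly"]  # illustrative, not exhaustive

def split_morphemes(word):
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Require a reasonably long stem so "is" doesn't become "i" + "s".
        if word.endswith(suffix) and len(stem) >= 3:
            return [stem, suffix]
    return [word]

print(split_morphemes("mangoes"))  # ['mango', 'es']
print(split_morphemes("quickly"))  # ['quick', 'ly']
```

A rule list like this breaks immediately on irregular forms (“geese”, “ran”), which is precisely why morphological analysis is treated as its own task.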
Stemming is the process of reducing a word to its stem or base word. Stemming usually serves as a quick solution: it is based upon handwritten rules and isn’t as accurate as one might think, but for most NLP applications that doesn’t matter much. Following is how you can perform stemming using NLTK.
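A minimal sketch with NLTK’s Porter stemmer (the word list is just illustrative):

```python
# Stemming with NLTK's Porter stemmer (rule-based suffix stripping).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "flies", "easily", "studies"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # e.g. "running" -> "run", "flies" -> "fli"
```

Notice that a stem like “fli” is not a dictionary word; the rules just chop suffixes, which is exactly the accuracy trade-off mentioned above.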
Lemmatization is almost like stemming; the only difference is that it reduces a word to a base form that is also a dictionary word. In doing so it achieves better accuracy than stemming, but it is slower. Following is the code to perform lemmatization using the NLTK library.
A sentence is the smallest unit of language that can convey a thought, idea or proposition. But sentences are not just a bunch of words; there is a certain level of structure that is important to understand in order to extract the correct meaning of the sentence. For example, consider the sentence “The boat sailed on the river sank” (source). This sentence is grammatically correct but difficult to understand unless you know that the word “sailed” is not used as the main verb here. Syntactic analysis not only ensures that sentences are grammatically correct but also helps us understand their meaning. Some common tasks that fall under this phase are as follows.
Parts of Speech Tagging
Parts of Speech tagging (PoS tagging) is the process of marking the words in a text with their respective, well, part of speech depending upon their meaning and context. This helps in validating grammar as well as understanding ambiguous sentences. You can perform PoS tagging using NLTK like this.
A grammar defines the rules that describe a language. We can use grammar to disambiguate sentences and try to understand their true meaning, and we can also verify whether sentences are grammatically correct. This area heavily employs formal language theory concepts like context-free grammars and finite and pushdown automata. We can define our own grammar in NLTK and construct syntax trees.
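A minimal sketch: the toy context-free grammar below is made up to cover a single example sentence, and NLTK’s chart parser builds its syntax tree.

```python
# Defining a toy context-free grammar and parsing with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dog chased the cat".split()
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
```

A sentence outside the grammar (say, “the cat slept”) would simply yield no trees, which is how grammaticality checking falls out of parsing.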
Semantic analysis is all about trying to understand the intended meaning of sentences. It considers the literal meaning as well as the context to make sense of the sentences. There is heavy use of machine learning methods in this phase, as we want our system to keep learning while maintaining stable knowledge about the language. Following are some of the tasks that can be categorized under the semantic analysis phase.
Word Sense Disambiguation
Word Sense Disambiguation (WSD) involves resolving the ambiguity that certain words introduce into a text by trying to figure out the actual sense in which they are used. For example, consider the word “bank”. It can have different meanings, and we humans can figure out the right one from context and our vast knowledge. But if we want computers to understand a text better, we need them to be able to differentiate between different senses too. WSD can use a knowledge-based, machine-learning-based or rule-based approach. You can perform WSD using NLTK as follows.
Semantic Role Labelling
In this task, we try to make sense of an input sentence in terms of the scene it portrays, so we can assign roles to entities and answer questions like who did what to whom. AllenNLP uses a powerful BERT-based SRL model for this purpose, and you can try it in their online demo.
Textual Entailment is the task of determining whether one piece of text can be inferred from another. Have a look at the examples from the 3rd PASCAL RTE Challenge (source) below. Our goal is to understand the semantics, i.e. the meaning, of these texts so we can predict the entailment.
The final phase is pragmatic analysis, which is concerned with discourse. It tries to establish the meaning of a conversation with all of its context. Previously, we were dealing with smaller units of speech like words or sentences, but this phase takes the context of the entire conversation or text and aims to understand exactly what is meant.
Coreference resolution is the task of determining which phrases in a document refer to the same entity. For example, in the sentence “John put the carrot on the plate and ate it”, what is “it” referring to? This problem becomes more complex as more entities are involved and documents get longer. We can use the Stanford NLP library for this purpose.
Now that we have understood what NLP is and what are some tasks we need to do in NLP, let’s have a quick overview of the applications of NLP.
Applications of NLP
According to an estimate cited by IBM (source), the amount of data in the world is expected to reach 175 zettabytes by 2025. One zettabyte is equal to a trillion gigabytes, and a trillion has 12 zeros. Around 80% of all this data is unstructured, including but not limited to audio, video and text documents. And since we humans communicate using language, we have built linguistic interfaces into a lot of machines. There are a ton of NLP applications, but I’m going to list a few major ones below.
- Chatbots and Virtual Assistants like Siri, Alexa etc
- Machine Translation like Google Translate
- Search Engines like Google, Bing etc
- Autocorrect and Autocomplete like in all smartphones
- Targeted Advertisement
- Email Filtering like spam blockers in Gmail
- Sentiment Analysis
- Text Summarization
- Automatic Image Captioning
- Text Generation
- …many others…
Key takeaways from this article are:
- NLP is a set of methods that enables humans and computers to interact in natural language
- Tasks in NLP can be divided into lexical, syntactic, semantic and pragmatic stages.
- NLP is used everywhere around us and is an important part of our technology.