An Introduction to Natural Language Processing

Welcome to the first chapter of the NLP Grimoire, a series that will enable its diligent readers to harness the power of NLP and create cosmically powerful applications. All you need to know is some Python programming. This chapter will be a little more theoretical, but I suggest you code along with the given examples. So, let’s get cooking.

What exactly is NLP?

Imagine the world 5, 10 or 15 years from now. How do you think you would interact with the technology of that time? Some tech-savvy people might say we’d just think and it’ll happen, and that’s not entirely incorrect. But most of you would say that you’d be able to talk to machines and they’d be able to understand you.
That is what Natural Language Processing aims to achieve. It is a collection of techniques and methods that makes human language accessible to computers. But what is human language? We humans use a medium to express our feelings, thoughts and ideas, and that medium is called language. Whatever we speak, read, write and listen to is in the form of human language, otherwise known as natural language. So, processing language in a way that enables computers to understand what is being communicated and to respond appropriately is the goal of NLP. In some fancy words,

“Natural Language Processing is an interdisciplinary field of Artificial Intelligence, Computational Linguistics, Computer Science and many other disciplines, and it focuses on the interactions between computers and human language.”

Now that we have an idea of what NLP means, let’s talk about how it is interdisciplinary and how different domains of knowledge lend their methods to form something as powerful as NLP. Although there are many disciplines that influence NLP, we will discuss some major ones. Have a look at the diagram below.

Relationship of NLP with other domains

Let’s talk about these disciplines for a moment and try to understand how NLP draws knowledge from them.

Computational Linguistics

Computational linguistics and natural language processing are often used synonymously, but there is a difference. Computational linguistics is the study of language with the help of computational modelling; language itself is the focus. In NLP, the focus is on developing techniques and algorithms that represent human language in a way computers can understand, and on enabling computers to make use of this understanding in a practical way. Simply put, computational linguistics is the theoretical study of language, while NLP is a more practical, engineering-focused discipline.

Computer Science

Computer science has been relevant to NLP for a long time. From the days of classical NLP to the modern age of big data, computer science has been providing NLP with powerful, precise and efficient tools. Be it formal language theory to model language, algorithm analysis to study the computational complexity of NLP methods, or parallelization to process huge amounts of text data, computer science has made the lives of NLP practitioners easier.

Artificial Intelligence

Artificial Intelligence is the branch of computer science that deals with the development of machines that can perform various tasks with human-level performance. One such task is interaction with humans. Holding a human-like conversation is one of the classic indicators of intelligence (see the Turing Test). AI guides NLP in reasoning and making intelligent decisions concerning language.

Machine Learning

Machine Learning is a branch of AI that focuses on developing systems that can automatically learn and improve with experience without being explicitly programmed. Much of modern NLP can also be classified as ML research, with a few distinctive differences here and there, such as the discrete nature of text data and the compositional nature of language. We can use ML in conjunction with NLP to develop systems that learn and get better with experience.

Neurolinguistics

Neurolinguistics is the study of the brain’s capacity to comprehend, produce and learn human language. The ultimate goal is to understand how exactly our brain functions when dealing with various aspects of language. Once we understand how it is done, we can try to model it mathematically and then translate that model into an intelligent system. Compared to vision, however, language is still far from being understood at this deeper level.

Now that we’ve talked about how NLP draws its power from different dimensions of various disciplines, let’s look at some of the common problems we have to solve in NLP.

Tasks in NLP

The analysis and processing of language is often categorized into three stages or phases: syntactic, semantic and pragmatic. Another initial stage that often goes unmentioned is the lexical analysis stage, which logically precedes syntactic analysis. Each of these phases has many constituent tasks, but we don’t use all of them every time, nor are we required to follow all the phases. It all depends upon the problem at hand and the kind of data we have; the categorization simply helps in organizing these concepts. Have a look at this picture.

Tasks in NLP

Let’s go through these phases one by one.

Lexical Analysis

Lexical analysis is the process of analyzing words and trying to understand what they mean. The object of study here is the lexeme: a basic unit of lexical meaning, closely related to the morpheme, which is the smallest linguistic unit that carries meaning. Lexical analysis is most often the first step in processing language. Many tasks fall under its umbrella, but we will have a look at the most common ones.

Tokenization

Tokenization is the process of splitting text into smaller units called tokens. Tokens can be characters, subwords, words, phrases or sentences. It is one of the most common tasks in NLP, and you will use it for almost every problem. Tokenization isn’t as straightforward as one might think. For example, in languages like Chinese, Japanese and Thai, words are not separated by white space, so we cannot tokenize on spaces as we do in English. Different languages call for different tokenizers, and the choice of tokenizer also depends upon the requirements of the task. Basic tokenization can be done easily in Python using the NLTK library.

Tokenization with NLTK
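
A minimal sketch of what that looks like (the sample sentence is my own; NLTK’s punkt tokenizer models are a one-time download, and very recent NLTK versions may name the resource punkt_tab):

```python
# Word- and sentence-level tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP makes human language accessible to computers. Isn't that exciting?"

print(sent_tokenize(text))  # splits the text into sentences
print(word_tokenize(text))  # splits the text into word-level tokens
```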

Morphological Analysis

Morphological analysis is the task of determining the morphemes of given words. Morphemes are the smallest units of words that carry some meaning. Unlike tokenization, it operates at the subword and word level. For example, “mangoes” can be decomposed into “mango” (the stem) and “es” (a suffix indicating plurality). Following is code to extract morphemes from words using the Python library Polyglot.

Morphological Analysis with Polyglot
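
A sketch along those lines, assuming Polyglot and its dependencies (PyICU, pycld2, Morfessor) are installed and the English morphology model has been fetched with polyglot download morph2.en; the example words are my own:

```python
# Morpheme segmentation with Polyglot (model: morph2.en).
from polyglot.text import Word

for token in ["mangoes", "preprocessing", "unhappiness"]:
    word = Word(token, language="en")
    print(token, "->", word.morphemes)  # a list of morpheme strings
```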

Stemming

Stemming is the process of reducing a word to its stem or base form. It usually serves as a quick solution based upon handwritten rules, so it isn’t as accurate as one might think: the resulting stem need not even be a dictionary word. But for most NLP applications that doesn’t matter much. Following is how you can perform stemming using NLTK.

Stemming with NLTK
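
For instance, a minimal sketch with NLTK’s classic Porter stemmer (the word list is my own):

```python
# Rule-based stemming with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "studies", "easily"]:
    print(word, "->", stemmer.stem(word))
# Note the rough edges: "flies" stems to "fli", which is not a real word.
```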

Lemmatization

Lemmatization is much like stemming; the difference is that it aims to reduce a word to a base form (the lemma) that is also a dictionary word. In doing so it achieves better accuracy than stemming, but it is slower. Following is the code to perform lemmatization using the NLTK library.

Lemmatization with NLTK
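
A minimal sketch using NLTK’s WordNet lemmatizer (the WordNet data is a one-time download; newer NLTK versions may also ask for the omw-1.4 resource):

```python
# Dictionary-based lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))             # -> mouse
print(lemmatizer.lemmatize("flies", pos="v"))   # -> fly (the PoS hint helps)
print(lemmatizer.lemmatize("better", pos="a"))  # -> good
```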

Syntactic Analysis

A sentence is the smallest unit of language that can convey a thought, idea or proposition. But sentences are not just a bunch of words; they have structure, and understanding that structure is essential to extracting the correct meaning of the sentence. For example, consider the sentence “The boat sailed on the river sank” (source). This sentence is grammatically correct but difficult to understand unless you realize that the word “sailed” is not the main verb here. Syntactic analysis not only verifies that sentences are grammatically correct but also helps us understand their meaning. Some common tasks that fall under this phase are as follows.

Parts of Speech Tagging

Parts of Speech tagging (PoS tagging) is the process of marking the words in a text with their respective, well, part of speech depending upon their meaning and context. This helps in validating grammar as well as understanding ambiguous sentences. You can perform PoS tagging using NLTK like this.

PoS Tagging with NLTK
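
A minimal sketch, reusing the garden-path sentence from above (the tagger model is a one-time download):

```python
# Part-of-speech tagging with NLTK's default perceptron tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import pos_tag, word_tokenize

sentence = "The boat sailed on the river sank"
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('boat', 'NN'), ('sailed', 'VBD'), ...] -- note how
# even the tagger is tempted to read "sailed" as the main past-tense verb.
```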

Syntactic Parsing

A grammar is a set of rules that describes a language. We can use a grammar to disambiguate sentences and try to recover their true meaning, and also to verify whether sentences are grammatically correct. Syntactic parsing heavily employs concepts from formal language theory like context-free grammars and finite and pushdown automata. We can define our own grammar in NLTK and construct syntax trees.

Syntactic Parsing with NLTK
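
Here is a toy sketch: we define a tiny context-free grammar by hand (my own example, not a grammar of real English) and let NLTK’s chart parser build the syntax tree:

```python
# Parsing with a hand-written context-free grammar in NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'the'
    N   -> 'dog' | 'cat'
    V   -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()  # renders the syntax tree as ASCII art
```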

Semantic Analysis

Semantic analysis is all about trying to understand the intended meaning of sentences. It considers the literal meaning as well as the context to make sense of the sentences. Machine learning methods are used heavily in this phase, since we want our systems to keep learning while maintaining stable knowledge about the language. Following are some of the tasks that can be categorized under the semantic analysis phase.

Word Sense Disambiguation

Word Sense Disambiguation (WSD) involves resolving the ambiguity that certain words introduce into a text by figuring out the actual sense in which they are used. For example, consider the word “bank”. It can have different meanings, and we humans can figure out the right one thanks to context and our vast knowledge. But if we want computers to understand a text better, we need them to be able to differentiate between the different senses too. WSD can use a knowledge-based, machine-learning-based or rule-based approach. You can perform WSD using NLTK as follows.

Word Sense Disambiguation with NLTK
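
A minimal sketch using NLTK’s implementation of the classic Lesk algorithm, a knowledge-based method that picks the WordNet sense whose definition overlaps most with the context (the sentence is my own):

```python
# Knowledge-based WSD with the Lesk algorithm in NLTK.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank")  # returns a WordNet synset (or None)
if sense is not None:
    print(sense.name(), "->", sense.definition())
```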

Semantic Role Labelling

In this task, we try to make sense of an input sentence in terms of the scene it portrays, assigning roles to the entities involved so we can answer questions like who did what to whom. AllenNLP provides a powerful BERT-based SRL model for this purpose, and you can use it online here.
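
A hedged sketch of how you might call it from Python, assuming allennlp and allennlp-models are installed; the model archive URL below is the one AllenNLP published at the time and may have moved since:

```python
# Semantic role labelling with AllenNLP's pretrained SRL BERT model.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "structured-prediction-srl-bert.2020.12.15.tar.gz"
)

result = predictor.predict(sentence="John gave the book to Mary.")
for verb in result["verbs"]:
    print(verb["description"])  # role-annotated frame, e.g. [ARG0: John] [V: gave] ...
```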

Textual Entailment

Textual Entailment is the task of determining whether the meaning of one piece of text (the hypothesis) can be inferred from another (the text). Have a look at the examples from the 3rd PASCAL RTE Challenge (source) below. Our goal is to understand the semantics, i.e. the meaning, of these texts so we can predict the entailment.

3rd PASCAL RTE Challenge
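
As an illustration (my own, not part of the RTE challenge materials), a pretrained natural language inference model from Hugging Face transformers can predict entailment for a text pair; this assumes transformers and torch are installed and downloads the roberta-large-mnli checkpoint:

```python
# Textual entailment with a pretrained NLI model (roberta-large-mnli).
from transformers import pipeline

classifier = pipeline("text-classification", model="roberta-large-mnli")

premise = "A dog is sleeping on the porch of the house."
hypothesis = "An animal is resting outdoors."

# Sequence-pair classification: labels are ENTAILMENT / NEUTRAL / CONTRADICTION.
print(classifier({"text": premise, "text_pair": hypothesis}))
```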

Pragmatic Analysis

The final phase is pragmatic analysis, which is concerned with discourse. It tries to establish the meaning of a conversation in its full context. Previously, we were dealing with smaller units of language like words or sentences, but this phase takes the context of the entire conversation or text into account and aims to understand exactly what is meant.

Co-Reference Resolution

In this task, we try to determine which phrases in a document refer to the same entity. For example, in the sentence “John put the carrot on the plate and ate it”, what is “it” referring to? The problem becomes more complex as more entities get involved and documents grow longer. We can use the Stanford NLP library for this purpose.
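
A hedged sketch of one way to do it: query a locally running Stanford CoreNLP server over HTTP (this assumes you have downloaded CoreNLP and started the server on port 9000):

```python
# Coreference resolution via a local Stanford CoreNLP server.
# Start the server first, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

text = "John put the carrot on the plate and ate it."
response = requests.post(
    "http://localhost:9000",
    params={"properties": json.dumps({"annotators": "coref", "outputFormat": "json"})},
    data=text.encode("utf-8"),
)

# Each coreference chain is a list of mentions that refer to the same entity.
for chain in response.json()["corefs"].values():
    print([mention["text"] for mention in chain])
```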

Now that we have understood what NLP is and what are some tasks we need to do in NLP, let’s have a quick overview of the applications of NLP.

Applications of NLP

According to an estimate cited by IBM (source), the amount of data in the world is expected to reach 175 zettabytes by 2025. One zettabyte is equal to a trillion gigabytes, and a trillion has 12 zeros. Out of all the data we have, around 80% is unstructured, including but not limited to audio, video and text documents. And since we humans communicate using language, we have built linguistic interfaces into a lot of machines. There are a ton of NLP applications, but I’m going to list a few major ones below.

  • Chatbots and Virtual Assistants like Siri, Alexa, etc.
  • Machine Translation like Google Translate
  • Search Engines like Google, Bing, etc.
  • Autocorrect and Autocomplete like in all smartphones
  • Targeted Advertisement
  • Email Filtering like spam blockers in Gmail
  • Sentiment Analysis
  • Text Summarization
  • Automatic Image Captioning
  • Text Generation
  • …many others…

Takeaways

Key takeaways from this article are:

  1. NLP is a set of methods that enables humans and computers to interact in natural language.
  2. Tasks in NLP can be divided into lexical, syntactic, semantic and pragmatic stages.
  3. NLP is used everywhere around us and is an important part of our technology.

All the examples can be found on my GitHub here. Follow me on my Medium, Twitter and LinkedIn for more content like this. Happy learning!
