Text Analytics and NLP

Text analytics is the process of extracting meaningful insights from text data and answering questions about it, such as how long its sentences and words are, how many words it contains, and whether it contains particular words. Let's understand this with an example.

Suppose we are conducting a survey based on news articles, and we have to find the top five countries that contributed the most to the field of space technology in the past 5 years. We will collect all the space technology-related news from the past 5 years using the Google News API. Then, we must extract the names of the countries mentioned in these news articles. We can perform this task using a file containing a list of all the countries in the world.

Next, we will create a dictionary in which the keys are the country names and the values are the number of times each country name is found in the news articles. To search for a country in the news articles, we can use a simple regular expression that matches the country name as a whole word. After we have searched all the news articles, we can sort the country names by their associated counts. In this way, we arrive at the five countries mentioned most often, which we take as the top five contributors to space technology in the last 5 years.

This is a typical example of text analytics, in which we are generating insights from text without getting into the semantics of the language.
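The survey described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete pipeline: the headlines are made up and stand in for articles fetched from a news API, the short country list stands in for the full file of country names, and the function names count_country_mentions and top_countries are hypothetical.

```python
import re


def count_country_mentions(articles, countries):
    """Map each country name to the number of times it appears in the articles."""
    counts = {}
    for country in countries:
        # \b matches a word boundary, so "India" does not match inside "Indiana".
        pattern = re.compile(r"\b" + re.escape(country) + r"\b")
        counts[country] = sum(len(pattern.findall(article)) for article in articles)
    return counts


def top_countries(counts, n=5):
    """Return the n (country, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up headlines (an assumption for illustration, not real data).
articles = [
    "India and Japan announced a joint lunar mission.",
    "Japan launched a new weather satellite.",
    "India tested a reusable launch vehicle.",
    "The United States and India signed a space agreement.",
]
# A short stand-in for the full list of country names.
countries = ["India", "Japan", "United States", "France"]

counts = count_country_mentions(articles, countries)
print(top_countries(counts, n=2))  # [('India', 3), ('Japan', 2)]
```

Note that the counting relies only on pattern matching. Nothing in the code needs to know what the articles actually say about each country.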

It is important here to note the difference between text analytics and NLP. The art of extracting useful insights from any given text data can be referred to as text analytics. NLP, on the other hand, helps us in understanding the semantics and the underlying meaning of text, such as the sentiment of a sentence, the top keywords in a text, and the parts of speech of different words. NLP is not restricted to text data; voice (speech) recognition and analysis also come under its domain. NLP can be broadly categorized into two types: Natural Language Understanding (NLU) and Natural Language Generation (NLG). These terms are explained here:

  • NLU: NLU refers to a process by which an inanimate object with computing power is able to comprehend spoken language. As mentioned earlier, Siri and Alexa use techniques such as Speech to Text to answer different questions, including inquiries about the weather, the latest news updates, live match scores, and more.
  • NLG: NLG refers to a process by which an inanimate object with computing power is able to communicate with humans in a language that they can understand or is able to generate human-understandable text from a dataset. Continuing with the example of Siri or Alexa, ask one of them about the chances of rainfall in your city. It will reply with something along the lines of, "Currently, there is no chance of rainfall in your city." It gets the answer to your query from different sources using a search engine and then summarizes the results. Then, it uses Text to Speech to relay the results in verbally spoken words.

So, when a human speaks to a machine, the machine interprets the language with the help of the NLU process. By using the NLG process, the machine generates an appropriate response and shares it with the human, thus making it easier for humans to understand the machine. These tasks, which are part of NLP, are not part of text analytics. Let's walk through the basics of text analytics and see how we can execute it in Python.
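As a first taste, the surface-level measures mentioned at the start of this section (word count and word length) need nothing more than Python's built-in string methods. The following is a minimal sketch; the function name text_stats and the choice of statistics are assumptions made for illustration.

```python
def text_stats(text):
    """Return simple surface-level statistics for a piece of text."""
    # Whitespace tokenization: any punctuation stays attached to its word.
    words = text.split()
    word_lengths = [len(word) for word in words]
    return {
        "word_count": len(words),
        "average_word_length": sum(word_lengths) / len(words) if words else 0.0,
        # max() returns the first of several equally long words.
        "longest_word": max(words, key=len) if words else "",
    }


stats = text_stats("The quick brown fox jumps over the lazy dog")
print(stats["word_count"])    # 9
print(stats["longest_word"])  # quick
```

Splitting on whitespace is the simplest possible tokenizer. It is good enough for a sentence without punctuation, but real text usually needs more careful tokenization.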

Before moving on to the exercises, let's define some prerequisites for running them. Whether you are using Windows, macOS, or Linux, you need to run your Jupyter Notebook in a virtual environment. You will also need to ensure that you have installed the requirements stated in the requirements.txt file at https://packt.live/3fJ4qap.

Exercise 1.01: Basic Text Analytics

In this exercise, we will perform some basic text analytics on a given piece of text, including searching for a particular word, finding the index of a word, and finding the word at a given position. Follow these steps to implement this exercise using the following sentence:

"The quick brown fox jumps over the lazy dog"

  1. Open a Jupyter Notebook.
  2. Assign the value 'The quick brown fox jumps over the lazy dog' to a variable named sentence. Insert a new cell and add the following code to implement this:

    sentence = 'The quick brown fox jumps over the lazy dog'
    sentence

  3. Check whether the word 'quick' belongs to that text using the following code:

    def find_word(word, sentence):
        return word in sentence

    find_word('quick', sentence)

    The preceding code will return the output True. Note that the in operator tests for a substring rather than a whole word.

  4. Find the character index at which the word 'fox' starts using the following code:

    def get_index(word, text):
        return text.index(word)

    get_index('fox', sentence)

    The code will return the output 16.

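  5. To find out the rank of the word 'lazy', that is, its zero-based position in the list of words, use the following code:

    get_index('lazy', sentence.split())

    This code generates the output 7. Because list indices start at 0, 'lazy' is the eighth word of the sentence.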

  6. To print the third word of the given text, use the following code:

    def get_word(text, rank):
        return text.split()[rank]

    get_word(sentence, 2)

    This will return the output brown.

  7. To print the third word of the given sentence in reverse order, use the following code:

    get_word(sentence,2)[::-1]

    This will return the output nworb.

  8. To concatenate the first and last words of the given sentence, use the following code:

    def concat_words(text):
        """
        This method will concat first and last
        words of given text
        """
        words = text.split()
        first_word = words[0]
        last_word = words[-1]
        return first_word + last_word

    concat_words(sentence)

    Note

    The triple quotes ( """ ) shown in the code snippet above mark the start and end of a docstring, a string literal placed as the first statement of a function to document what it does. Unlike a comment, which starts with #, a docstring is stored on the function and can be displayed with help(concat_words).

    The code will generate the output Thedog.

  9. To print the words at even positions, counting from 0, use the following code:

    def get_even_position_words(text):
        words = text.split()
        return [word for i, word in enumerate(words) if i % 2 == 0]

    get_even_position_words(sentence)

    This code generates the following output:

    ['The', 'brown', 'jumps', 'the', 'dog']

  10. To print the last three letters of the text, use the following code:

    def get_last_n_letters(text, n):
        return text[-n:]

    get_last_n_letters(sentence, 3)

    This will generate the output dog.

  11. To print the text in reverse order, use the following code:

    def get_reverse(text):
        return text[::-1]

    get_reverse(sentence)

    This code generates the following output:

    'god yzal eht revo spmuj xof nworb kciuq ehT'

  12. To print each word of the given text in reverse order, maintaining their sequence, use the following code:

    def get_word_reverse(text):
        words = text.split()
        return ' '.join(word[::-1] for word in words)

    get_word_reverse(sentence)

    This code generates the following output:

    'ehT kciuq nworb xof spmuj revo eht yzal god'

We are now well acquainted with basic text analytics techniques.
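One caveat about steps 3 and 4 is worth illustrating: the in operator and str.index() work on substrings, so they also match fragments of longer words. The following sketch shows one possible way to restrict the search to whole words with a regular expression. The helper name find_whole_word is hypothetical and is not part of the exercise's source code.

```python
import re


def find_whole_word(word, text):
    """Return True only if `word` occurs in `text` as a complete word."""
    # \b is a word boundary; re.escape guards against special characters in `word`.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None


sentence = 'The quick brown fox jumps over the lazy dog'

print('own' in sentence)                   # True: 'own' is a fragment of 'brown'
print(find_whole_word('own', sentence))    # False: 'own' is not a word on its own
print(find_whole_word('brown', sentence))  # True
```

Which behavior is appropriate depends on the question being asked. For counting mentions of a name, whole-word matching is usually the safer choice.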

Note

To access the source code for this specific section, please refer to https://packt.live/38Yrf77.

You can also run this example online at https://packt.live/2ZsCvpf.

In the next section, let's dive deeper into the various steps and subtasks in NLP.