Finding Text Similarity – Application of Feature Extraction
So far in this chapter, we have learned how to generate vectors from text. These vectors can be fed to machine learning algorithms to perform various tasks, but they can also be used directly for simple NLP tasks. Finding text similarity is one such task: we measure how similar two strings are by converting them into vectors and comparing those vectors. The technique is mainly used in full-text searching.
There are several techniques for measuring the similarity between two strings or texts. The most common ones are explained here:
- Cosine similarity: Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. As we know, the cosine of a zero-degree angle is 1 (so two identical vectors have a cosine similarity of 1), while the cosine of 180 degrees is -1 (so two opposite vectors have a cosine similarity of -1). The measure therefore ranges from 1 for identical vectors down to -1 for opposite ones. To use this technique for text similarity, we convert each text into a vector using one of the previously discussed techniques and then compute the similarity between the vectors (see the code sketch after the worked example below). This is calculated as follows:
similarity = A.B / (|A| |B|)
Here, A and B are the two vectors, A.B is the dot product of the two vectors, and |A| and |B| are the magnitudes of the two vectors.
- Jaccard similarity: This is another technique for calculating the similarity between two texts; it operates on the sets of terms that make up each text (as in a Bag-of-Words representation). The Jaccard similarity is calculated as the ratio of the number of terms that are common to both text documents to the total number of unique terms present across those texts.
Consider the following example. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective; he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms present across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker." So, the number of unique terms is 11.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
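To make both measures concrete, here is a minimal sketch (not part of the original exercise) that computes them for the two sentences above. It uses plain Python sets for the Jaccard similarity and NumPy count vectors for the cosine similarity; the naive tokenization (lowercasing and stripping punctuation) is a simplifying assumption:
import numpy as np
text1 = "I like detective Byomkesh Bakshi."
text2 = "Byomkesh Bakshi is not a detective; he is a truth seeker."
# naive tokenization: lowercase and strip trailing punctuation
tokens1 = [t.strip('.;').lower() for t in text1.split() if t.strip('.;')]
tokens2 = [t.strip('.;').lower() for t in text2.split() if t.strip('.;')]
# Jaccard similarity: |intersection| / |union| of the token sets
set1, set2 = set(tokens1), set(tokens2)
print(len(set1 & set2) / len(set1 | set2))   # 3/11 ≈ 0.27
# cosine similarity: dot product of count vectors divided by their magnitudes
vocab = sorted(set1 | set2)
vec1 = np.array([tokens1.count(w) for w in vocab])
vec2 = np.array([tokens2.count(w) for w in vocab])
print(vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))   # ≈ 0.35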
To get a better understanding of text similarity, we will complete an exercise.
Exercise 2.16: Calculating Text Similarity Using Jaccard and Cosine Similarity
In this exercise, we will calculate the Jaccard and cosine similarity for a given pair of texts. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary packages:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()
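If you have not used NLTK's tokenizer and WordNet lemmatizer before, you may also need to download their data files once (this is an assumption about your environment; skip it if the resources are already installed):
import nltk
nltk.download('punkt')
nltk.download('wordnet')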
- Create a function to extract the Jaccard similarity between a pair of sentences by adding the following code:
def extract_text_similarity_jaccard(text1, text2):
    """
    This method will return Jaccard similarity between two texts
    after lemmatizing them.
    :param text1: text1
    :param text2: text2
    :return: similarity measure
    """
    lemmatizer = WordNetLemmatizer()
    words_text1 = [lemmatizer.lemmatize(word.lower()) \
                   for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower()) \
                   for word in word_tokenize(text2)]
    nr = len(set(words_text1).intersection(set(words_text2)))
    dr = len(set(words_text1).union(set(words_text2)))
    jaccard_sim = nr / dr
    return jaccard_sim
- Declare three variables named pair1, pair2, and pair3, as follows:
pair1 = ["What you do defines you", "Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", \
"Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]
- To check the Jaccard similarity between the statements in pair1, write the following code:
extract_text_similarity_jaccard(pair1[0],pair1[1])
The preceding code generates the following output:
0.14285714285714285
- To check the Jaccard similarity between the statements in pair2, write the following code:
extract_text_similarity_jaccard(pair2[0],pair2[1])
The preceding code generates the following output:
0.0
- To check the Jaccard similarity between the statements in pair3, write the following code:
extract_text_similarity_jaccard(pair3[0],pair3[1])
The preceding code generates the following output:
0.6
- To check the cosine similarity, first use TfidfVectorizer to get the TFIDF vector of each text. Define a helper function for this:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
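Note that todense() returns a numpy.matrix object, which some newer scikit-learn releases reject when it is passed to cosine_similarity. If you run into that error, one simple variant (an adjustment for such environments, not part of the original exercise) is to skip the densification entirely; cosine_similarity accepts sparse matrices, and row slicing such as tf_idf_vectors[0] still yields a single-row matrix:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    # keep the TFIDF matrix sparse; cosine_similarity handles sparse input
    return tfidf_vectorizer.fit_transform(corpus)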
- Create a corpus as a list of texts and get the TFIDF vectors of each text document. Add the following code to do this:
corpus = [pair1[0], pair1[1], pair2[0], \
pair2[1], pair3[0], pair3[1]]
tf_idf_vectors = get_tf_idf_vectors(corpus)
- To check the cosine similarity between the initial two texts, write the following code:
cosine_similarity(tf_idf_vectors[0],tf_idf_vectors[1])
The preceding code generates the following output:
array([[0.3082764]])
- To check the cosine similarity between the third and fourth texts, write the following code:
cosine_similarity(tf_idf_vectors[2],tf_idf_vectors[3])
The preceding code generates the following output:
array([[0.]])
- To check the cosine similarity between the fifth and sixth texts, write the following code:
cosine_similarity(tf_idf_vectors[4],tf_idf_vectors[5])
The preceding code generates the following output:
array([[0.80368547]])
So, in this exercise, we learned how to check the similarity between texts. As you can see, the texts "He is desperate" and "Is he not desperate?" returned a cosine similarity of 0.80 (meaning they are highly similar), whereas sentences such as "Once upon a time there lived a king." and "Who is your queen?", which share no terms, returned a similarity of zero.
Note
To access the source code for this specific section, please refer to https://packt.live/2Eyw0JC.
You can also run this example online at https://packt.live/2XbGRQ3.
Word Sense Disambiguation Using the Lesk Algorithm
The Lesk algorithm is used for word sense disambiguation. Suppose we have a sentence such as "On the bank of river Ganga, there lies the scent of spirituality" and another sentence, "I'm going to withdraw some cash from the bank". Here, the same word, "bank," is used in two different contexts. For text processing results to be accurate, the context of the words needs to be considered.
In the Lesk algorithm, the possible senses of an ambiguous word are stored as synsets, each with its own definition (gloss), in a lexical database such as WordNet. The definition that is closest to the meaning of the word as it is used in the context of the sentence is taken as the right one. Let's perform a simple exercise to get a better idea of how we can implement this.
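For reference, NLTK ships a ready-made implementation of the Lesk algorithm. The following minimal sketch shows how it can be called (it assumes the punkt and wordnet NLTK resources have been downloaded). In the exercise that follows, we will instead build a simplified variant ourselves using text vectorization and similarity:
from nltk import word_tokenize
from nltk.wsd import lesk
sentence = "I'm going to withdraw some cash from the bank"
# lesk() compares the context tokens with the gloss of each WordNet sense of "bank"
sense = lesk(word_tokenize(sentence.lower()), 'bank')
print(sense, '-', sense.definition())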
Exercise 2.17: Implementing the Lesk Algorithm Using String Similarity and Text Vectorization
In this exercise, we are going to implement the Lesk algorithm step by step using the techniques we have learned so far. We will find the meaning of the word "bank" in the sentence "On the banks of river Ganga, there lies the scent of spirituality" by comparing the sentence with candidate definitions using cosine similarity over TFIDF vectors. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
- Define a method for getting the TFIDF vectors of a corpus:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
- Define a method to convert the corpus into lowercase:
def to_lower_case(corpus):
    lowercase_corpus = [x.lower() for x in corpus]
    return lowercase_corpus
- Define a method to find the similarity between the sentence and the possible definitions and return the definition with the highest similarity score:
def find_sentence_definition(sent_vector, definition_vectors):
    """
    This method will find the cosine similarity of the sentence with
    the possible definitions and return the one with the
    highest similarity score along with the similarity score.
    """
    result_dict = {}
    for definition_id, def_vector in definition_vectors.items():
        sim = cosine_similarity(sent_vector, def_vector)
        result_dict[definition_id] = sim[0][0]
    definition = sorted(result_dict.items(), \
                        key=lambda x: x[1], \
                        reverse=True)[0]
    return definition[0], definition[1]
- Define a corpus of sentences in which the sentence containing "bank" and its two candidate definitions are the first three entries:
corpus = ["On the banks of river Ganga, there lies the scent "\
          "of spirituality",\
          "An institute where people can store extra "\
          "cash or money.",\
          "The land alongside or sloping down to a river or lake",\
          "What you do defines you",\
          "Your deeds define you",\
          "Once upon a time there lived a king.",\
          "Who is your queen?",\
          "He is desperate",\
          "Is he not desperate?"]
- Use the previously defined methods to find the definition of the word "bank":
lower_case_corpus = to_lower_case(corpus)
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
sent_vector = corpus_tf_idf[0]
definition_vectors = {'def1':corpus_tf_idf[1],\
'def2':corpus_tf_idf[2]}
definition_id, score = \
find_sentence_definition(sent_vector,definition_vectors)
print("The definition of word {} is {} with similarity of {}".\
format('bank',definition_id,score))
You will get the following output:
The definition of word bank is def2 with similarity of 0.14419130686278897
As we already know, def2 represents the riverbank sense, so we have found the correct definition of the word here. The sentence and def2 share terms such as "river," which gives them a non-zero TFIDF overlap, whereas the sentence shares almost nothing with def1. In this exercise, we have learned how to use text vectorization and text similarity to find the right definition of ambiguous words.
Note
To access the source code for this specific section, please refer to https://packt.live/39GzJAs.
You can also run this example online at https://packt.live/3fbxQwK.
Word Clouds
Unlike numeric data, there are only a few ways in which text data can be represented visually. The most popular is the word cloud: a visualization of a text corpus in which the size of each token (word) is proportional to the number of times it occurs in the corpus.
In the following exercise, we will use a Python library called wordcloud to build a word cloud from the 20newsgroups dataset.
Exercise 2.18: Generating Word Clouds
In this exercise, we will visualize the most frequently occurring words in the first 1,000 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Import the necessary libraries and dataset. Add the following code to do this:
import nltk
nltk.download('stopwords')
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 200
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from wordcloud import WordCloud
- Write the get_data() method to fetch the data:
def get_data(n):
    newsgroups_data_sample = fetch_20newsgroups(subset='train')
    # str() flattens the list of the first n posts into one long string,
    # which is why escape sequences such as '\n' show up as tokens later
    text = str(newsgroups_data_sample['data'][:n])
    return text
- Add a method to remove stop words:
def load_stop_words():
    other_stopwords_to_remove = ['\\n', 'n', '\\', '>', \
                                 'nLines', 'nI', "n'"]
    stop_words = stopwords.words('english')
    stop_words.extend(other_stopwords_to_remove)
    stop_words = set(stop_words)
    return stop_words
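As a side note, the wordcloud package also ships its own default stop word list; if you prefer, you could combine that list with the custom tokens above instead of using NLTK's. This is an optional variant, not part of the original exercise:
from wordcloud import STOPWORDS

def load_stop_words_alternative():
    # union of wordcloud's built-in stop words and our corpus-specific tokens
    extra = {'\\n', 'n', '\\', '>', 'nLines', 'nI', "n'"}
    return set(STOPWORDS) | extra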
- Add the generate_word_cloud() method to generate a word cloud object:
def generate_word_cloud(text, stopwords):
    """
    This method generates word cloud object
    with given corpus, stop words and dimensions
    """
    wordcloud = WordCloud(width=800, height=800, \
                          background_color='white', \
                          max_words=200, \
                          stopwords=stopwords, \
                          min_font_size=10).generate(text)
    return wordcloud
- Get 1,000 documents from the 20newsgroups data, get the stop word list, generate a word cloud object, and finally plot the word cloud with matplotlib:
text = get_data(1000)
stop_words = load_stop_words()
wordcloud = generate_word_cloud(text, stop_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The preceding code generates the following output:
So, in this exercise, we learned what word clouds are, how to generate them with Python's wordcloud library, and how to visualize them with matplotlib.
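As a small optional extension (not part of the original exercise), the WordCloud object can also write the rendered image straight to disk, which is handy outside a notebook; the filename here is just an example:
# save the rendered word cloud as a PNG file
wordcloud.to_file('newsgroups_wordcloud.png')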
Note
To access the source code for this specific section, please refer to https://packt.live/30eaSRn.
You can also run this example online at https://packt.live/2EzqLJJ.
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Other Visualizations
Apart from word clouds, there are various other ways of visualizing texts. Some of the most popular ways are listed here:
- Visualizing sentences using a dependency parse tree: The words that constitute a sentence generally depend on each other, and we depict these dependencies using a tree structure known as a dependency parse tree. For instance, in the sentence "God helps those who help themselves," two words depend directly on the word "helps": "God" (the one who helps) and "those" (the ones who are helped).
- Visualizing named entities in a text corpus: In this case, we extract the named entities from texts and highlight them by using different colors.
Let's go through the following exercise to understand this better.
Exercise 2.19: Other Visualizations – Dependency Parse Trees and Named Entities
In this exercise, we will look at two other popular visualization methods besides word clouds: dependency parse trees and named entity highlighting. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
import spacy
from spacy import displacy
!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
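Once the en_core_web_sm model has been downloaded, it can equivalently be loaded by name; this is an alternative to the import used above, not an additional required step:
nlp = spacy.load('en_core_web_sm')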
- Depict the sentence "God helps those who help themselves" using a dependency parse tree with the following code:
doc = nlp('God helps those who help themselves')
displacy.render(doc, style='dep', jupyter=True)
The preceding code generates the following output:
- Visualize the named entities of the text corpus by adding the following code:
text = 'Once upon a time there lived a saint named '\
'Ramakrishna Paramahansa. His chief disciple '\
'Narendranath Dutta also known as Swami Vivekananda '\
'is the founder of Ramakrishna Mission and '\
'Ramakrishna Math.'
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The preceding code generates the following output:
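Outside a Jupyter Notebook, displacy.render can return the markup as a string instead of displaying it inline, which you can then save to an HTML file and open in a browser. A minimal sketch (the filename is just an example):
html = displacy.render(doc2, style='ent', jupyter=False, page=True)
with open('entities.html', 'w', encoding='utf-8') as f:
    f.write(html)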
Note
To access the source code for this specific section, please refer to https://packt.live/313m4iD.
You can also run this example online at https://packt.live/3103fgr.
Now that you have learned about visualizations, we will solve an activity based on them to gain an even better understanding.
Activity 2.02: Text Visualization
In this activity, you will create a word cloud of the 50 most frequent words in a dataset. The dataset consists of random sentences that are not clean, so you will first need to clean them and then compute the frequencies of the unique words that remain.
Note
The text_corpus.txt file that's being used in this activity can be found at https://packt.live/2DiVIBj.
Follow these steps to implement this activity:
- Import the necessary libraries.
- Fetch the dataset.
- Perform the preprocessing steps, such as text cleaning, tokenization, and lemmatization, on the fetched data.
- Compute the frequencies of the unique words and extract the 50 most frequently occurring ones (one possible approach is sketched after this list).
- Create a word cloud for these top 50 words.
- Validate the word cloud by comparing it with the word frequencies you calculated.
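One possible way to approach the frequency-counting and word cloud steps is sketched here. It is only an illustration under assumed preprocessing (the tokens variable stands in for the cleaned, lemmatized tokens from the earlier steps), not the reference solution:
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 'tokens' is assumed to be the list of cleaned, lemmatized tokens produced earlier
top_50 = dict(Counter(tokens).most_common(50))
wordcloud = WordCloud(width=800, height=800, \
                      background_color='white').generate_from_frequencies(top_50)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()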
Note
The solution to this activity can be found on page 375.