Finding Text Similarity – Application of Feature Extraction
So far in this chapter, we have learned how to generate vectors from text. These vectors can be fed to machine learning algorithms to perform various tasks, but they can also be used directly for simple NLP tasks. Finding text similarity is one such task: we measure how similar two strings are by converting them into vectors and comparing those vectors. The technique is mainly used in full-text searching.
There are several techniques for measuring the similarity between two strings or texts. The most common ones are explained here:
- Cosine similarity: Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. As we know, the cosine of a zero-degree angle is 1 (so two identical vectors have a cosine similarity of 1), while the cosine of 180 degrees is -1 (so two opposite vectors have a cosine similarity of -1). The measure therefore ranges from 1 for identical vectors down to -1 for opposite ones. To use this technique for text similarity, we convert each text into a vector using one of the previously discussed techniques and then compute the similarity between the vectors (see the code sketch after the worked example below). This is calculated as follows:
similarity = A.B / (|A| |B|)
Here, A and B are the two vectors, A.B is the dot product of the two vectors, and |A| and |B| are the magnitudes of the two vectors.
- Jaccard similarity: This is another technique for calculating the similarity between two texts; it operates on the sets of terms that make up each text (as in a Bag-of-Words representation). The Jaccard similarity is calculated as the ratio of the number of terms that are common to both text documents to the total number of unique terms present across those texts.
Consider the following example. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective; he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms present across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker." So, the number of unique terms is 11.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
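To make both measures concrete, here is a minimal sketch (not part of the original exercise) that computes them for the two sentences above. It uses plain Python sets for the Jaccard similarity and NumPy count vectors for the cosine similarity; the naive tokenization (lowercasing and stripping punctuation) is a simplifying assumption:
import numpy as np
text1 = "I like detective Byomkesh Bakshi."
text2 = "Byomkesh Bakshi is not a detective; he is a truth seeker."
# naive tokenization: lowercase and strip trailing punctuation
tokens1 = [t.strip('.;').lower() for t in text1.split() if t.strip('.;')]
tokens2 = [t.strip('.;').lower() for t in text2.split() if t.strip('.;')]
# Jaccard similarity: |intersection| / |union| of the token sets
set1, set2 = set(tokens1), set(tokens2)
print(len(set1 & set2) / len(set1 | set2))   # 3/11 ≈ 0.27
# cosine similarity: dot product of count vectors divided by their magnitudes
vocab = sorted(set1 | set2)
vec1 = np.array([tokens1.count(w) for w in vocab])
vec2 = np.array([tokens2.count(w) for w in vocab])
print(vec1.dot(vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))   # ≈ 0.35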
To get a better understanding of text similarity, we will complete an exercise.
Exercise 2.16: Calculating Text Similarity Using Jaccard and Cosine Similarity
In this exercise, we will calculate the Jaccard and cosine similarity for a given pair of texts. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary packages:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()
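If you have not used NLTK's tokenizer and WordNet lemmatizer before, you may also need to download their data files once (this is an assumption about your environment; skip it if the resources are already installed):
import nltk
nltk.download('punkt')
nltk.download('wordnet')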
- Create a function to extract the Jaccard similarity between a pair of sentences by adding the following code:
def extract_text_similarity_jaccard(text1, text2):
    """
    This method will return Jaccard similarity between two texts
    after lemmatizing them.
    :param text1: text1
    :param text2: text2
    :return: similarity measure
    """
    lemmatizer = WordNetLemmatizer()
    words_text1 = [lemmatizer.lemmatize(word.lower()) \
                   for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower()) \
                   for word in word_tokenize(text2)]
    nr = len(set(words_text1).intersection(set(words_text2)))
    dr = len(set(words_text1).union(set(words_text2)))
    jaccard_sim = nr / dr
    return jaccard_sim
- Declare three variables named pair1, pair2, and pair3, as follows:
pair1 = ["What you do defines you", "Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", \
"Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]
- To check the Jaccard similarity between the statements in pair1, write the following code:
extract_text_similarity_jaccard(pair1[0],pair1[1])
The preceding code generates the following output:
0.14285714285714285
- To check the Jaccard similarity between the statements in pair2, write the following code:
extract_text_similarity_jaccard(pair2[0],pair2[1])
The preceding code generates the following output:
0.0
- To check the Jaccard similarity between the statements in pair3, write the following code:
extract_text_similarity_jaccard(pair3[0],pair3[1])
The preceding code generates the following output:
0.6
- To check the cosine similarity, first use TfidfVectorizer to get the TFIDF vector of each text. Define a helper function for this:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
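Note that todense() returns a numpy.matrix object, which some newer scikit-learn releases reject when it is passed to cosine_similarity. If you run into that error, one simple variant (an adjustment for such environments, not part of the original exercise) is to skip the densification entirely; cosine_similarity accepts sparse matrices, and row slicing such as tf_idf_vectors[0] still yields a single-row matrix:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    # keep the TFIDF matrix sparse; cosine_similarity handles sparse input
    return tfidf_vectorizer.fit_transform(corpus)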
- Create a corpus as a list of texts and get the TFIDF vectors of each text document. Add the following code to do this:
corpus = [pair1[0], pair1[1], pair2[0], \
pair2[1], pair3[0], pair3[1]]
tf_idf_vectors = get_tf_idf_vectors(corpus)
- To check the cosine similarity between the initial two texts, write the following code:
cosine_similarity(tf_idf_vectors[0],tf_idf_vectors[1])
The preceding code generates the following output:
array([[0.3082764]])
- To check the cosine similarity between the third and fourth texts, write the following code:
cosine_similarity(tf_idf_vectors[2],tf_idf_vectors[3])
The preceding code generates the following output:
array([[0.]])
- To check the cosine similarity between the fifth and sixth texts, write the following code:
cosine_similarity(tf_idf_vectors[4],tf_idf_vectors[5])
The preceding code generates the following output:
array([[0.80368547]])
So, in this exercise, we learned how to check the similarity between texts. As you can see, the texts "He is desperate" and "Is he not desperate?" returned a cosine similarity of 0.80 (meaning they are highly similar), whereas sentences such as "Once upon a time there lived a king." and "Who is your queen?", which share no terms, returned a similarity of zero.
Note
To access the source code for this specific section, please refer to https://packt.live/2Eyw0JC.
You can also run this example online at https://packt.live/2XbGRQ3.
Word Sense Disambiguation Using the Lesk Algorithm
The Lesk algorithm is used for word sense disambiguation. Suppose we have a sentence such as "On the bank of river Ganga, there lies the scent of spirituality" and another sentence, "I'm going to withdraw some cash from the bank". Here, the same word, "bank," is used in two different contexts. For text processing results to be accurate, the context of the words needs to be considered.
In the Lesk algorithm, the possible senses of an ambiguous word are stored as synsets, each with its own definition (gloss), in a lexical database such as WordNet. The definition that is closest to the meaning of the word as it is used in the context of the sentence is taken as the right one. Let's perform a simple exercise to get a better idea of how we can implement this.
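For reference, NLTK ships a ready-made implementation of the Lesk algorithm. The following minimal sketch shows how it can be called (it assumes the punkt and wordnet NLTK resources have been downloaded). In the exercise that follows, we will instead build a simplified variant ourselves using text vectorization and similarity:
from nltk import word_tokenize
from nltk.wsd import lesk
sentence = "I'm going to withdraw some cash from the bank"
# lesk() compares the context tokens with the gloss of each WordNet sense of "bank"
sense = lesk(word_tokenize(sentence.lower()), 'bank')
print(sense, '-', sense.definition())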
Exercise 2.17: Implementing the Lesk Algorithm Using String Similarity and Text Vectorization
In this exercise, we are going to implement the Lesk algorithm step by step using the techniques we have learned so far. We will find the meaning of the word "bank" in the sentence "On the banks of river Ganga, there lies the scent of spirituality" by comparing the sentence with candidate definitions using cosine similarity over TFIDF vectors. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
- Define a method for getting the TFIDF vectors of a corpus:
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
- Define a method to convert the corpus into lowercase:
def to_lower_case(corpus):
    lowercase_corpus = [x.lower() for x in corpus]
    return lowercase_corpus
- Define a method to find the similarity between the sentence and the possible definitions and return the definition with the highest similarity score:
def find_sentence_definition(sent_vector, definition_vectors):
    """
    This method will find the cosine similarity of the sentence with
    the possible definitions and return the one with the
    highest similarity score along with the similarity score.
    """
    result_dict = {}
    for definition_id, def_vector in definition_vectors.items():
        sim = cosine_similarity(sent_vector, def_vector)
        result_dict[definition_id] = sim[0][0]
    definition = sorted(result_dict.items(), \
                        key=lambda x: x[1], \
                        reverse=True)[0]
    return definition[0], definition[1]
- Define a corpus of sentences in which the sentence containing "bank" and its two candidate definitions are the first three entries:
corpus = ["On the banks of river Ganga, there lies the scent "\
          "of spirituality",\
          "An institute where people can store extra "\
          "cash or money.",\
          "The land alongside or sloping down to a river or lake",\
          "What you do defines you",\
          "Your deeds define you",\
          "Once upon a time there lived a king.",\
          "Who is your queen?",\
          "He is desperate",\
          "Is he not desperate?"]
- Use the previously defined methods to find the definition of the word "bank":
lower_case_corpus = to_lower_case(corpus)
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
sent_vector = corpus_tf_idf[0]
definition_vectors = {'def1':corpus_tf_idf[1],\
'def2':corpus_tf_idf[2]}
definition_id, score = \
find_sentence_definition(sent_vector,definition_vectors)
print("The definition of word {} is {} with similarity of {}".\
format('bank',definition_id,score))
You will get the following output:
The definition of word bank is def2 with similarity of 0.14419130686278897
As we already know, def2 represents the riverbank sense, so we have found the correct definition of the word here. The sentence and def2 share terms such as "river," which gives them a non-zero TFIDF overlap, whereas the sentence shares almost nothing with def1. In this exercise, we have learned how to use text vectorization and text similarity to find the right definition of ambiguous words.
Note
To access the source code for this specific section, please refer to https://packt.live/39GzJAs.
You can also run this example online at https://packt.live/3fbxQwK.
Word Clouds
Unlike numeric data, there are only a few ways in which text data can be represented visually. The most popular is the word cloud: a visualization of a text corpus in which the size of each token (word) is proportional to the number of times it occurs in the corpus.
In the following exercise, we will use a Python library called wordcloud to build a word cloud from the 20newsgroups dataset.
Exercise 2.18: Generating Word Clouds
In this exercise, we will visualize the most frequently occurring words in the first 1,000 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Import the necessary libraries and dataset. Add the following code to do this:
import nltk
nltk.download('stopwords')
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 200
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from wordcloud import WordCloud
- Write the get_data() method to fetch the data:
def get_data(n):
    newsgroups_data_sample = fetch_20newsgroups(subset='train')
    # str() flattens the list of the first n posts into one long string,
    # which is why escape sequences such as '\n' show up as tokens later
    text = str(newsgroups_data_sample['data'][:n])
    return text
- Add a method to remove stop words:
def load_stop_words():
    other_stopwords_to_remove = ['\\n', 'n', '\\', '>', \
                                 'nLines', 'nI', "n'"]
    stop_words = stopwords.words('english')
    stop_words.extend(other_stopwords_to_remove)
    stop_words = set(stop_words)
    return stop_words
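As a side note, the wordcloud package also ships its own default stop word list; if you prefer, you could combine that list with the custom tokens above instead of using NLTK's. This is an optional variant, not part of the original exercise:
from wordcloud import STOPWORDS

def load_stop_words_alternative():
    # union of wordcloud's built-in stop words and our corpus-specific tokens
    extra = {'\\n', 'n', '\\', '>', 'nLines', 'nI', "n'"}
    return set(STOPWORDS) | extra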
- Add the generate_word_cloud() method to generate a word cloud object:
def generate_word_cloud(text, stopwords):
    """
    This method generates word cloud object
    with given corpus, stop words and dimensions
    """
    wordcloud = WordCloud(width=800, height=800, \
                          background_color='white', \
                          max_words=200, \
                          stopwords=stopwords, \
                          min_font_size=10).generate(text)
    return wordcloud
- Get 1,000 documents from the 20newsgroups data, get the stop word list, generate a word cloud object, and finally plot the word cloud with matplotlib:
text = get_data(1000)
stop_words = load_stop_words()
wordcloud = generate_word_cloud(text, stop_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The preceding code generates the following output:
So, in this exercise, we learned what word clouds are, how to generate them with Python's wordcloud library, and how to visualize them with matplotlib.
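As a small optional extension (not part of the original exercise), the WordCloud object can also write the rendered image straight to disk, which is handy outside a notebook; the filename here is just an example:
# save the rendered word cloud as a PNG file
wordcloud.to_file('newsgroups_wordcloud.png')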
Note
To access the source code for this specific section, please refer to https://packt.live/30eaSRn.
You can also run this example online at https://packt.live/2EzqLJJ.
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Other Visualizations
Apart from word clouds, there are various other ways of visualizing texts. Some of the most popular ways are listed here:
- Visualizing sentences using a dependency parse tree: The words that constitute a sentence generally depend on each other, and we depict these dependencies using a tree structure known as a dependency parse tree. For instance, in the sentence "God helps those who help themselves," two words depend directly on the word "helps": "God" (the one who helps) and "those" (the ones who are helped).
- Visualizing named entities in a text corpus: In this case, we extract the named entities from texts and highlight them by using different colors.
Let's go through the following exercise to understand this better.
Exercise 2.19: Other Visualizations – Dependency Parse Trees and Named Entities
In this exercise, we will look at two other popular visualization methods besides word clouds: dependency parse trees and named entity highlighting. Follow these steps to complete this exercise:
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
import spacy
from spacy import displacy
!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
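Once the en_core_web_sm model has been downloaded, it can equivalently be loaded by name; this is an alternative to the import used above, not an additional required step:
nlp = spacy.load('en_core_web_sm')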
- Depict the sentence "God helps those who help themselves" using a dependency parse tree with the following code:
doc = nlp('God helps those who help themselves')
displacy.render(doc, style='dep', jupyter=True)
The preceding code generates the following output:
- Visualize the named entities of the text corpus by adding the following code:
text = 'Once upon a time there lived a saint named '\
'Ramakrishna Paramahansa. His chief disciple '\
'Narendranath Dutta also known as Swami Vivekananda '\
'is the founder of Ramakrishna Mission and '\
'Ramakrishna Math.'
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The preceding code generates the following output:
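Outside a Jupyter Notebook, displacy.render can return the markup as a string instead of displaying it inline, which you can then save to an HTML file and open in a browser. A minimal sketch (the filename is just an example):
html = displacy.render(doc2, style='ent', jupyter=False, page=True)
with open('entities.html', 'w', encoding='utf-8') as f:
    f.write(html)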
Note
To access the source code for this specific section, please refer to https://packt.live/313m4iD.
You can also run this example online at https://packt.live/3103fgr.
Now that you have learned about visualizations, we will solve an activity based on them to gain an even better understanding.
Activity 2.02: Text Visualization
In this activity, you will create a word cloud of the 50 most frequent words in a dataset. The dataset consists of random sentences that are not clean, so you will first need to clean them and then compute the frequencies of the unique words that remain.
Note
The text_corpus.txt file that's being used in this activity can be found at https://packt.live/2DiVIBj.
Follow these steps to implement this activity:
- Import the necessary libraries.
- Fetch the dataset.
- Perform the preprocessing steps, such as text cleaning, tokenization, and lemmatization, on the fetched data.
- Compute the frequencies of the unique words and extract the 50 most frequently occurring ones (one possible approach is sketched after this list).
- Create a word cloud for these top 50 words.
- Validate the word cloud by comparing it with the word frequencies you calculated.
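One possible way to approach the frequency-counting and word cloud steps is sketched here. It is only an illustration under assumed preprocessing (the tokens variable stands in for the cleaned, lemmatized tokens from the earlier steps), not the reference solution:
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# 'tokens' is assumed to be the list of cleaned, lemmatized tokens produced earlier
top_50 = dict(Counter(tokens).most_common(50))
wordcloud = WordCloud(width=800, height=800, \
                      background_color='white').generate_from_frequencies(top_50)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()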
Note
The solution to this activity can be found on page 375.