I want to measure the Jaccard similarity between texts in a pandas DataFrame. More precisely, I have groups of entities, and for each entity there is a text at several points in time. I want to analyse the text similarity (here the Jaccard similarity) over time, separately for each entity.
A minimal example to illustrate my point:
import pandas as pd
entries = [
    {'Entity_Id': 'Firm1', 'date': '2001-02-05', 'text': 'This is a text'},
    {'Entity_Id': 'Firm1', 'date': '2001-03-07', 'text': 'This is a text'},
    {'Entity_Id': 'Firm1', 'date': '2003-01-04', 'text': 'No similarity'},
    {'Entity_Id': 'Firm1', 'date': '2007-10-12', 'text': 'Some similarity'},
    {'Entity_Id': 'Firm2', 'date': '2001-10-10', 'text': 'Another firm'},
    {'Entity_Id': 'Firm2', 'date': '2005-12-03', 'text': 'Another year'},
    {'Entity_Id': 'Firm3', 'date': '2002-05-05', 'text': 'Something different'}
]
df = pd.DataFrame(entries)
Entity_Id   date         text
Firm1       2001-02-05   'This is a text'
Firm1       2001-03-07   'This is a text'
Firm1       2003-01-04   'No similarity'
Firm1       2007-10-12   'Some similarity'
Firm2       2001-10-10   'Another firm'
Firm2       2005-12-03   'Another year'
Firm3       2002-05-05   'Something different'
My desired output would be something like this:
Entity_Id   date         text                     Jaccard
Firm1       2001-02-05   'This is a text'         NaN
Firm1       2001-03-07   'This is a text'         1
Firm1       2003-01-04   'No similarity'          0
Firm1       2007-10-12   'Some similarity'        0.33
Firm2       2001-10-10   'Another firm'           NaN
Firm2       2005-12-03   'Another year'           0.33
Firm3       2002-05-05   'Something different'    NaN
That is, I want to compare all texts within a group of firms, regardless of the time interval between them, always comparing each text to the previous one. Therefore, the first entry for each firm is always empty, as there is no earlier text to compare with.
My approach is to shift the texts within each entity by one time step (to the next available date), and then to identify and mark the first report of each entity. (I fill the NaNs in text_shifted with the original text and drop those results later on; this is needed so the whole column can be tokenized.)
df = df.sort_values(['Entity_Id', 'date'], ascending=True)
# previous text of the same entity
df['text_shifted'] = df.groupby(['Entity_Id'])['text'].shift(1)
# mark the first report of each entity (no previous text to compare with)
df['IsNaN'] = df['text_shifted'].isnull().astype(int)
# fill the first report with its own text so the whole column can be tokenized
df['text_shifted'] = df['text_shifted'].fillna(df['text'])
I then compute the Jaccard similarity as follows:
def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)
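For instance, on two of the example texts (split naively here, just for illustration):

tokens_a = 'No similarity'.split()
tokens_b = 'Some similarity'.split()
jaccard_similarity(tokens_a, tokens_b)  # 1 shared token out of 3 distinct tokens -> 0.33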
However, I have to tokenize the input first. But if I do something like:
import nltk
df['text_tokens'] = df.text.apply(nltk.word_tokenize)
df['shift_tokens'] = df.text_shifted.apply(nltk.word_tokenize)
the tokenization takes ages on my real, non-simplified data, where each text has roughly 5,000 words and I have about 100,000 texts.
Is there any way I can speed up the process? Can I avoid the tokenization, or better still, use sklearn to calculate the similarity?
If I use the cosine similarity as suggested here: Cosine Similarity row-wise, I get my results pretty quickly. But I am stuck doing the same for Jaccard.
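For completeness, this is roughly how I put the pieces together at the moment (a sketch using the columns created above; the last step resets the marked first reports to NaN):

import numpy as np

# compare each text to the previous text of the same firm
df['Jaccard'] = df.apply(
    lambda row: jaccard_similarity(row['text_tokens'], row['shift_tokens']),
    axis=1
)
# the first report of each firm has nothing to compare with
df.loc[df['IsNaN'] == 1, 'Jaccard'] = np.nan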
One way to speed up the process could be parallel processing using Pandas on Ray.
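As a minimal sketch, assuming the Modin library (which grew out of Pandas on Ray) is installed and a Ray backend is available, the only code change is the import; the expensive apply calls are then distributed across the available cores:

import modin.pandas as pd  # drop-in replacement for plain pandas
import nltk

# entries as defined in the question; nltk.download('punkt') may be needed once
df = pd.DataFrame(entries)
df['text_tokens'] = df['text'].apply(nltk.word_tokenize)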
You can also try NLTK's implementation of jaccard_distance for the Jaccard similarity. I couldn't find any significant improvement in processing time for the similarity calculation itself, though it may work out better on a larger dataset.
I compared the NLTK implementation to your custom Jaccard similarity function on 200 text samples with an average length of 4 words/tokens:
NLTK jaccard_distance:
CPU times: user 3.3 s, sys: 30.3 ms, total: 3.34 s
Wall time: 3.38 s
Custom jaccard similarity implementation:
CPU times: user 3.67 s, sys: 19.2 ms, total: 3.69 s
Wall time: 3.71 s
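The NLTK variant can be wired in the same way as your custom function; note that jaccard_distance expects sets and returns a distance, so the similarity is one minus the result. A sketch using the token columns from the question:

from nltk.metrics.distance import jaccard_distance

def jaccard_similarity_nltk(query_tokens, document_tokens):
    # jaccard_distance returns 1 - |intersection| / |union| of the two sets
    return 1 - jaccard_distance(set(query_tokens), set(document_tokens))

df['Jaccard'] = df.apply(
    lambda row: jaccard_similarity_nltk(row['text_tokens'], row['shift_tokens']),
    axis=1
)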