Jaccard Similarity for Texts in a pandas DataFrame

I want to measure the Jaccard similarity between texts in a pandas DataFrame. More precisely, I have several groups of entities, and for each entity there is some text over a period of time. I want to analyse the text similarity (here, the Jaccard similarity) over time, separately for each entity.

A minimal example to illustrate my point:


import pandas as pd

entries = [
    {'Entity_Id':'Firm1', 'date':'2001-02-05', 'text': 'This is a text'},
    {'Entity_Id':'Firm1', 'date':'2001-03-07', 'text': 'This is a text'},
    {'Entity_Id':'Firm1', 'date':'2003-01-04', 'text': 'No similarity'},
    {'Entity_Id':'Firm1', 'date':'2007-10-12', 'text': 'Some similarity'},
    {'Entity_Id':'Firm2', 'date':'2001-10-10', 'text': 'Another firm'},
    {'Entity_Id':'Firm2', 'date':'2005-12-03', 'text': 'Another year'},
    {'Entity_Id':'Firm3', 'date':'2002-05-05', 'text': 'Something different'}
    ]

df = pd.DataFrame(entries)

Entity_Id date text

Firm1   2001-02-05   'This is a text' 
Firm1   2001-03-07   'This is a text'
Firm1   2003-01-04   'No similarity'
Firm1   2007-10-12   'Some similarity'
Firm2   2001-10-10   'Another firm'
Firm2   2005-12-03   'Another year'
Firm3   2002-05-05   'Something different'

My desired output would be something like this:

Entity_Id date text Jaccard

Firm1   2001-02-05   'This is a text'       NaN
Firm1   2001-03-07   'This is a text'       1
Firm1   2003-01-04   'No similarity'        0
Firm1   2007-10-12   'Some similarity'      0.33
Firm2   2001-10-10   'Another firm'         NaN 
Firm2   2005-12-03   'Another year'         0.33  
Firm3   2002-05-05   'Something different'  NaN 

That is, I would like to compare all text elements within a firm's group, regardless of the time interval between the texts, always comparing each text to the previous one. The first entry for each firm is therefore empty (NaN), as there is no earlier text to compare with.

My approach is to shift the texts within each Entity_Id group by one time interval (to the next available date), and then to identify and mark the first report of each entity. (I fill the NaNs in text_shifted with the original text and delete them later on; I need that for tokenizing the whole column.)

df = df.sort_values(['Entity_Id', 'date'], ascending=True)
df['text_shifted'] = df.groupby(['Entity_Id'])['text'].shift(1)
df['IsNaN'] = df['text_shifted'].isnull().astype(int)
df['text_shifted'] = df['text_shifted'].fillna(df['text'])
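On a cut-down version of the example data, the groupby/shift step above produces the previous text per firm, with a NaN in each firm's first row (a minimal sketch to illustrate the step):

```python
import pandas as pd

entries = [
    {'Entity_Id': 'Firm1', 'date': '2001-02-05', 'text': 'This is a text'},
    {'Entity_Id': 'Firm1', 'date': '2001-03-07', 'text': 'This is a text'},
    {'Entity_Id': 'Firm2', 'date': '2001-10-10', 'text': 'Another firm'},
]
df = pd.DataFrame(entries).sort_values(['Entity_Id', 'date'], ascending=True)

# Previous text within each firm; the first row per firm has no predecessor.
df['text_shifted'] = df.groupby('Entity_Id')['text'].shift(1)

print(df['text_shifted'].tolist())  # [nan, 'This is a text', nan]
```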

I then compute the Jaccard similarity as follows:

def jaccard_similarity(query, document):
    # Jaccard similarity: |intersection| / |union| of the two token sets
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)
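As a quick sanity check, this function reproduces the 0.33 in the desired output for 'No similarity' vs. 'Some similarity' (tokenized here with a plain str.split for brevity):

```python
def jaccard_similarity(query, document):
    # |intersection| / |union| of the two token sets
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)

a = 'No similarity'.split()    # ['No', 'similarity']
b = 'Some similarity'.split()  # ['Some', 'similarity']

# intersection = {'similarity'}, union = {'No', 'Some', 'similarity'}
print(round(jaccard_similarity(a, b), 2))  # 0.33
```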

However, I have to tokenize the input first. But if I do something like:

import nltk
df['text_tokens'] = df.text.apply(nltk.word_tokenize)
df['shift_tokens'] = df.text_shifted.apply(nltk.word_tokenize)

the tokenization takes far too long on the real, non-simplified data, where each text has roughly 5,000 words and I have about 100,000 texts.

Is there any way to speed up the process? Can I avoid the tokenization, or better still, use sklearn to calculate the similarity?
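If the exact NLTK tokenization rules (punctuation splitting, etc.) are not essential, pandas' vectorized str.split is dramatically cheaper than applying nltk.word_tokenize row by row. A minimal sketch on two toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['This is a text', 'No similarity'],
    'text_shifted': ['This is a text', 'This is a text'],
})

# Vectorized whitespace tokenization; much faster than a per-row
# nltk.word_tokenize, at the cost of not splitting off punctuation.
df['text_tokens'] = df['text'].str.split()
df['shift_tokens'] = df['text_shifted'].str.split()

print(df['text_tokens'].iloc[0])  # ['This', 'is', 'a', 'text']
```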

If I use the cosine similarity, as suggested here: Cosine Similarity row-wise, I get my results pretty quickly. But I am stuck doing the same with Jaccard.
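One sklearn-based route (a sketch under my own assumptions, not code from the linked answer): build a binary bag-of-words with CountVectorizer and compute Jaccard on the boolean rows, here simplified to consecutive rows of a single firm. Note that CountVectorizer's default tokenizer lowercases and drops single-character tokens, so results can differ slightly from a word_tokenize-based version:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['This is a text', 'This is a text', 'No similarity', 'Some similarity']

# Binary bag-of-words: one row per text, one column per vocabulary term.
X = CountVectorizer(binary=True).fit_transform(texts).toarray().astype(bool)

def row_jaccard(u, v):
    # Jaccard on boolean indicator vectors.
    union = np.logical_or(u, v).sum()
    return np.logical_and(u, v).sum() / union if union else np.nan

# Similarity of each text to the previous one (row i vs row i-1).
sims = [row_jaccard(X[i], X[i - 1]) for i in range(1, len(texts))]
# sims is approximately [1.0, 0.0, 0.33]
```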

Asked Oct 30 '22 by alex_rieber

1 Answer

One way to speed up the process could be parallel processing using Pandas on Ray.

You can try NLTK's implementation of jaccard_distance for the Jaccard similarity. I couldn't find any significant improvement in processing time, though (for calculating similarity); it may work out better on a larger dataset.
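For reference, nltk.metrics.distance.jaccard_distance operates on sets and returns a distance, so the similarity is 1 minus its value (a small sketch using the question's example texts):

```python
from nltk.metrics.distance import jaccard_distance

s1 = set('No similarity'.split())
s2 = set('Some similarity'.split())

# jaccard_distance returns a *distance*, so similarity = 1 - distance.
sim = 1 - jaccard_distance(s1, s2)
print(round(sim, 2))  # 0.33
```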

I tried comparing the NLTK implementation to your custom Jaccard similarity function (on 200 text samples with an average length of 4 words/tokens):

NLTK jaccard_distance:

CPU times: user 3.3 s, sys: 30.3 ms, total: 3.34 s
Wall time: 3.38 s

Custom jaccard similarity implementation:

CPU times: user 3.67 s, sys: 19.2 ms, total: 3.69 s
Wall time: 3.71 s
Answered Nov 09 '22 by Aniket