I want to measure the Jaccard similarity between texts in a pandas DataFrame. More precisely, I have groups of entities, and for each entity there is a text at several points in time. I want to analyse the text similarity (here the Jaccard similarity) over time, separately for each entity.
A minimal example to illustrate my point:
import pandas as pd
entries = [
    {'Entity_Id': 'Firm1', 'date': '2001-02-05', 'text': 'This is a text'},
    {'Entity_Id': 'Firm1', 'date': '2001-03-07', 'text': 'This is a text'},
    {'Entity_Id': 'Firm1', 'date': '2003-01-04', 'text': 'No similarity'},
    {'Entity_Id': 'Firm1', 'date': '2007-10-12', 'text': 'Some similarity'},
    {'Entity_Id': 'Firm2', 'date': '2001-10-10', 'text': 'Another firm'},
    {'Entity_Id': 'Firm2', 'date': '2005-12-03', 'text': 'Another year'},
    {'Entity_Id': 'Firm3', 'date': '2002-05-05', 'text': 'Something different'}
]
df = pd.DataFrame(entries)
Entity_Id   date         text
Firm1       2001-02-05   'This is a text'
Firm1       2001-03-07   'This is a text'
Firm1       2003-01-04   'No similarity'
Firm1       2007-10-12   'Some similarity'
Firm2       2001-10-10   'Another firm'
Firm2       2005-12-03   'Another year'
Firm3       2002-05-05   'Something different'
My desired output would be something like this:
Entity_Id   date         text                     Jaccard
Firm1       2001-02-05   'This is a text'         NaN
Firm1       2001-03-07   'This is a text'         1
Firm1       2003-01-04   'No similarity'          0
Firm1       2007-10-12   'Some similarity'        0.33
Firm2       2001-10-10   'Another firm'           NaN
Firm2       2005-12-03   'Another year'           0.33
Firm3       2002-05-05   'Something different'    NaN
That is, I want to compare all texts within a group of firms, regardless of the time interval between them, always comparing each text to the previous one. Therefore, the first entry for each firm is always empty, as there is no earlier text to compare with.
My approach is to shift the texts within each entity by one time step (to the next available date), and then to identify and mark the first report of each entity. (I fill the NaNs in text_shifted with the original text and drop those results later on; this is needed so the whole column can be tokenized.)
df = df.sort_values(['Entity_Id', 'date'], ascending=True)
# previous text of the same entity
df['text_shifted'] = df.groupby(['Entity_Id'])['text'].shift(1)
# mark the first report of each entity (no previous text to compare with)
df['IsNaN'] = df['text_shifted'].isnull().astype(int)
# fill the first report with its own text so the whole column can be tokenized
df['text_shifted'] = df['text_shifted'].fillna(df['text'])
I then compute the Jaccard similarity as follows:
def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)
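For instance, on two of the example texts (split naively here, just for illustration):

tokens_a = 'No similarity'.split()
tokens_b = 'Some similarity'.split()
jaccard_similarity(tokens_a, tokens_b)  # 1 shared token out of 3 distinct tokens -> 0.33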
However, I have to tokenize the input first. But if I do something like:
import nltk
df['text_tokens'] = df.text.apply(nltk.word_tokenize)
df['shift_tokens'] = df.text_shifted.apply(nltk.word_tokenize)
the tokenization takes ages on my real, non-simplified data, where each text has roughly 5,000 words and I have about 100,000 texts.
Is there any way I can speed up the process? Can I avoid the tokenization, or better still, use sklearn to calculate the similarity?
If I use the cosine similarity as suggested here: Cosine Similarity row-wise, I get my results pretty quickly. But I am stuck doing the same for Jaccard.
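For completeness, this is roughly how I put the pieces together at the moment (a sketch using the columns created above; the last step resets the marked first reports to NaN):

import numpy as np

# compare each text to the previous text of the same firm
df['Jaccard'] = df.apply(
    lambda row: jaccard_similarity(row['text_tokens'], row['shift_tokens']),
    axis=1
)
# the first report of each firm has nothing to compare with
df.loc[df['IsNaN'] == 1, 'Jaccard'] = np.nan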
One way to speed up the process could be parallel processing using Pandas on Ray.
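As a minimal sketch, assuming the Modin library (which grew out of Pandas on Ray) is installed and a Ray backend is available, the only code change is the import; the expensive apply calls are then distributed across the available cores:

import modin.pandas as pd  # drop-in replacement for plain pandas
import nltk

# entries as defined in the question; nltk.download('punkt') may be needed once
df = pd.DataFrame(entries)
df['text_tokens'] = df['text'].apply(nltk.word_tokenize)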
You can also try NLTK's implementation of jaccard_distance for the Jaccard similarity. I couldn't find any significant improvement in processing time for the similarity calculation itself, though it may work out better on a larger dataset.
I compared the NLTK implementation to your custom Jaccard similarity function on 200 text samples with an average length of 4 words/tokens:
NLTK jaccard_distance:
CPU times: user 3.3 s, sys: 30.3 ms, total: 3.34 s
Wall time: 3.38 s
Custom jaccard similarity implementation:
CPU times: user 3.67 s, sys: 19.2 ms, total: 3.69 s
Wall time: 3.71 s
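The NLTK variant can be wired in the same way as your custom function; note that jaccard_distance expects sets and returns a distance, so the similarity is one minus the result. A sketch using the token columns from the question:

from nltk.metrics.distance import jaccard_distance

def jaccard_similarity_nltk(query_tokens, document_tokens):
    # jaccard_distance returns 1 - |intersection| / |union| of the two sets
    return 1 - jaccard_distance(set(query_tokens), set(document_tokens))

df['Jaccard'] = df.apply(
    lambda row: jaccard_similarity_nltk(row['text_tokens'], row['shift_tokens']),
    axis=1
)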