
Python - compare n-grams across multiple text files

Tags: python, n-gram

First-time poster here. I am a new Python user with limited programming skills. Ultimately, I am trying to identify and compare n-grams across numerous text documents in the same directory. My analysis is somewhat similar to plagiarism detection: I want to calculate the percentage of text documents in which a particular n-gram can be found. For now, I am attempting a simpler version of the larger problem: comparing n-grams across two text documents. I have no problem identifying the n-grams, but I am struggling to compare them across the two documents. Is there a way to store the n-grams in a list to effectively compare which ones are present in both documents? Here's what I've done so far (forgive the naive coding). For reference, I use basic sentences below instead of the text documents I actually read in my code.

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if grams1 in trigrams2:
            print(grams1)
    return False

Thanks to everyone for your help!

jason623 asked Nov 09 '22 22:11


1 Answer

Collect the shared n-grams in a list, say common, inside the compare function: append each n-gram from trigrams1 that also appears in trigrams2, then return the list:

>>> # lower() ignores sentence case; list() is needed because ngrams() returns a one-shot iterator in NLTK 3.
>>> trigrams1 = list(ngrams(text1.lower().split(), n))
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...     common = []
...     for grams1 in trigrams1:
...         if grams1 in trigrams2:
...             common.append(grams1)
...     return common
...
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
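
For the larger goal in the question (the percentage of documents in which each n-gram occurs), one possible approach is to store each document's n-grams in a set and count how many sets contain each n-gram. The sketch below is only an illustration under some assumptions: word_ngrams and ngram_document_share are hypothetical helper names, the n-grams are built with plain zip instead of NLTK so the snippet has no dependencies, and the documents are assumed to be already loaded as strings.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    words = text.lower().split()
    # Zipping n staggered slices of the word list yields consecutive n-tuples.
    return set(zip(*(words[i:] for i in range(n))))


def ngram_document_share(texts, n):
    """Map each n-gram to the percentage of texts that contain it."""
    counts = Counter()
    for text in texts:
        # A set counts each n-gram at most once per document.
        counts.update(word_ngrams(text, n))
    return {gram: 100.0 * count / len(texts) for gram, count in counts.items()}


texts = ['Hello my name is Jason', 'My name is not Mike']

# Two documents: the shared n-grams are simply the set intersection.
common = word_ngrams(texts[0], 3) & word_ngrams(texts[1], 3)
print(common)

# Many documents: percentage of documents containing each n-gram.
share = ngram_document_share(texts, 3)
print(share[('my', 'name', 'is')])
```

Sets also make the membership test faster than scanning a list. For real files, texts could be filled by reading every file in the directory first; that step is left out here because it depends on your file layout.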
Irshad Bhat answered Nov 14 '22 23:11