I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find out the most frequently occurring word-pairs e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other
Can someone suggest a possible solution in python? This is a very large data set.
Any suggestion is highly appreciated
So this is what I tried after suggestions from @275365
@275365 I tried the following with input read from a file
def collect_pairs(file):
pair_counter = Counter()
for line in open(file):
unique_tokens = sorted(set(line))
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
print pair_counter
file = ('myfileComb.txt')
p=collect_pairs(file)
text file has same number of lines as the original one but has only unique tokens in a particular line. I don't know what am I doing wrong since when I run this it splits the words in letters rather than giving output as combinations of words. When I run this file it outputs split letters rather than combinations of words as expected. I dont know where I am making a mistake.
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
pair_counter = Counter()
for line in lines:
unique_tokens = sorted(set(line)) # exclude duplicates in same line and sort to ensure one word is always before other
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval
. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval
:
from itertools import combinations
from collections import Counter
import ast
def collect_pairs(file_name):
pair_counter = Counter()
for line in open(file_name): # these lines are each simply one long string; you need a list or tuple
unique_tokens = sorted(set(ast.literal_eval(line))) # eval will convert each line into a tuple before converting the tuple to a set
combos = combinations(unique_tokens, 2)
pair_counter += Counter(combos)
return pair_counter # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10) # for example
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With