Below is the input pandas dataframe I have.
I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below.
How can I do this using nltk or scikit-learn?
I wrote the code below, which takes a string as input. How can I extend it to a Series/DataFrame?
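import nltk
from nltk.collocations import BigramCollocationFinder
desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # requires the NLTK tokenizer data to be downloaded
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # (bigram, count) pairs; viewitems() exists only in Python 2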
The Series.value_counts() method returns how often each value occurs in a column of a pandas DataFrame. To use it, first get the Series object from the DataFrame: df['column_name'] returns a Series.
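As a minimal sketch of that idea, the snippet below counts values in a single column. The frame and its column name `word` are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical single-column frame, used only for illustration.
df = pd.DataFrame({"word": ["good", "bad", "good", "avoid", "good"]})

# df["word"] is a Series; value_counts() tallies each distinct value
# and sorts the result from most to least frequent.
counts = df["word"].value_counts()
print(counts)
```

This counts whole cell values, so it only gives word frequencies once each row holds a single token.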
Suppose we have some text in the column text of a pandas DataFrame df and want to find the w-shingles. Each string can be turned into a list of words with split and then unnested with explode, which gives one word per row. The index is preserved, so you can realign the result with the original Series.
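The split-then-explode step might look like the sketch below. The two sample rows are invented, and splitting on whitespace is an assumption; real text with punctuation would need a proper tokenizer.

```python
import pandas as pd

# Hypothetical data: a "text" column with two short documents.
df = pd.DataFrame({"text": ["good movie good acting", "pathetic avoid"]})

# str.split() turns each row into a list of words; explode() then emits
# one word per row while repeating the original index label.
words = df["text"].str.lower().str.split().explode()

# Unigram frequencies over the whole column.
word_counts = words.value_counts()
print(words)
print(word_counts)
```

Because the index labels repeat (0, 0, 0, 0, 1, 1 here), `words` can be joined back to `df` on the index if per-row context is needed.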
To find all n-grams, that is, all contiguous subsequences of length n, in a sequence xs, iterate over every possible starting index with range and extract the subsequence of length n with xs[i:i+n].
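A small pure-Python sketch of that slicing approach follows. The function name `ngrams` and the whitespace tokenization are choices made for this example; the sentence is the one from the question.

```python
from collections import Counter

def ngrams(xs, n):
    """Return every contiguous run of n items in xs, as tuples."""
    # The last valid start index is len(xs) - n, hence the + 1 in range().
    return [tuple(xs[i:i + n]) for i in range(len(xs) - n + 1)]

tokens = "john is a guy person you him guy person you him".split()

# Tuples are hashable, so Counter can tally them directly.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

If xs is shorter than n, the range is empty and the function returns an empty list rather than raising.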
n-grams isn't a valid variable or function name in Python. Please check the code you posted. What is the expected output? Also, you create empty counters and empty deques, but you never do anything with them.
If your data is like
import pandas as pd
df = pd.DataFrame([
'must watch. Good acting',
'average movie. Bad acting',
'good movie. Good acting',
'pathetic. Avoid',
'avoid'], columns=['description'])
You could use CountVectorizer from the sklearn package:
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
Which gives you (features are listed in alphabetical order):
                frequency
acting                  3
average                 1
average movie           1
avoid                   2
bad                     1
bad acting              1
good                    3
good acting             2
good movie              1
movie                   2
movie bad               1
movie good              1
must                    1
must watch              1
pathetic                1
pathetic avoid          1
watch                   1
watch good              1
EDIT
fit will just "train" your vectorizer: it splits the words of your corpus and builds a vocabulary from them. transform can then take a new document and create a vector of frequencies based on the vectorizer's vocabulary.
Here your training set is also your output set, so you can do both at the same time with fit_transform. Because you have 5 documents, it creates 5 vectors, stacked as the rows of a matrix. You want a single global vector, so you have to sum them.
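To make the fit/transform distinction concrete, here is a small sketch on an invented three-document corpus (not the data above). It assumes a reasonably recent scikit-learn and uses NumPy only to flatten the summed row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
corpus = ["good movie", "bad movie", "good acting"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(corpus)  # learns the vocabulary only, produces no counts

# transform() counts vocabulary terms in any document, including new ones;
# terms outside the vocabulary (such as "great") are ignored.
new_counts = vectorizer.transform(["good good great movie"])

# One row per document; summing down the rows gives corpus-wide totals.
per_document = vectorizer.transform(corpus)
totals = np.asarray(per_document.sum(axis=0)).ravel()

print(per_document.shape)  # (number of documents, vocabulary size)
print(dict(zip(sorted(vectorizer.vocabulary_), totals)))
```

Calling `fit_transform(corpus)` would produce the same `per_document` matrix in one step.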
EDIT 2
For big dataframes, you can speed up the frequencies computation by using:
frequencies = sum(sparse_matrix).data