How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have.

[image: input dataframe]

I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below.

[image: expected output]

How can I do this using nltk or scikit-learn?

I wrote the code below, which takes a string as input. How can I extend it to a Series/DataFrame?

import nltk
from nltk.collocations import BigramCollocationFinder
desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # needs the NLTK tokenizer data to be downloaded
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # Python 3: items() replaces viewitems()
GeorgeOfTheRF asked Apr 12 '16 11:04



1 Answer

If your data looks like this:

import pandas as pd
df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])

You could use CountVectorizer from scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])

Which gives you:

                frequency
acting          3
average         1
average movie   1
avoid           2
bad             1
bad acting      1
good            3
good acting     2
good movie      1
movie           2
movie bad       1
movie good      1
must            1
must watch      1
pathetic        1
pathetic avoid  1
watch           1
watch good      1
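
To see what the vectorizer is counting, here is a minimal sketch that builds unigram and bigram counts with only the standard library. The helper name `ngram_counts` is hypothetical, and the tokenization rule (lowercase, word tokens of two or more characters, punctuation dropped) is an assumption meant to mirror CountVectorizer's defaults, not a copy of its implementation.

```python
import re
from collections import Counter

docs = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

def ngram_counts(texts, n_values=(1, 2)):
    """Count word n-grams over all texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase, keep word tokens of 2+ characters
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        for n in n_values:
            # Slide a window of length n over the tokens of this document
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(docs)
for term, freq in sorted(counts.items()):
    print(term, freq)
```

Because punctuation is dropped before the n-grams are formed, a bigram such as "watch good" spans the sentence boundary in the first document.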

EDIT

fit just "trains" your vectorizer: it splits your corpus into words and builds a vocabulary from them. transform can then take a new document and create a frequency vector based on that vocabulary.

Here your training set is also the set you want counts for, so you can do both at once (fit_transform). Because you have 5 documents, it creates 5 vectors, stacked as the rows of a matrix. You want one global vector, so you have to sum the rows.
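
The split between fit and transform can be sketched as follows. This is only an illustration on the same five sentences; the `new_doc` text is a made-up example, and the variable names are not from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')

# fit only learns the vocabulary (unigrams and bigrams) from the corpus
vectorizer.fit(corpus)
vocabulary_size = len(vectorizer.vocabulary_)

# transform turns each document into one row of counts over that vocabulary
matrix = vectorizer.transform(corpus)
print(matrix.shape)  # (number of documents, vocabulary size)

# A new document is counted against the learned vocabulary only;
# n-grams that were not seen during fit are ignored.
new_doc = ['good acting, great script']  # hypothetical example text
new_counts = vectorizer.transform(new_doc)
seen = {
    term: int(new_counts[0, idx])
    for term, idx in vectorizer.vocabulary_.items()
    if new_counts[0, idx] > 0
}
print(seen)
```

Calling fit_transform(corpus) is equivalent to the fit and transform steps on the same corpus, done in one pass.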

EDIT 2

For big dataframes, you can speed up the frequency computation by using:

frequencies = sum(sparse_matrix).data
Till answered Oct 13 '22 21:10