How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have.

[image: input dataframe]

I want to find the frequency of unigrams and bigrams. A sample of the output I am expecting is shown below.

[image: expected output]

How can I do this using nltk or scikit-learn?

I wrote the code below, which takes a string as input. How can I extend it to a Series/DataFrame?

import nltk
from nltk.collocations import BigramCollocationFinder
desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # requires NLTK's tokenizer data to be downloaded
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # viewitems() exists only in Python 2

asked Apr 12 '16 11:04

GeorgeOfTheRF

1 Answer

If your data looks like this:

import pandas as pd
df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])

You could use scikit-learn's CountVectorizer:

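from sklearn.feature_extraction.text import CountVectorizer
# Count word-level unigrams and bigrams (ngram_range=(1, 2))
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
# One row per document, one column per n-gram
sparse_matrix = word_vectorizer.fit_transform(df['description'])
# Add the rows together to get counts over the whole column
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])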

Which gives you:

                frequency
acting          3
average         1
average movie   1
avoid           2
bad             1
bad acting      1
good            3
good acting     2
good movie      1
movie           2
movie bad       1
movie good      1
must            1
must watch      1
pathetic        1
pathetic avoid  1
watch           1
watch good      1
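As a cross-check that does not depend on scikit-learn, here is a minimal sketch that counts the same n-grams with only the standard library. The helper name `ngram_counts` and the regular expression are assumptions made for illustration; the pattern is meant to imitate CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), not to reproduce it exactly in every edge case.

```python
import re
from collections import Counter

# Assumption: approximate CountVectorizer's default tokenization
# (lowercase text, keep tokens of two or more word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def ngram_counts(texts, ns=(1, 2)):
    """Count n-grams over an iterable of strings (a list or a pandas Series)."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        for n in ns:
            # Slide a window of length n over the tokens of this document.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

descriptions = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

# Passing df['description'] instead of the list should behave the same way.
counts = ngram_counts(descriptions)
for ngram, freq in counts.most_common(5):
    print(ngram, freq)
```

Like CountVectorizer, this sketch lets bigrams span sentence boundaries inside one description (for example "watch good"), but never across two rows.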

EDIT

fit just "trains" your vectorizer: it splits your corpus into words and builds a vocabulary from them. transform can then take a new document and create a frequency vector based on the vectorizer's vocabulary.

Here your training set is also the set you want counts for, so you can do both at once (fit_transform). Because you have 5 documents, it creates 5 vectors, stacked as the rows of a matrix. You want a single global vector, so you have to sum the rows.
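To make the two steps concrete, here is a small sketch that calls fit and transform separately on the sample data and then sums the per-document vectors. The variable names and the extra sentence passed to transform at the end are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')

# Step 1: learn the vocabulary (unigrams and bigrams) from the corpus.
vectorizer.fit(corpus)

# Step 2: turn each document into a count vector over that vocabulary.
doc_vectors = vectorizer.transform(corpus)
print(doc_vectors.shape)  # (number of documents, vocabulary size)

# Collapse the per-document rows into one vector of corpus-wide counts.
totals = np.asarray(doc_vectors.sum(axis=0)).ravel()

# transform also accepts unseen text; n-grams outside the vocabulary are ignored.
new_vector = vectorizer.transform(['good acting, great plot'])
print(new_vector.sum())
```

Calling fit_transform(corpus) should give the same matrix as fit followed by transform on the same corpus.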

EDIT 2

For big dataframes, you can speed up the frequency computation by reading the stored values of the summed row directly instead of converting it to a dense array:

frequencies = sum(sparse_matrix).data

answered Oct 13 '22 21:10

Till

Tags: pandas, nlp, nltk, text-mining, scikit-learn
