Count distinct words from a Pandas Data Frame

Tags: python, text, pandas

I have a pandas DataFrame in which one column contains text. I'd like to get a list of the unique words that appear across the entire column, splitting on spaces only.

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

The output should look like this:

['my','nickname','is','ft.jgt','someone','going','to','place']

It wouldn't hurt to get a count as well, but it is not required.

asked Sep 21 '13 19:09

ADJ

3 Answers

Use a set to collect the unique elements.

First clean up the column by lower-casing the strings and splitting them on whitespace:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to the set.update method, which adds its words to the set and so accumulates the unique values. Use apply to call it on every row:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}

(A set is unordered, so the words may print in a different order.)
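
If the order of first appearance matters, as in the output the question asks for, a set will not preserve it. A minimal sketch, assuming a pandas version that provides Series.explode (0.25 or later), that flattens the split lists and keeps each word the first time it is seen:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Lower-case, split on whitespace, then flatten to one word per row.
words = df['text'].str.lower().str.split().explode()

# unique() keeps values in order of first appearance.
unique_words = words.unique().tolist()
print(unique_words)
```

For the sample data this should give the words in the same order as the list in the question.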

Or, as suggested in the comments, use a Counter instead of a set to get the counts as well:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
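
The Counter output is not shown above. As a sketch of how the result might be inspected, assuming the same sample data as the question (the names word_lists and counts are only for illustration):

```python
from collections import Counter

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

word_lists = df['text'].str.lower().str.split()

# Feed every word from every row into a single Counter.
counts = Counter(word for words in word_lists for word in words)

print(counts['is'])           # how often one word occurs
print(counts.most_common(2))  # the two most frequent words
print(sorted(counts))         # the distinct words, alphabetically
```

Building the Counter from a generator avoids calling apply purely for its side effect, which also skips the Series of None values that apply would return.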

answered Oct 23 '22 11:10

Zeugma

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
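
Note that the counts above are case-sensitive, so 'My' and 'my' are tallied separately, and on Python 3 items() returns a view rather than a list. A hedged variant that lower-cases first, which is what the expected output in the question implies:

```python
from collections import Counter

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# Join the rows, lower-case, and split on single spaces as in the answer above.
counts = Counter(" ".join(r1).lower().split(" "))

unique_words = list(counts)  # keys keep first-seen order on Python 3.7+
print(unique_words)
print(counts.most_common())
```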

answered Oct 23 '22 13:10

Ofir Israel

If you want to do it from the DataFrame construct:

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

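# The top-level pd.value_counts is deprecated in recent pandas; call value_counts on a Series instead.
df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)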

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64
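
The sum above comes back as float64 because a row that lacks a given word contributes NaN for it before the totals are added up. A sketch that sidesteps this, assuming pandas 0.25 or later for Series.explode, by flattening first and counting once:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# One word per row, then a single value_counts over the whole column.
counts = df['text'].str.split(' ').explode().value_counts()

print(counts['is'])         # count for a single word
print(counts.sort_index())  # all words, ordered by label
```

Like the answer above, this keeps the original case, so 'My' and 'my' remain separate entries; add .str.lower() before the split to merge them.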

If you want more flexible tokenization, use nltk and its tokenize module.
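
As a rough illustration of what more flexible tokenization can mean, the sketch below uses a regular expression from the standard library rather than nltk, so it is a stand-in and not nltk's tokenizer. The pattern, the tokenize helper, and the punctuation added to the sample rows are all assumptions chosen for this example: the pattern keeps dotted names such as ft.jgt together while dropping surrounding punctuation.

```python
import re
from collections import Counter

# Runs of word characters, optionally joined by inner dots (e.g. "ft.jgt").
TOKEN_RE = re.compile(r"\w+(?:\.\w+)*")

def tokenize(text):
    """Return lower-cased tokens from text (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())

# Sample rows with trailing punctuation added for illustration.
rows = ['My nickname is ft.jgt.', 'Someone is going to my place!']
counts = Counter(token for row in rows for token in tokenize(row))
print(list(counts))
```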

answered Oct 23 '22 11:10

cwharland
