I have a pandas DataFrame in which one column contains text. I'd like to get a list of the unique words appearing across the entire column (splitting on spaces only).
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
The output should look like this:
['my','nickname','is','ft.jgt','someone','going','to','place']
It wouldn't hurt to get a count as well, but it is not required.
You can use the nunique() method to count the number of unique values in a pandas Series, or in each column of a DataFrame. Note that it counts whole cell values, not the individual words inside them.
How do you count the number of occurrences in a data frame? To count how often each value occurs in a column, use the pandas value_counts() method. For example, df['condition'].value_counts() returns the frequency of each unique value in the column "condition".
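To make the difference concrete, here is a minimal sketch using the sample data from the question. It assumes a pandas version that has Series.explode (0.25 or later); the variable names are only illustrative.

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# On the raw column, nunique() counts distinct cell values (whole sentences).
n_sentences = df['text'].nunique()

# Lower-case, split on whitespace, then explode() so each word gets its own row.
words = df['text'].str.lower().str.split().explode()

# On the exploded Series, nunique() counts distinct words.
n_words = words.nunique()

print(n_sentences, n_words)
```

With this data the first number is 2 (two distinct sentences) and the second is 8 (eight distinct lower-cased words).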
Use a set to collect the unique elements.
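First clean up df['text'] by converting the strings to lower case and splitting them into words (str.split() with no argument splits on any run of whitespace; pass ' ' to split on single spaces only):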
df['text'].str.lower().str.split()
Out[43]:
0 [my, nickname, is, ft.jgt]
1 [someone, is, going, to, my, place]
Each list in this column can be passed to the set.update method to accumulate the unique values. Use apply to do so:
results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)
{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}
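A set does not keep the words in any particular order, while the question asks for a list in order of first appearance. One possible sketch, assuming Python 3.7+ (where dict preserves insertion order), de-duplicates with dict.fromkeys instead of a set. The names token_lists and unique_words are made up for this example.

```python
from itertools import chain

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# One list of lower-cased tokens per row, splitting on single spaces only.
token_lists = df['text'].str.lower().str.split(' ')

# Flatten the per-row lists and drop duplicates while keeping first-seen order.
unique_words = list(dict.fromkeys(chain.from_iterable(token_lists)))

print(unique_words)
# ['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']
```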
Or, as suggested in the comments, use a Counter() in the same way to get the counts as well:
from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
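Once the Counter is filled, both the unique words and their frequencies can be read from it. The sketch below is one way to do that. It uses a plain for loop instead of apply, only because apply(results.update) also builds a throwaway Series of None values; the result is the same.

```python
from collections import Counter

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

results = Counter()
for tokens in df['text'].str.lower().str.split():
    # Counter.update adds one to the count of every token in the list.
    results.update(tokens)

unique_words = list(results)        # the distinct words (the Counter's keys)
top_two = results.most_common(2)    # the two most frequent (word, count) pairs

print(unique_words)
print(top_two)
```

Here 'my' and 'is' each occur twice and every other word occurs once.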
Use collections.Counter:
>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> list(Counter(" ".join(r1).split(" ")).items())
[('My', 1), ('nickname', 1), ('is', 2), ('ft.jgt', 1), ('Someone', 1), ('going', 1), ('to', 1), ('my', 1), ('place', 1)]
If you want to compute the counts from the DataFrame instead:
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
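# The top-level pd.value_counts is deprecated in recent pandas; call Series.value_counts instead.
df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)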
My 1
Someone 1
ft.jgt 1
going 1
is 2
my 1
nickname 1
place 1
to 1
dtype: float64
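The counts above come out as float64 because the per-row results are aligned into a wide table full of NaN before summing. A possible alternative, assuming pandas 0.25 or later for Series.explode, flattens the words first and counts once, which keeps the counts as integers. Like the code above it is case-sensitive, so 'My' and 'my' stay separate.

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Split on single spaces, put each word on its own row, then count occurrences.
counts = df['text'].str.split(' ').explode().value_counts()

# Sort by word so the result is easy to compare with the table above.
counts = counts.sort_index()

print(counts)
```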
If you want more flexible tokenization, use nltk and its tokenize module.
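If installing nltk is not an option, a regular expression can serve as a rough stand-in. To be clear, the sketch below does not use nltk at all. It uses the standard-library re module, the TOKEN pattern is just one guess at what counts as a word, and the third sentence is made-up sample data added to show punctuation handling.

```python
import re
from collections import Counter

import pandas as pd

# The first two rows are from the question; the third is hypothetical.
r1 = ['My nickname is ft.jgt',
      'Someone is going to my place',
      'Is this my place, really?']
df = pd.DataFrame(r1, columns=['text'])

# A word is a run of letters/digits, optionally joined by '.' or "'"
# (so 'ft.jgt' stays whole, but trailing ',' and '?' are dropped).
TOKEN = re.compile(r"[a-z0-9]+(?:[.'][a-z0-9]+)*")

counts = Counter()
for text in df['text'].str.lower():
    counts.update(TOKEN.findall(text))

print(counts)
```

With a plain split on spaces, 'place,' and 'really?' would be counted as separate words; here they are folded into 'place' and 'really'.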