
How to calculate number of words in a string in DataFrame? [duplicate]

Suppose we have a simple DataFrame:

    import pandas as pd

    df = pd.DataFrame(['one apple', 'banana', 'box of oranges',
                       'pile of fruits outside', 'one banana', 'fruits'])
    df.columns = ['fruits']

How can I count how many entries contain a given number of words, producing output similar to this?

    1 word: 2
    2 words: 2
    3 words: 1
    4 words: 1
Sergei asked May 27 '16 12:05



2 Answers

IIUC then you can do the following:

    In [89]: count = df['fruits'].str.split().apply(len).value_counts()
             count.index = count.index.astype(str) + ' words:'
             count.sort_index(inplace=True)
             count

    Out[89]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64

Here we use the vectorised str.split to split each string on whitespace, and then apply len to count the resulting elements. Calling value_counts then aggregates how often each word count occurs.

We then rename the index and sort it to get the desired output.
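The steps described above can be broken out one at a time to make each intermediate result visible. This is only a sketch: the variable names tokens, words_per_row and freq are illustrative and not part of the original answer, and it uses str.len in place of apply(len), which gives the same counts.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving a Series of lists.
tokens = df['fruits'].str.split()

# Step 2: the length of each list is the number of words in that row.
words_per_row = tokens.str.len()

# Step 3: count how many rows share each word count, ordered by word count.
freq = words_per_row.value_counts().sort_index()

# Plain dict for easy inspection: {word count: number of rows}.
result = {int(k): int(v) for k, v in freq.items()}
print(result)
```

For the sample data this should give two one-word rows, two two-word rows, and one row each with three and four words.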

UPDATE

This can also be done using str.len rather than apply, which should scale better:

    In [41]: count = df['fruits'].str.split().str.len().value_counts()
             count.index = count.index.astype(str) + ' words:'
             count.sort_index(inplace=True)
             count

    Out[41]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64

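Timings

Note that the second statement omits value_counts(), so the two timings are not strictly like-for-like.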

    In [42]: %timeit df['fruits'].str.split().apply(len).value_counts()
             %timeit df['fruits'].str.split().str.len()

    1000 loops, best of 3: 799 µs per loop
    1000 loops, best of 3: 347 µs per loop

For a 6K df:

    In [51]: %timeit df['fruits'].str.split().apply(len).value_counts()
             %timeit df['fruits'].str.split().str.len()

    100 loops, best of 3: 6.3 ms per loop
    100 loops, best of 3: 6 ms per loop
EdChum answered Oct 07 '22 17:10


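You could use str.count with a space ' ' as the delimiter and add 1. This assumes that words are separated by single spaces, with no leading or trailing whitespace.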

    In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

    In [1717]: count.index = count.index.astype('str') + ' words:'

    In [1718]: count
    Out[1718]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
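Counting spaces matches counting words only when the text is cleanly separated by single spaces. The sketch below uses a small made-up Series (not from the question) to show where the two approaches can diverge. It also shows one possible alternative, counting runs of non-whitespace with the regular expression \S+, which str.count accepts because its pattern argument is a regex.

```python
import pandas as pd

# Hypothetical sample with irregular whitespace and an empty string.
s = pd.Series(['one apple', 'two  spaces', ' leading space', ''])

# Spaces plus one: over-counts on doubled or leading spaces and on ''.
by_spaces = s.str.count(' ').add(1)

# Whitespace split: ignores repeated and leading whitespace.
by_split = s.str.split().str.len()

# Regex alternative: count runs of non-whitespace characters.
by_regex = s.str.count(r'\S+')

print(pd.DataFrame({'text': s, 'spaces+1': by_spaces,
                    'split': by_split, 'regex': by_regex}))
```

On the question's data all three agree, so the choice only matters when the input may contain messy whitespace.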

Timings

str.count is marginally faster.

Small

    In [1724]: df.shape
    Out[1724]: (6, 1)

    In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1000 loops, best of 3: 649 µs per loop

    In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1000 loops, best of 3: 840 µs per loop

Medium

    In [1728]: df.shape
    Out[1728]: (6000, 1)

    In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    100 loops, best of 3: 6.58 ms per loop

    In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
    100 loops, best of 3: 6.99 ms per loop

Large

    In [1732]: df.shape
    Out[1732]: (60000, 1)

    In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1 loop, best of 3: 57.6 ms per loop

    In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1 loop, best of 3: 73.8 ms per loop
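To reproduce a comparison like this outside IPython, something along the following lines may work. It is a rough sketch using the standard timeit module. The way the larger DataFrame is built (repeating the six sample strings 1,000 times) is an assumption, since the answers do not say how their test frames were created, and absolute numbers will vary with hardware and pandas version.

```python
import timeit
import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']
# Assumed construction of a 6,000-row frame: repeat the sample 1,000 times.
df = pd.DataFrame({'fruits': base * 1000})

# Each candidate ends in value_counts() so the comparison is like-for-like.
candidates = {
    'split + apply(len)': lambda: df['fruits'].str.split().apply(len).value_counts(),
    'split + str.len': lambda: df['fruits'].str.split().str.len().value_counts(),
    "count(' ') + 1": lambda: df['fruits'].str.count(' ').add(1).value_counts(),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {timings[name] * 1e3:.2f} ms per loop')
```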
Zero answered Oct 07 '22 15:10