How to calculate number of words in a string in DataFrame? [duplicate]

Suppose we have a simple DataFrame:

import pandas as pd

df = pd.DataFrame(['one apple', 'banana', 'box of oranges', 'pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can I count how many keywords contain each number of words, giving output similar to:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

asked May 27 '16 12:05

Sergei

2 Answers

IIUC then you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split to split on whitespace and then apply len to count the elements in each resulting list. We can then call value_counts to aggregate the frequency of each count.

We then rename the index and sort it to get the desired output.
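The steps above can be sketched as a small helper. This is only an illustration: the function name word_count_distribution is hypothetical, and it differs from the snippet above in two ways. It sorts on the numeric index before building the labels (so that a label such as '10 words:' would not sort ahead of '2 words:'), and it uses the singular '1 word:' that the question asked for.

```python
import pandas as pd


def word_count_distribution(series: pd.Series) -> pd.Series:
    """Return how many entries contain each number of words (hypothetical helper)."""
    # str.split() with no argument splits on runs of whitespace;
    # apply(len) then counts the pieces in each resulting list.
    n_words = series.str.split().apply(len)
    # Tally each distinct word count and sort while the index is still numeric.
    counts = n_words.value_counts().sort_index()
    # Build the labels last, using the singular form for a count of one.
    counts.index = [f"{n} word:" if n == 1 else f"{n} words:" for n in counts.index]
    return counts


df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})
result = word_count_distribution(df['fruits'])
print(result)
```

Note that apply(len) assumes every entry is a string; missing values would need to be dropped or filled first.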

UPDATE

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Timings (the second statement omits value_counts, so the comparison is not strictly like-for-like):

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a 6K df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
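For anyone wanting to reproduce the comparison outside IPython, here is a rough sketch using the standard timeit module, with value_counts applied in both variants. The 6,000-row frame is built by repeating the six sample rows, which is an assumption, since the answer does not say how its 6K df was constructed. The function names are hypothetical, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']
# Assumed construction of the larger frame: 6 rows repeated 1000 times.
df = pd.DataFrame({'fruits': base * 1000})


def with_apply():
    # Python-level len called once per row via apply.
    return df['fruits'].str.split().apply(len).value_counts()


def with_str_len():
    # Same computation using the .str accessor for the length step.
    return df['fruits'].str.split().str.len().value_counts()


for fn in (with_apply, with_str_len):
    runs = 20
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```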

answered Oct 07 '22 17:10

EdChum

You could use str.count to count the spaces (' ') in each string and add 1 to get the number of words.

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
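One caveat worth checking: counting spaces only matches the word count when words are separated by exactly one space and there is no leading or trailing whitespace. The sketch below uses a made-up Series (messy is not from the question) to show where the two approaches diverge, along with a regex variant of str.count that counts runs of non-whitespace characters instead.

```python
import pandas as pd

# Hypothetical data with a double space, a leading space and a trailing space.
messy = pd.Series(['one  apple', ' banana', 'box of oranges '])

# Spaces + 1 overcounts whenever whitespace is irregular.
by_spaces = messy.str.count(' ').add(1)

# Splitting on runs of whitespace ignores the extra spaces.
by_split = messy.str.split().str.len()

# str.count takes a regular expression, so runs of non-whitespace
# characters can be counted directly.
by_regex = messy.str.count(r'\S+')

print(pd.DataFrame({'by_spaces': by_spaces,
                    'by_split': by_split,
                    'by_regex': by_regex}))
```

On clean data such as the frame in the question, all three give the same counts.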

Timings

str.count is marginally faster.

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop

answered Oct 07 '22 15:10

Zero
