Suppose we have a simple DataFrame:

    import pandas as pd

    df = pd.DataFrame(['one apple', 'banana', 'box of oranges', 'pile of fruits outside', 'one banana', 'fruits'])
    df.columns = ['fruits']

How can I calculate the number of words in each keyword and count how many keywords have each length, similar to:

    1 word: 2
    2 words: 2
    3 words: 1
    4 words: 1
To count the number of duplicate rows, use the DataFrame's duplicated() method and sum the resulting boolean Series.
To count the occurrences of each value in a DataFrame column, you can use the pandas value_counts() method. For example, df['condition'].value_counts() returns the frequency of each unique value in the column "condition".
Note that df.count() does not count words: it returns the number of non-null cells in each column. To get the total number of words in a column, split each string on whitespace and sum the resulting lengths.
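To make the difference between counting cells and counting words concrete, here is a minimal sketch using the question's data. The variable names `non_null_cells` and `total_words` are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# df.count() reports the number of non-null cells per column, not words.
non_null_cells = df.count()

# Total words: split every string on whitespace, take each list's length, and sum.
total_words = df['fruits'].str.split().str.len().sum()

print(non_null_cells['fruits'])  # 6
print(total_words)               # 13
```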
IIUC then you can do the following:
    In [89]: count = df['fruits'].str.split().apply(len).value_counts()
             count.index = count.index.astype(str) + ' words:'
             count.sort_index(inplace=True)
             count
    Out[89]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
Here we use the vectorised str.split to split on whitespace, and then apply len to get the number of elements in each row. We can then call value_counts to aggregate the frequency count.
We then rename the index and sort it to get the desired output.
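The intermediate results of that chain can be inspected one step at a time. The following is only an illustrative sketch; the names `tokens`, `lengths` and `freq` are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

tokens = df['fruits'].str.split()   # one list of words per row
lengths = tokens.apply(len)         # number of words in each row
freq = lengths.value_counts()       # how many rows have each word count

print(tokens.iloc[0])                # ['one', 'apple']
print(lengths.tolist())              # [2, 1, 3, 4, 2, 1]
print(freq.sort_index().to_dict())   # {1: 2, 2: 2, 3: 1, 4: 1}
```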
UPDATE
This can also be done using str.len rather than apply, which should scale better:
    In [41]: count = df['fruits'].str.split().str.len().value_counts()
             count.index = count.index.astype(str) + ' words:'
             count.sort_index(inplace=True)
             count
    Out[41]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
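One caveat when sorting after the index has been converted to strings: the labels sort lexicographically, so "10 words:" would land before "2 words:". A possible workaround, sketched here with a hypothetical helper named `word_count_distribution`, is to sort the integer index first and relabel afterwards.

```python
import pandas as pd

def word_count_distribution(series):
    """Hypothetical helper: count rows per word count, sorted numerically."""
    words_per_row = series.str.split().str.len()
    counts = words_per_row.value_counts().sort_index()  # sort while the index is still numeric
    counts.index = counts.index.astype(str) + ' words:'
    return counts

s = pd.Series(['a b', 'a b c d e f g h i j', 'x'])

# Relabel first, then sort: the string labels are ordered lexicographically.
naive = s.str.split().str.len().value_counts()
naive.index = naive.index.astype(str) + ' words:'
naive = naive.sort_index()

dist = word_count_distribution(s)

print(list(naive.index))  # ['1 words:', '10 words:', '2 words:']
print(list(dist.index))   # ['1 words:', '2 words:', '10 words:']
```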
Timings
Note that the second statement in each pair omits the value_counts() step, so it does slightly less work than the first.

    In [42]: %timeit df['fruits'].str.split().apply(len).value_counts()
             %timeit df['fruits'].str.split().str.len()
    1000 loops, best of 3: 799 µs per loop
    1000 loops, best of 3: 347 µs per loop

For a 6K df:

    In [51]: %timeit df['fruits'].str.split().apply(len).value_counts()
             %timeit df['fruits'].str.split().str.len()
    100 loops, best of 3: 6.3 ms per loop
    100 loops, best of 3: 6 ms per loop
You could use str.count with a space ' ' as the delimiter, adding 1 to turn the number of spaces into the number of words.
    In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

    In [1717]: count.index = count.index.astype('str') + ' words:'

    In [1718]: count
    Out[1718]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
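Counting spaces assumes that words are separated by exactly one space and that strings have no leading or trailing whitespace. The sketch below, using made-up input strings, shows where the two approaches diverge. The `\S+` variant is an extra alternative, not part of the original answer, that counts runs of non-whitespace characters.

```python
import pandas as pd

# Made-up inputs: a clean string, a double space, surrounding spaces, an empty string.
s = pd.Series(['one apple', 'one  apple', ' banana ', ''])

by_spaces = s.str.count(' ').add(1)   # number of spaces + 1
by_split = s.str.split().str.len()    # whitespace-aware split
by_regex = s.str.count(r'\S+')        # runs of non-whitespace characters

print(by_spaces.tolist())  # [2, 3, 3, 1]
print(by_split.tolist())   # [2, 2, 1, 0]
print(by_regex.tolist())   # [2, 2, 1, 0]
```

On clean data such as the question's DataFrame, all three give the same counts.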
Timings
str.count is marginally faster.
Small

    In [1724]: df.shape
    Out[1724]: (6, 1)

    In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1000 loops, best of 3: 649 µs per loop

    In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1000 loops, best of 3: 840 µs per loop

Medium

    In [1728]: df.shape
    Out[1728]: (6000, 1)

    In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    100 loops, best of 3: 6.58 ms per loop

    In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
    100 loops, best of 3: 6.99 ms per loop

Large

    In [1732]: df.shape
    Out[1732]: (60000, 1)

    In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1 loop, best of 3: 57.6 ms per loop

    In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1 loop, best of 3: 73.8 ms per loop