I have a pandas DataFrame with a column containing pipe-separated strings of movie genres. They looked like this:
Genre
Adventure|Animation|Children|Comedy|Fantasy
Comedy|Romance
...
I used str.split to turn each string back into a list in its cell, like this:
Genre
[Adventure, Animation, Children, Comedy, Fantasy]
[Adventure, Children, Fantasy]
[Comedy, Romance]
[Comedy, Drama, Romance]
[Comedy]
I want to get a count of each genre. For example, how many times did Comedy appear? How many times did Adventure, and so on? I can't seem to figure this out.
This would look like
Comedy 4
Adventure 2
Animation 1
(...and so on...)
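For reference, the setup can be reproduced with a small DataFrame (the column name Genre and the first two rows are taken from the question; the rest of the data is elided there):

```python
import pandas as pd

# Minimal reproduction of the question's sample (only the two rows shown)
df = pd.DataFrame({
    'Genre': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Comedy|Romance',
    ]
})

# str.split('|') turns each pipe-separated string into a list per cell
genres = df['Genre'].str.split('|').tolist()
print(genres)
```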
As somebody from the for-loop club, I recommend Python's C-accelerated routines, itertools.chain and collections.Counter, for performance.
import pandas as pd
from itertools import chain
from collections import Counter

pd.Series(
    Counter(chain.from_iterable(x.split('|') for x in df.Genre)))
Adventure 1
Animation 1
Children 1
Comedy 2
Fantasy 1
Romance 1
dtype: int64
Why do I think CPython functions are better than pandas' "vectorised" string functions? String operations are inherently hard to vectorise, so the pandas string methods end up looping anyway. You can read more at For loops with pandas - When should I care?.
If you have to deal with NaNs, you can call a function that handles exceptions gracefully:
def try_split(x):
    try:
        return x.split('|')
    except AttributeError:
        return []

pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
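A quick sketch showing that try_split skips missing values (the NaN row here is added for illustration, not part of the original data):

```python
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter

def try_split(x):
    try:
        return x.split('|')
    except AttributeError:  # NaN is a float and has no .split
        return []

# Hypothetical Genre column with a missing value
s = pd.Series(['Comedy|Romance', np.nan, 'Comedy'])
counts = pd.Series(Counter(chain.from_iterable(try_split(x) for x in s)))
print(counts)  # Comedy counted twice; the NaN row is silently skipped
```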
Pandaically, you would do this with split, stack, and value_counts.
df['Genre'].str.split('|', expand=True).stack().value_counts()
Comedy 2
Romance 1
Children 1
Animation 1
Fantasy 1
Adventure 1
dtype: int64
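Since pandas 0.25, Series.explode offers a tidier equivalent to the expand=True/stack dance; this sketch is not from the original answers, just an alternative spelling of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['Adventure|Animation|Children|Comedy|Fantasy',
                             'Comedy|Romance']})

# split into lists, explode to one genre per row, then count
counts = df['Genre'].str.split('|').explode().value_counts()
print(counts)
```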
The timing difference is obvious even for tiny DataFrames.
%timeit df['Genre'].str.get_dummies(sep='|').sum()
%timeit df['Genre'].str.split('|', expand=True).stack().value_counts()
%%timeit
pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
2.8 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
320 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm also in favor of using chain + for.
Just to document this, one more possible way is to use get_dummies:
df['Genre'].str.get_dummies(sep='|').sum()
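On the two-row sample from above, get_dummies one-hot encodes each genre into its own column, so the column sums give the counts:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['Adventure|Animation|Children|Comedy|Fantasy',
                             'Comedy|Romance']})

# one indicator column per genre, then sum each column
counts = df['Genre'].str.get_dummies(sep='|').sum()
print(counts)  # Comedy appears twice, every other genre once
```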