I have a column in a pandas dataframe which contains a large number of lists of labels:
>>> data['SPLIT'].head(10)
0 [33.23, 88.72, 38.93, E931.7, V09.0, 041.11, 5...
1 [99.04, 38.06, 39.57, 00.91, 55.69, V15.82, 27...
2 [96.04, 96.72, 401.9, 276.5, 584.9, 428.0, 507...
3 [96.6, 99.15, 99.83, V29.0, 765.15, 765.25, 77...
4 [96.71, 96.04, 54.12, 99.60, 38.93, 99.15, 53....
5 [88.72, 37.61, 39.61, 36.15, 36.12, 272.0, 401...
6 [38.93, 88.72, 37.31, 272.4, 719.46, 722.0, 31...
7 [88.72, 39.61, 35.71, 272.4, V12.59, 458.29, 7...
8 [97.44, 99.04, 88.56, 37.23, 39.95, 38.95, 00....
9 [00.14, 89.61, 39.95, E878.8, 244.9, 443.9, 18...
I am trying to iterate through all of these lists to count the total occurrences of each value, so that I can find the 50 most frequently occurring values.
This is the code I have used, which runs extremely slowly:
test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()
I also tried writing a function to loop through the lists and count the values, but it was just as slow.
Is there any way I can restructure this data, or use a pandas function with performance similar to df.groupby().count()?
I searched Google and Stack Overflow for half an hour, but none of the answers performed any better. I have been trying for a long time to find a faster way to flatten the lists or map the counts (I am iterating through 500k rows, and the list lengths vary, from as short as 2 to as long as 512).
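The sum(..., []) approach is slow because each + builds a brand-new list, so flattening becomes quadratic in the total number of elements. A nested list comprehension appends each element exactly once. A minimal sketch on made-up toy data (standing in for data['SPLIT']) showing both produce the same flat list:

```python
import pandas as pd

# Toy stand-in for data['SPLIT']: a column of variable-length label lists.
data = pd.DataFrame({"SPLIT": [["a", "b"], ["b", "c", "a"], ["a"]]})

# Quadratic: sum() repeatedly concatenates, copying the growing list each time.
slow = sum([item for item in data.SPLIT], [])

# Linear: a nested comprehension appends each element once.
fast = [x for item in data.SPLIT for x in item]

print(slow == fast)  # → True
```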
Use a list comprehension to flatten instead of sum:
test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()
Or flatten with chain.from_iterable:
from itertools import chain
test = pd.Series(list(chain.from_iterable(data.SPLIT))).value_counts()
Or use collections.Counter:
from itertools import chain
from collections import Counter
test = pd.Series(Counter(chain.from_iterable(data.SPLIT)))
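One caveat: unlike value_counts(), a Series built from a Counter is not sorted by frequency, so sort it before taking the top 50. A sketch on toy data (standing in for data['SPLIT']):

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Toy stand-in for data['SPLIT'].
data = pd.DataFrame({"SPLIT": [["a", "b"], ["b", "c", "a"], ["a"]]})

counts = pd.Series(Counter(chain.from_iterable(data.SPLIT)))

# Counter preserves insertion order, not frequency order, so sort
# descending before slicing off the most frequent entries.
top = counts.sort_values(ascending=False).head(50)
print(top)
```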
Or:
import functools, operator
test = pd.Series(functools.reduce(operator.iconcat, data.SPLIT, [])).value_counts()
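This reduce avoids sum's quadratic behavior because operator.iconcat extends the accumulator in place (like list.extend) rather than copying it at every step. A sketch on toy data:

```python
import functools
import operator

import pandas as pd

# Toy stand-in for data['SPLIT'].
data = pd.DataFrame({"SPLIT": [["a", "b"], ["b", "c", "a"], ["a"]]})

# iconcat(acc, item) extends acc in place and returns it, so one
# accumulator list grows linearly across all rows.
flat = functools.reduce(operator.iconcat, data.SPLIT, [])
print(flat)  # → ['a', 'b', 'b', 'c', 'a', 'a']
```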
Pure pandas solution:
test = pd.DataFrame(data.SPLIT.values.tolist()).stack().value_counts()
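Note that the tolist()/stack() route first builds a rectangular DataFrame padded with NaN (one column per element of the longest list, up to 512 here), so with 500k ragged rows it can use far more memory than the flattening approaches. On pandas 0.25 or newer, Series.explode is another pure-pandas option; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for data['SPLIT'].
data = pd.DataFrame({"SPLIT": [["a", "b"], ["b", "c", "a"], ["a"]]})

# explode() turns each list element into its own row; value_counts()
# then tallies and sorts them descending in one pass.
test = data["SPLIT"].explode().value_counts()
print(test.head(50))
```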