
Faster way to count total occurrences of values in a column of lists in pandas?

Tags: python, pandas

I have a column in a pandas dataframe which contains a large number of lists of labels:

>>> data['SPLIT'].head(10)
0    [33.23, 88.72, 38.93, E931.7, V09.0, 041.11, 5...
1    [99.04, 38.06, 39.57, 00.91, 55.69, V15.82, 27...
2    [96.04, 96.72, 401.9, 276.5, 584.9, 428.0, 507...
3    [96.6, 99.15, 99.83, V29.0, 765.15, 765.25, 77...
4    [96.71, 96.04, 54.12, 99.60, 38.93, 99.15, 53....
5    [88.72, 37.61, 39.61, 36.15, 36.12, 272.0, 401...
6    [38.93, 88.72, 37.31, 272.4, 719.46, 722.0, 31...
7    [88.72, 39.61, 35.71, 272.4, V12.59, 458.29, 7...
8    [97.44, 99.04, 88.56, 37.23, 39.95, 38.95, 00....
9    [00.14, 89.61, 39.95, E878.8, 244.9, 443.9, 18...

What I am trying to do is iterate through all of these lists to find the total occurrences of each value, so that I can find the 50 most frequently occurring values.

This is the code I have used which runs extremely slowly:

test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()

I also tried writing a separate function that loops through the lists and counts the values, but that was just as slow.

Is there any way I can modify this data or use a function in pandas that has similar performance to something like df.groupby.count()?

I searched Google and Stack Overflow for half an hour, but none of the answers performed better. I have been trying for a long time to find a faster way to flatten the lists or map the counts (the data has 500k rows, and the list lengths vary: some are as long as 512, others as short as 2).

Asked Jan 02 '23 by Ben C Wang
1 Answer

Use a list comprehension to flatten instead of sum, which copies the accumulated list on every concatenation and is therefore quadratic:

test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()
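Since value_counts already sorts in descending order, the 50 most frequent labels are just the head of the result. A minimal sketch with toy data standing in for the question's 500k-row `data['SPLIT']` column:

```python
import pandas as pd

# Toy stand-in for data['SPLIT']; the real column holds ~500k lists of labels
data = pd.DataFrame({'SPLIT': [['a', 'b', 'a'], ['b', 'c'], ['a']]})

# Flatten with a list comprehension, then count
test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()

# value_counts sorts counts in descending order, so the top 50 is simply:
top50 = test.head(50)
```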

Or flatten with itertools.chain.from_iterable:

from itertools import chain

test = pd.Series(list(chain.from_iterable(data.SPLIT))).value_counts()

Or use collections.Counter:

from itertools import chain
from collections import Counter

test = pd.Series(Counter(chain.from_iterable(data.SPLIT)))
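Note that the Counter variant is not sorted by count, unlike value_counts. For the top 50, either call Counter.most_common(50) directly or sort the resulting Series. A sketch with toy data (the small frame below is an assumption, not the question's real data):

```python
from collections import Counter
from itertools import chain

import pandas as pd

data = pd.DataFrame({'SPLIT': [['a', 'b', 'a'], ['b', 'c'], ['a']]})

counts = Counter(chain.from_iterable(data.SPLIT))

# Option 1: stay in the standard library - list of (label, count) pairs,
# most frequent first
top50_pairs = counts.most_common(50)

# Option 2: keep it in pandas and sort the Series by count
test = pd.Series(counts).sort_values(ascending=False)
```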

Or reduce with operator.iconcat, which extends the accumulator in place:

import functools, operator

test = pd.Series(functools.reduce(operator.iconcat, data.SPLIT, [])).value_counts()

A pure pandas solution:

test = pd.DataFrame(data.SPLIT.values.tolist()).stack().value_counts()
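Because the lists have different lengths, pd.DataFrame pads the shorter rows with None/NaN to make a rectangular frame; stack() then drops those NaN values by default, so the counts stay correct. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'SPLIT': [['a', 'b', 'a'], ['b']]})

# Ragged rows are padded with None/NaN to a rectangular shape
wide = pd.DataFrame(data.SPLIT.values.tolist())

# stack() drops the NaN padding by default, so only real labels are counted
test = wide.stack().value_counts()
```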
Answered Jan 08 '23 by jezrael