I have this Pandas DataFrame that has a column with lists:
>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]
I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example; in my actual data df['m'] can contain more than 9 million rows, and the lists are actually lists of strings with up to about 20 elements. Some more details, if it matters: the elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.
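If it simplifies things, that conversion is a one-liner, e.g. df['m'] = df['m'].apply(set) (I haven't done that in the examples below, and it assumes nothing else depends on the element order).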
The first iteration of my program iterated over each row, checked all(e in data['m'][i] for e in v), and incremented a counter if that was True. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and the same thing can usually be done much faster.
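That first version looked roughly like this (reconstructed; data here is the same DataFrame as df above):

count = 0
for i in range(len(data)):
    if all(e in data['m'][i] for e in v):
        count += 1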
So for my next iteration I added a column to the DataFrame that contains a copy of the list v:
>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]
and a helper function that simply returns the containment boolean like I did before:
def all_helper(l1, l2):
    return all(v in l1 for v in l2)
which I can then use with np.vectorize to add a column with the boolean value:
df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])
And lastly, I calculate the sum of these booleans with a simple df['bool'].sum().
I also tried to use .apply():
df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()
but this was slower than the vectorisation.
Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column of identical values, using a helper function in such a way). So my question (performance is key): is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps a row-wise check like len(union(df['m'], df['V'])) == len(df['m']) could work, but I don't know how to do that, or whether it's the best solution.
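To make that last idea concrete, a row-wise version of the union check would look something like this (just a sketch, not vectorised); because the lists have no duplicates, the length test is equivalent to a plain subset test:

v_set = set(v)
# len(set(row) | v_set) == len(row) holds exactly when every element of v is already in row
df['bool'] = [len(set(row) | v_set) == len(row) for row in df['m']]
count = df['bool'].sum()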
Edit: Since somebody asked, here's an example with strings instead of integers:
>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]
>>> count_occurrence(df, v)
3
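count_occurrence is just a wrapper around whichever counting approach is being tested; a minimal stand-in, so the snippets here are runnable, would be:

def count_occurrence(df, v):
    # stand-in: count rows of df['m'] that contain every element of v
    return sum(all(e in row for e in v) for row in df['m'])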
But if you want a more extensive DataFrame, you can generate it with this:
import string

n = 10000
df = pd.DataFrame({'m': [
    list(set(''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4)))
             for _ in range(np.random.randint(1, 10))))
    for _ in range(n)
]})
v = ["abc", "cde"]
print(count_occurrence(df, v))
Edit: Neither Divakar's nor Vaishali's solution was faster than the one that uses np.vectorize. I wonder if anyone can beat it.
Jon Clements came up with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I'll keep looking for faster implementations, but this is a step in the right direction.
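If you want to reproduce the comparison on the DataFrame generated above, something along these lines works (rough sketch; it assumes all_helper and v from above are in scope, and the exact numbers will vary with the data):

import timeit
df['V'] = [v] * len(df)
t_vectorize = timeit.timeit(lambda: np.vectorize(all_helper)(df['m'], df['V']).sum(), number=10)
t_issubset = timeit.timeit(lambda: df.m.apply(set(v).issubset).sum(), number=10)
print(t_vectorize, t_issubset)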
You can utilise DataFrame.apply along with the builtin set.issubset method and then .sum(), which all operate at a lower level (normally C level) than the Python equivalents do.
subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()
I can't see shaving much more time off that, short of writing a custom C-level function that would be the equivalent of a custom sum with a subset check to determine 0/1 on a row-by-row basis. At which point, you could have run this thousands upon thousands of times anyway.
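If you really want to chase it further, one more thing worth trying (no guarantee it's faster, that depends on your data) is skipping the intermediate Series and summing a plain generator:

subset_wanted = {2, 3}
count = sum(map(subset_wanted.issubset, df.m))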