Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas counting occurrence of list contained in column of lists

I have this Pandas DataFrame that has a column with lists:

>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]

I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example, in my actual data the df['m'] can contain more than 9 million rows and the lists are actually lists of strings with up to about 20 elements. Some more details if it matters: The elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.

The first iteration of my program iterated over each row and checked all(e in data['m'][i] for e in v) and if that's True, I increment a counter. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and can be done much faster.

So for my next iteration I added a column to the DataFrame that contains a copy of the list v:

>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]

and a helper function that simply returns the containment boolean like I did before:

def all_helper(l1, l2):
    return all(v in l1 for v in l2)

which I can then use with np.vectorize to add a column with the boolean value:

df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])

And lastly, calculate the sum of these booleans with a simple df['bool'].sum()

I also tried to use .apply():

df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()

but this was slower than the vectorisation.

Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column with identical values, using a helper function in such a way). So my question, performance is key, is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps the check if len(union(df['m'], df['V'])) == len(df['m']) or something, but I don't know how and if that's the best solution.

Edit: Since somebody asked; here's an example with strings instead of integers:

>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]

>>> count_occurrence(df, v)
3

But if you want a more extensive DataFrame, you can generate it with this:

import string

n = 10000
df = pd.DataFrame({'m': [list(set([''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4))) for _ in range(np.random.randint(1, 10))])) for _ in range(n)]})
v = ["abc", 'cde']
print(count_occurrence(df, v))

Edit: Neither Divakar's or Vaishali's solution was faster than the one that uses np.vectorize. Wonder if anyone can beat it.

Jon Clements came with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I continue looking for faster implementations, but this is a step in the right direction.

like image 994
Jurgy Avatar asked Nov 21 '17 16:11

Jurgy


People also ask

How do you count occurrences of values in a column in pandas?

Using the size() or count() method with pandas. DataFrame. groupby() will generate the count of a number of occurrences of data present in a particular column of the dataframe.

What is difference between count () and Value_counts?

count() should be used when you want to find the frequency of valid values present in columns with respect to specified col . . value_counts() should be used to find the frequencies of a series.

What pandas function returns a series with the counts of each unique value in a column?

The value_counts() method returns a Series containing the counts of unique values. This means, for any column in a dataframe, this method returns the count of unique entries in that column.

How do you count unique occurrences in pandas?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.


1 Answers

You can utilise DataFrame.apply along with the builtin set.issubset method and then .sum() which all operate at a lower level (normally C level) than Python equivalents do.

subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()

I can't see shaving more time off that than writing a custom C-level function which'd be the equivalent of a custom sum with a check there's a subset to determine 0/1 on a row by row basis. At which point, you could have run this thousands upon thousands of times anyway.

like image 129
Jon Clements Avatar answered Oct 17 '22 15:10

Jon Clements