I have this Pandas DataFrame that has a column with lists:
>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]
I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example; in my actual data df['m'] can contain more than 9 million rows, and the lists are actually lists of strings with up to about 20 elements. Some more details, if it matters: the elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.
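If it simplifies things, that conversion is a one-liner, e.g. df['m'] = df['m'].apply(set) (I haven't done that in the examples below, and it assumes nothing else depends on the element order).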
The first iteration of my program iterated over each row, checked all(e in data['m'][i] for e in v), and incremented a counter if that was True. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and the same thing can usually be done much faster.
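That first version looked roughly like this (reconstructed; data here is the same DataFrame as df above):

count = 0
for i in range(len(data)):
    if all(e in data['m'][i] for e in v):
        count += 1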
So for my next iteration I added a column to the DataFrame that contains a copy of the list v:
>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]
and a helper function that simply returns the containment boolean like I did before:
def all_helper(l1, l2):
    return all(v in l1 for v in l2)
which I can then use with np.vectorize to add a column with the boolean value:
df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])
And lastly, I calculate the sum of these booleans with a simple df['bool'].sum().
I also tried to use .apply():
df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()
but this was slower than the vectorisation.
Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column of identical values, using a helper function in such a way). So my question (performance is key): is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps a row-wise check like len(union(df['m'], df['V'])) == len(df['m']) could work, but I don't know how to do that, or whether it's the best solution.
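To make that last idea concrete, a row-wise version of the union check would look something like this (just a sketch, not vectorised); because the lists have no duplicates, the length test is equivalent to a plain subset test:

v_set = set(v)
# len(set(row) | v_set) == len(row) holds exactly when every element of v is already in row
df['bool'] = [len(set(row) | v_set) == len(row) for row in df['m']]
count = df['bool'].sum()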
Edit: Since somebody asked, here's an example with strings instead of integers:
>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]
>>> count_occurrence(df, v)
3
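count_occurrence is just a wrapper around whichever counting approach is being tested; a minimal stand-in, so the snippets here are runnable, would be:

def count_occurrence(df, v):
    # stand-in: count rows of df['m'] that contain every element of v
    return sum(all(e in row for e in v) for row in df['m'])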
But if you want a more extensive DataFrame, you can generate it with this:
import string

n = 10000
df = pd.DataFrame({'m': [
    list(set(''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4)))
             for _ in range(np.random.randint(1, 10))))
    for _ in range(n)
]})
v = ["abc", "cde"]
print(count_occurrence(df, v))
Edit: Neither Divakar's nor Vaishali's solution was faster than the one that uses np.vectorize. I wonder if anyone can beat it.
Jon Clements came up with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I'll keep looking for faster implementations, but this is a step in the right direction.
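If you want to reproduce the comparison on the DataFrame generated above, something along these lines works (rough sketch; it assumes all_helper and v from above are in scope, and the exact numbers will vary with the data):

import timeit
df['V'] = [v] * len(df)
t_vectorize = timeit.timeit(lambda: np.vectorize(all_helper)(df['m'], df['V']).sum(), number=10)
t_issubset = timeit.timeit(lambda: df.m.apply(set(v).issubset).sum(), number=10)
print(t_vectorize, t_issubset)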
You can utilise DataFrame.apply along with the builtin set.issubset method and then .sum(), which all operate at a lower level (normally C level) than the Python equivalents do.
subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()
I can't see shaving much more time off that, short of writing a custom C-level function that would be the equivalent of a custom sum with a subset check to determine 0/1 on a row-by-row basis. At which point, you could have run this thousands upon thousands of times anyway.
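If you really want to chase it further, one more thing worth trying (no guarantee it's faster, that depends on your data) is skipping the intermediate Series and summing a plain generator:

subset_wanted = {2, 3}
count = sum(map(subset_wanted.issubset, df.m))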