Vectorized way to count occurrences of string in either of two columns

Tags: python, string, pandas, dataframe, numpy

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes, df1 and df2, like this:

import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 20), 'ID_b': np.random.choice(names, 20)})
df2 = pd.DataFrame({'ID':names})

>>> df1
        ID_a      ID_b
0        joe       ben
1        ben      jack
2       jane       joe
3        ben      jill
4        ben  beatrice
5       jill       ben
6       jane       joe
7       jane      jack
8       jane      jack
9        ben      jane
10       joe      jane
11      jane      jill
12  beatrice       joe
13       ben       joe
14      jill  beatrice
15       joe  beatrice
16  beatrice  beatrice
17  beatrice      jane
18      jill       joe
19       joe       joe

>>> df2
         ID
0      jack
1      jill
2      jane
3       joe
4       ben
5  beatrice

What I'd like to do is add a column to df2 with the number of rows in df1 where the given name appears in either column ID_a or ID_b (a row where it appears in both counts once), resulting in this:

>>> df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

This loop gets what I need, but it is inefficient for large dataframes. If someone could suggest a nicer alternative, I'd be very grateful:

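# Start every count at zero, then fill in one name at a time.
df2['count'] = 0

for idx, row in df2.iterrows():
    # Rows of df1 where the name appears in ID_a or ID_b (or both).
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])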

Thanks in advance!

asked Mar 21 '18 17:03

sacuL

1 Answer

The "either" part complicates things, but should still be doable.
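To make the complication concrete: counting every cell across both columns counts a row twice when the same name sits in both columns (rows 16 and 19 in df1). Here is a minimal sketch on a hypothetical three-row frame called toy, which is not part of the question's data:

```python
import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'joe'],
                    'ID_b': ['ben', 'jack', 'joe']})

# Naive count: every cell counts, so row 2 contributes twice to 'joe'.
naive = pd.concat([toy.ID_a, toy.ID_b]).value_counts()

# Row-wise count: a row counts at most once, however often the name appears.
per_row = toy.eq('joe').any(axis=1).sum()

print(naive['joe'])  # 3 cells contain 'joe'
print(per_row)       # 2 rows contain 'joe'
```

Every option below is a different way of removing that within-row repeat before counting.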

Option 1
Since other users decided to turn this into a speed-race, here's mine:

from collections import Counter
from itertools import chain

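# Turn each row into a set so a name appearing in both columns counts once.
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
# Counter returns 0 for missing keys, so unmatched IDs map to 0 rather than NaN.
df2['count'] = df2['ID'].map(c)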
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
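If it helps to see what Option 1 does step by step, here is a rough sketch on a hypothetical toy frame (the names toy, row_sets and ids are made up for illustration). It shows the per-row sets and what happens to a name that never occurs:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'joe'],
                    'ID_b': ['ben', 'jack', 'joe']})

# One set per row, so a name repeated within a row is kept once.
row_sets = [set(row) for row in toy.values.tolist()]

# Flatten the sets and count how many rows each name appeared in.
c = Counter(chain.from_iterable(row_sets))

# 'jill' never occurs in toy; Counter supplies 0 for it instead of NaN.
ids = pd.Series(['joe', 'ben', 'jack', 'jill'])
counts = ids.map(c)
print(counts.tolist())  # [2, 2, 1, 0]
```

Mapping with a plain dict or a Series would give NaN for the missing name, which is one practical difference from the other options.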

Option 2
(Original answer) stack-based

# Series.count(level=1) was removed in pandas 2.0; group by the level instead.
c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()

Or,

c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()

Or,

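v = df1.stack()
# Group by (row label, name), then count how many rows each name occurs in.
# count(level=1) was removed in pandas 2.0, hence the second groupby.
c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
# c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()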

And,

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
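For anyone unsure what the stack-based variants are doing, here is a small sketch of the drop_duplicates version on a hypothetical toy frame. The stacked Series is renamed to 'name' purely for readability; that rename and the variable names are my own additions:

```python
import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'joe'],
                    'ID_b': ['ben', 'jack', 'joe']})

# stack() yields one entry per cell, indexed by (row label, column name).
stacked = toy.stack().rename('name')

# Move the row label into a column, then drop repeated (row, name) pairs.
pairs = stacked.reset_index(level=0).drop_duplicates()

# What is left has one entry per (row, name), so counting names counts rows.
c = pairs['name'].value_counts()
print(len(stacked), len(pairs))          # 6 5
print(c['joe'], c['ben'], c['jack'])     # 2 2 1
```

The groupby variants reach the same result by grouping on the row label rather than dropping duplicates.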

Option 3
repeat-based reshape and counting

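# 'i' holds every cell in row-major order; 'j' holds the row label it came from.
v = pd.DataFrame({
    'i': df1.values.reshape(-1),
    'j': df1.index.repeat(2),
})
# Drop repeated (name, row) pairs, then count the remaining names.
c = v.loc[~v.duplicated(), 'i'].value_counts()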

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
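The intermediate frame in Option 3 is easier to follow on a tiny input. Here is a sketch on a hypothetical toy frame (the names are made up), showing how each flattened cell lines up with its row label:

```python
import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'joe'],
                    'ID_b': ['ben', 'jack', 'joe']})

# Flatten row by row and repeat each row label once per column (2 here).
v = pd.DataFrame({
    'i': toy.to_numpy().reshape(-1),
    'j': toy.index.repeat(2),
})
print(v['i'].tolist())  # ['joe', 'ben', 'ben', 'jack', 'joe', 'joe']
print(v['j'].tolist())  # [0, 0, 1, 1, 2, 2]

# Only the second 'joe' of row 2 is a repeated (name, row) pair.
dupes = v.duplicated()
c = v.loc[~dupes, 'i'].value_counts()
print(c['joe'], c['ben'], c['jack'])  # 2 2 1
```

The repeat count of 2 is tied to the number of columns, so a wider frame would need repeat(toy.shape[1]) instead.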

Option 4
concat + mask

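# Blank out ID_b wherever it repeats ID_a, so such rows are counted once;
# value_counts skips the resulting NaN values by default.
v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()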

df2['count'] = df2.ID.map(v)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
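To see what the mask step in Option 4 does, here is a short sketch on a hypothetical toy frame (the variable names are mine):

```python
import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'joe'],
                    'ID_b': ['ben', 'jack', 'joe']})

# True where both columns hold the same name.
same = toy.ID_a == toy.ID_b

# mask() replaces ID_b with NaN on those rows and leaves the rest alone.
masked_b = toy.ID_b.mask(same)
print(masked_b.isna().tolist())  # [False, False, True]

# value_counts() ignores NaN by default, so row 2 counts once for 'joe'.
v = pd.concat([toy.ID_a, masked_b]).value_counts()
print(v['joe'], v['ben'], v['jack'])  # 2 2 1
```

This trick relies on there being exactly two columns to compare; with more columns, one of the other options is probably a better fit.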

answered Sep 29 '22 13:09

cs95
