
Vectorized way to count occurrences of string in either of two columns

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes, df1 and df2, like this:

import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 20), 'ID_b':np.random.choice(names,20)})    
df2 = pd.DataFrame({'ID':names})

>>> df1
        ID_a      ID_b
0        joe       ben
1        ben      jack
2       jane       joe
3        ben      jill
4        ben  beatrice
5       jill       ben
6       jane       joe
7       jane      jack
8       jane      jack
9        ben      jane
10       joe      jane
11      jane      jill
12  beatrice       joe
13       ben       joe
14      jill  beatrice
15       joe  beatrice
16  beatrice  beatrice
17  beatrice      jane
18      jill       joe
19       joe       joe

>>> df2
         ID
0      jack
1      jill
2      jane
3       joe
4       ben
5  beatrice

What I'd like to do is add a column to df2 with the count of rows in df1 where the given name appears in either column ID_a or ID_b, resulting in this:

>>> df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

This loop gets what I need, but it is inefficient for large dataframes. If someone could suggest a nicer, vectorized alternative, I'd be very grateful:

df2['count'] = 0

for idx,row in df2.iterrows():
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])

Thanks in advance!

sacuL asked Mar 21 '18 17:03




1 Answer

The "either" part complicates things, but it is still doable: a row that holds the same name in both columns (such as row 19, joe/joe) must be counted only once.


Option 1
Since other users decided to turn this into a speed-race, here's mine:

from collections import Counter
from itertools import chain

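# set(x) keeps each row's distinct names, so a joe/joe row counts once.
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
# c is already a Counter, so it can be passed to map directly.
df2['count'] = df2['ID'].map(c)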
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
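
To see why Option 1 wraps each row in set(), here is a minimal sketch using only the standard library. The three-row list named rows is made up for illustration and is not the df1 from the question; it stands in for what df1.values.tolist() would return.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df1.values.tolist(): one [ID_a, ID_b] pair per row.
rows = [['joe', 'ben'], ['ben', 'jack'], ['joe', 'joe']]

# With set(): each row contributes each of its distinct names once.
counts = Counter(chain.from_iterable(set(row) for row in rows))

# Without set(): the joe/joe row contributes 'joe' twice.
naive = Counter(chain.from_iterable(rows))

print(counts['joe'], naive['joe'])
```

Here counts['joe'] is 2 (rows 0 and 2), while the naive version reports 3 because it counts cells rather than rows.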

Option 2
(Original answer) stack based

# Series.count(level=...) was removed in pandas 2.0; group by the level instead.
c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()

Or,

c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()

Or,

v = df1.stack()
# groupby(level=1).count() replaces count(level=1), which pandas 2.0 removed.
c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
# c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()

And,

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

Option 3
repeat-based reshape and counting

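v = pd.DataFrame({
    'i': df1.values.reshape(-1),  # names, flattened row by row
    'j': df1.index.repeat(2),     # row label of each flattened name
})
# Drop repeated (name, row) pairs so each row counts a name at most once.
c = v.loc[~v.duplicated(), 'i'].value_counts()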

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

Option 4
concat + mask

v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()

df2['count'] = df2.ID.map(v)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
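
The mask step in Option 4 is terse, so here is a small sketch of what it does. The three-row frame named toy is made up for illustration and is not the df1 from the question.

```python
import pandas as pd

# Hypothetical toy frame; row 2 has the same name in both columns.
toy = pd.DataFrame({
    'ID_a': ['joe', 'ben', 'joe'],
    'ID_b': ['ben', 'jack', 'joe'],
})

# mask() replaces ID_b with NaN wherever it equals ID_a in the same row.
masked_b = toy.ID_b.mask(toy.ID_a == toy.ID_b)

# value_counts() skips NaN by default, so the joe/joe row adds one 'joe'.
counts = pd.concat([toy.ID_a, masked_b], axis=0).value_counts()

print(counts)
```

One difference from Option 1: mapping with a value_counts() Series gives NaN for any ID that never appears in df1, whereas a Counter gives 0. If that case can occur, something like df2.ID.map(v).fillna(0).astype(int) should cover it.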

cs95 answered Sep 29 '22 13:09