I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...
I've got two dataframes, df1
and df2
, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 20), 'ID_b':np.random.choice(names,20)})
df2 = pd.DataFrame({'ID':names})
>>> df1
ID_a ID_b
0 joe ben
1 ben jack
2 jane joe
3 ben jill
4 ben beatrice
5 jill ben
6 jane joe
7 jane jack
8 jane jack
9 ben jane
10 joe jane
11 jane jill
12 beatrice joe
13 ben joe
14 jill beatrice
15 joe beatrice
16 beatrice beatrice
17 beatrice jane
18 jill joe
19 joe joe
>>> df2
ID
0 jack
1 jill
2 jane
3 joe
4 ben
5 beatrice
What I'd like to do is add in a column to df2
, with the count of rows in df1
where the given name can be found in either column ID_a
or ID_b
, resulting in this:
>>> df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
This loop gets what I need, but is inefficient for large dataframes, and if someone could suggest an alternative, nicer solution, I'd be very grateful:
df2['count'] = 0
for idx,row in df2.iterrows():
df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
Thanks in advance!
We can count by using the value_counts() method. This function is used to count the values present in the entire dataframe and also count values in a particular column.
To count the number of occurrences in e.g. a column in a dataframe you can use Pandas value_counts() method. For example, if you type df['condition']. value_counts() you will get the frequency of each unique value in the column “condition”.
One of the built-in ways in which you can use Python to count the number of occurrences in a string is using the built-in string . count() method. The method takes one argument, either a character or a substring, and returns the number of times that character exists in the string associated with the method.
The count() method returns the number of times a specified value appears in the string.
The "either" part complicates things, but should still be doable.
Option 1
Since other users decided to turn this into a speed-race, here's mine:
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(Counter(c))
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Option 2
(Original answer) stack
based
c = df1.stack().groupby(level=0).value_counts().count(level=1)
Or,
c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
Or,
v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
# c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)
And,
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
Option 3repeat
-based Reshape and counting
v = pd.DataFrame({
'i' : df1.values.reshape(-1, ),
'j' : df1.index.repeat(2)
})
c = v.loc[~v.duplicated(), 'i'].value_counts()
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
Option 4concat
+ mask
v = pd.concat(
[df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()
df2['count'] = df2.ID.map(v)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With