I have a (really big) pandas Dataframe df:
country age gender
Brazil 10 F
USA 20 F
Brazil 10 F
USA 20 M
Brazil 10 M
USA 20 M
I have another pandas Dataframe freq:
age gender counting
10 F 0
10 M 0
20 F 0
I want to count how often each pair of values in freq occurs in df:
age gender counting
10 F 2
10 M 1
20 F 1
I'm using this code, but it takes too long:
for row in df.itertuples(index=False):
    freq.loc[np.all(freq[['age', 'gender']] == row[1:3], axis=1), 'counting'] += 1
Is there a faster way to do that?
Please note:
Using the size() or count() method with pandas.DataFrame.groupby() generates the count of the number of occurrences of the data present in a particular column of the dataframe.
We can also count by using the value_counts() method. This function counts the values present in the entire dataframe and can also count values in a particular column.
The value_counts() method returns a Series containing the counts of unique values; that is, for any column of a dataframe, it returns how many times each unique entry appears in that column.
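For illustration, a minimal sketch of both approaches on a dataframe shaped like the one above (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'country': ['Brazil', 'USA', 'Brazil', 'USA', 'Brazil', 'USA'],
    'age':     [10, 20, 10, 20, 10, 20],
    'gender':  ['F', 'F', 'F', 'M', 'M', 'M'],
})

# Number of rows for each (age, gender) combination
print(df.groupby(['age', 'gender']).size())

# Counts of each unique value in a single column
print(df['gender'].value_counts())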
You can do it with an inner merge to filter out the combinations in df you don't want, then groupby age and gender and count the column counting. Just reset_index to fit your expected output.
freq = (df.merge(freq, on=['age', 'gender'], how='inner')
.groupby(['age','gender'])['counting'].size()
.reset_index())
print(freq)
age gender counting
0 10 F 2
1 10 M 1
2 20 F 1
Depending on the number of combinations you don't want, it could be faster to groupby on df before doing the merge, like:
freq = (df.groupby(['age','gender']).size()
.rename('counting').reset_index()
.merge(freq[['age','gender']])
)
Bringing NumPy into the mix for some performance (hopefully!), with the idea of dimensionality-reduction to 1D so that we can bring in the efficient bincount -
import numpy as np
import pandas as pd

# Stack the key columns from df and freq so both get consistent codes
agec = np.r_[df.age,freq.age]
genderc = np.r_[df.gender,freq.gender]
# Encode each key column as integer IDs
aIDs,aU = pd.factorize(agec)
gIDs,gU = pd.factorize(genderc)
# Collapse each (age, gender) ID pair into a single 1D code
cIDs = aIDs*(gIDs.max()+1) + gIDs
# Count the codes coming from df's rows, then look up the counts for freq's rows
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]
Sample run -
In [44]: df
Out[44]:
country age gender
0 Brazil 10 F
1 USA 20 F
2 Brazil 10 F
3 USA 20 M
4 Brazil 10 M
5 USA 20 M
In [45]: freq # introduced a missing element as the second row for variety
Out[45]:
age gender counting
0 10 F 2
1 23 M 0
2 20 F 1
Specific scenario optimization #1
If the age column is known to contain only integers, we can skip one factorize. So, skip aIDs,aU = pd.factorize(agec) and compute cIDs instead with -
cIDs = agec*(gIDs.max()+1) + gIDs
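Putting that together, a minimal sketch of this variant (the assumption being that age holds small non-negative integers, since the raw ages are used directly as bincount codes):
import numpy as np
import pandas as pd

agec = np.r_[df.age, freq.age]
genderc = np.r_[df.gender, freq.gender]
# Only gender needs factorizing; the integer ages act as codes directly
gIDs, gU = pd.factorize(genderc)
cIDs = agec*(gIDs.max()+1) + gIDs
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]
The trade-off is that bincount now allocates an array of roughly max(age) * number-of-genders entries, so this only pays off while the ages stay small.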
Another way is to use reindex to filter down to the freq list:
df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
Output:
country
gender age
F 10 2
M 10 1
F 20 1
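If you then want the counts back inside freq itself, a minimal sketch (the fillna(0) is my assumption for pairs listed in freq that never appear in df):
counts = (df.groupby(['gender', 'age'])['country'].count()
            .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
            .fillna(0)
            .astype(int))
freq['counting'] = counts.to_numpy()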