
How to count the occurrence of values in one pandas Dataframe if the values to count are in another (in a faster way)?

I have a (really big) pandas Dataframe df:

country  age  gender
Brazil    10     F
USA       20     F 
Brazil    10     F
USA       20     M
Brazil    10     M
USA       20     M

I have another pandas Dataframe freq:

 age  gender  counting
  10       F         0
  10       M         0
  20       F         0

I want to count how many times each pair of values in freq occurs in df:

 age  gender  counting
  10       F         2
  10       M         1
  20       F         1

I'm using this code, but it takes too long:

# slow: scans all of freq once for every row of df
for row in df.itertuples(index=False):
   freq.loc[np.all(freq[['age','gender']]==row[1:3],axis=1),'counting'] += 1

Is there a faster way to do that?

Please note:

  • I have to use freq because not all combinations (for instance, 20 and M) are desired
  • some columns in df may not be used
  • counting is the number of rows in df where both values appear together
  • freq may have more than 2 values to check for (this is just a small example)

Luiz Fernando Puttow Southier asked Jun 04 '20 18:06



3 Answers

You can do it with an inner merge, which filters out the combinations in df that you don't want, then group by age and gender and take the size of each group as the counting column. Finally, call reset_index to match your expected output.

freq = (df.merge(freq, on=['age', 'gender'], how='inner')
          .groupby(['age','gender'])['counting'].size()
          .reset_index())
print (freq)
   age gender  counting
0   10      F         2
1   10      M         1
2   20      F         1
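For reference, here is a self-contained sketch of the same inner-merge idea on the sample data from the question. The `keys` list is an assumption of mine: it treats every column of freq except counting as a key, so the same code would carry over if freq had more than two columns to match.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({
    "age": [10, 10, 20],
    "gender": ["F", "M", "F"],
    "counting": [0, 0, 0],
})

# Assumption: every freq column other than the counter is a key column.
keys = [c for c in freq.columns if c != "counting"]

result = (
    df.merge(freq[keys], on=keys, how="inner")  # drop unwanted combinations
      .groupby(keys)
      .size()                                   # number of rows per combination
      .rename("counting")
      .reset_index()
)
print(result)
```

Note that with an inner merge, a combination listed in freq that never occurs in df will not appear in the result at all, rather than showing up with a count of 0.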

Depending on the number of combinations you don't want, it could be faster to groupby on df before doing the merge like:

freq = (df.groupby(['age','gender']).size()
          .rename('counting').reset_index()
          .merge(freq[['age','gender']])
       )
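If combinations that never occur in df should stay in the result with a count of 0, one possible variation (not part of the answer as written) is to start the merge from freq with how='left' and fill the gaps. The (23, M) row below is a made-up combination added only to show that case.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
# Hypothetical freq: (23, M) never occurs in df.
freq = pd.DataFrame({
    "age": [10, 23, 20],
    "gender": ["F", "M", "F"],
    "counting": [0, 0, 0],
})

keys = ["age", "gender"]

# Aggregate df once, so the merge only sees one row per combination.
counts = df.groupby(keys).size().rename("counting").reset_index()

# A left merge from freq keeps every requested combination, in freq's order.
result = freq[keys].merge(counts, on=keys, how="left")
result["counting"] = result["counting"].fillna(0).astype(int)
print(result)
```

The grouped table is usually much smaller than df, which is why grouping first can make the merge cheaper.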

Ben.T answered Oct 23 '22 17:10


Bringing NumPy into the mix for some performance (hopefully!): the idea is to reduce each (age, gender) pair to a single 1D integer ID, so that we can bring in the efficient np.bincount -

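# stack df and freq so that both share one set of integer codes
agec = np.r_[df.age,freq.age]
genderc = np.r_[df.gender,freq.gender]
aIDs,aU = pd.factorize(agec)
gIDs,gU = pd.factorize(genderc)
# fold each (age, gender) code pair into a single 1D ID
cIDs = aIDs*(gIDs.max()+1) + gIDs
# count the IDs over the df rows only, then look up the freq rows
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]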

Sample run -

In [44]: df
Out[44]: 
  country  age gender
0  Brazil   10      F
1     USA   20      F
2  Brazil   10      F
3     USA   20      M
4  Brazil   10      M
5     USA   20      M

In [45]: freq # introduced a missing element as the second row for variety
Out[45]: 
   age gender  counting
0   10      F         2
1   23      M         0
2   20      F         1
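Since the question mentions that freq may have more than two columns to match, here is a sketch of how the same ID-folding idea might be generalized to any number of key columns. It swaps the hand-written `aIDs*(gIDs.max()+1) + gIDs` for `np.ravel_multi_index`, which is my substitution and not part of the answer above; the `keys` list is likewise an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({
    "age": [10, 23, 20],
    "gender": ["F", "M", "F"],
    "counting": [0, 0, 0],
})

keys = ["age", "gender"]  # assumed key columns; extend as needed

codes, sizes = [], []
for col in keys:
    # Factorize df and freq together so equal values get the same code.
    stacked = np.concatenate([df[col].to_numpy(), freq[col].to_numpy()])
    ids, uniques = pd.factorize(stacked)
    codes.append(ids)
    sizes.append(len(uniques))

# Fold the per-column codes into one flat ID per row.
flat_ids = np.ravel_multi_index(tuple(codes), tuple(sizes))

n = len(df)
counts = np.bincount(flat_ids[:n], minlength=int(np.prod(sizes)))
freq["counting"] = counts[flat_ids[n:]]
print(freq)
```

The bincount array has one slot per possible combination of codes, so this is best suited to key columns with a modest number of distinct values.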

Specific scenario optimization #1

If the age column is known to contain only non-negative integers, we can skip one factorize. So, skip aIDs,aU = pd.factorize(agec) and compute cIDs instead with -

cIDs = agec*(gIDs.max()+1) + gIDs
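Put together, the shortcut might look like the sketch below on the sample data. It assumes ages are non-negative and not huge, because np.bincount allocates one slot for every ID up to the largest one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({
    "age": [10, 23, 20],
    "gender": ["F", "M", "F"],
    "counting": [0, 0, 0],
})

# Ages are used directly as codes; only gender needs factorizing.
agec = np.concatenate([df["age"].to_numpy(), freq["age"].to_numpy()])
genderc = np.concatenate([df["gender"].to_numpy(), freq["gender"].to_numpy()])
gIDs, gU = pd.factorize(genderc)

cIDs = agec * (gIDs.max() + 1) + gIDs

n = len(df)
count = np.bincount(cIDs[:n], minlength=cIDs.max() + 1)
freq["counting"] = count[cIDs[n:]]
print(freq)
```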

Divakar answered Oct 23 '22 17:10


Another way is to use reindex to filter the grouped counts down to the combinations listed in freq:

df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))

Output:

            country
gender age         
F      10         2
M      10         1
F      20         1
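A possible variation on the same reindex idea, sketched under a few assumptions of mine: it uses size() so the result is a single counting column rather than one column per remaining field, builds the target index with pd.MultiIndex.from_frame, and passes fill_value=0 so that a combination missing from df (the made-up (23, M) row here) gets 0 instead of NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
# Hypothetical freq: (23, M) never occurs in df.
freq = pd.DataFrame({
    "age": [10, 23, 20],
    "gender": ["F", "M", "F"],
    "counting": [0, 0, 0],
})

keys = ["age", "gender"]

# One count per combination that actually occurs in df.
counts = df.groupby(keys).size()

# Keep only the combinations requested in freq, in freq's order.
wanted = pd.MultiIndex.from_frame(freq[keys])
result = counts.reindex(wanted, fill_value=0).rename("counting").reset_index()
print(result)
```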

Scott Boston answered Oct 23 '22 16:10