I have a (really big) pandas Dataframe df:
country age gender
Brazil 10 F
USA 20 F
Brazil 10 F
USA 20 M
Brazil 10 M
USA 20 M
I have another pandas Dataframe freq:
age gender counting
10 F 0
10 M 0
20 F 0
I want to count how often each pair of values in freq occurs in df:
age gender counting
10 F 2
10 M 1
20 F 1
I'm using this code, but it takes too long:
for row in df.itertuples(index=False):
    freq.loc[np.all(freq[['age', 'gender']] == row[1:3], axis=1), 'counting'] += 1
Is there a faster way to do that?
Please note:
Using the size() or count() method with pandas.DataFrame.groupby() generates the count of the number of occurrences of the data present in a particular column of the dataframe.
We can also count by using the value_counts() method. This function counts the values present in the entire dataframe and can also count values in a particular column.
The value_counts() method returns a Series containing the counts of unique values; that is, for any column of a dataframe, it returns how many times each unique entry appears in that column.
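For illustration, a minimal sketch of both approaches on a dataframe shaped like the one above (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'country': ['Brazil', 'USA', 'Brazil', 'USA', 'Brazil', 'USA'],
    'age':     [10, 20, 10, 20, 10, 20],
    'gender':  ['F', 'F', 'F', 'M', 'M', 'M'],
})

# Number of rows for each (age, gender) combination
print(df.groupby(['age', 'gender']).size())

# Counts of each unique value in a single column
print(df['gender'].value_counts())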
You can do it with an inner merge to filter out the combinations in df you don't want, then groupby age and gender and count the column counting. Just reset_index to fit your expected output.
freq = (df.merge(freq, on=['age', 'gender'], how='inner')
.groupby(['age','gender'])['counting'].size()
.reset_index())
print(freq)
age gender counting
0 10 F 2
1 10 M 1
2 20 F 1
Depending on the number of combinations you don't want, it could be faster to groupby on df before doing the merge, like:
freq = (df.groupby(['age','gender']).size()
.rename('counting').reset_index()
.merge(freq[['age','gender']])
)
Bringing NumPy into the mix for some performance (hopefully!), with the idea of dimensionality-reduction to 1D so that we can bring in the efficient bincount -
import numpy as np
import pandas as pd

# Stack the key columns from df and freq so both get consistent codes
agec = np.r_[df.age,freq.age]
genderc = np.r_[df.gender,freq.gender]
# Encode each key column as integer IDs
aIDs,aU = pd.factorize(agec)
gIDs,gU = pd.factorize(genderc)
# Collapse each (age, gender) ID pair into a single 1D code
cIDs = aIDs*(gIDs.max()+1) + gIDs
# Count the codes coming from df's rows, then look up the counts for freq's rows
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]
Sample run -
In [44]: df
Out[44]:
country age gender
0 Brazil 10 F
1 USA 20 F
2 Brazil 10 F
3 USA 20 M
4 Brazil 10 M
5 USA 20 M
In [45]: freq # introduced a missing element as the second row for variety
Out[45]:
age gender counting
0 10 F 2
1 23 M 0
2 20 F 1
Specific scenario optimization #1
If the age column is known to contain only integers, we can skip one factorize. So, skip aIDs,aU = pd.factorize(agec) and compute cIDs instead with -
cIDs = agec*(gIDs.max()+1) + gIDs
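Putting that together, a minimal sketch of this variant (the assumption being that age holds small non-negative integers, since the raw ages are used directly as bincount codes):
import numpy as np
import pandas as pd

agec = np.r_[df.age, freq.age]
genderc = np.r_[df.gender, freq.gender]
# Only gender needs factorizing; the integer ages act as codes directly
gIDs, gU = pd.factorize(genderc)
cIDs = agec*(gIDs.max()+1) + gIDs
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]
The trade-off is that bincount now allocates an array of roughly max(age) * number-of-genders entries, so this only pays off while the ages stay small.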
Another way is to use reindex to filter down to the freq list:
df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
Output:
country
gender age
F 10 2
M 10 1
F 20 1
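If you then want the counts back inside freq itself, a minimal sketch (the fillna(0) is my assumption for pairs listed in freq that never appear in df):
counts = (df.groupby(['gender', 'age'])['country'].count()
            .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
            .fillna(0)
            .astype(int))
freq['counting'] = counts.to_numpy()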