
How to use generators in Pandas

I'm learning to use generators but don't quite understand how they work.

What I want to do is iterate over rows and divide one cell by another cell in each row, then create a new column with the results.

rate = (df['Fee'][i] for df['Fee'] in df / df['Costs'][i] for df['Costs'] in df * 100)

df['rate']=df.iterrows(rate)

So above, I've tried to make a generator which calculates what percentage the fee is of the costs.

I realise this would be much easier with a for loop but I wanted to learn how a generator would be used in this instance.

Example dataframe below.

          Industry  Expr1        Fee        Costs
      Food & Drink   June   9970.320    116171.15
    Music Industry   June   7255.534    131492.59
     Manufacturing   June   5278.960    171315.01
    Music Industry   June   6120.596    143688.78
Telecommunications  April   4123.986     78733.09
Iwan asked Sep 06 '25 03:09

1 Answer

The succinct answer is "You don't". Or as the Pandas documentation puts it:

When doing data analysis, as with raw NumPy arrays looping through Series value-by-value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.

This also applies to DataFrames and many other structures that leverage ndarray. For more insight I would really recommend learning more about how pandas/NumPy/SciPy work internally.

Regarding this particular topic, I would point you to "Intro to Data Structures - Data Alignment and Arithmetic" in the Pandas documentation and "Broadcasting" in the NumPy documentation.
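To make the element-wise arithmetic concrete, here is a minimal sketch that rebuilds the question's sample dataframe and computes the percentage column without any explicit loop. The column name `rate` comes from the question; everything else is just the sample data typed in by hand.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Industry": ["Food & Drink", "Music Industry", "Manufacturing",
                 "Music Industry", "Telecommunications"],
    "Expr1": ["June", "June", "June", "June", "April"],
    "Fee": [9970.320, 7255.534, 5278.960, 6120.596, 4123.986],
    "Costs": [116171.15, 131492.59, 171315.01, 143688.78, 78733.09],
})

# Arithmetic between two columns is applied element by element,
# so one expression fills the whole new column.
df["rate"] = df["Fee"] / df["Costs"] * 100

print(df[["Industry", "Fee", "Costs", "rate"]])
```

The division and multiplication run over whole columns at once, which is the behaviour the "Data Alignment and Arithmetic" section describes.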

Behind the scenes these packages use a lot of C code to optimize operations. Generators and iterators are great, but a Python-level loop generally cannot match such optimized code. For example, here is a simple test using the data from your question.

np.all((df.Fee / df.Costs).values == np.array([x / y for x, y in df[['Fee', 'Costs']].values]))
True

%timeit (df.Fee / df.Costs).values
78.5 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.array([x / y for x, y in df[['Fee', 'Costs']].values])
331 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, the built-in division that Pandas uses internally is roughly 4x faster (78.5 µs versus 331 µs). And that is on a terribly small sample size.
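If the goal is purely to see what a working generator expression would look like for this calculation, the following is one possible sketch, not a recommendation. It uses a shortened copy of the question's data, and the column names `rate_from_generator` and `rate_vectorized` are made up for the comparison.

```python
import pandas as pd

# First three rows of the question's sample data.
df = pd.DataFrame({
    "Fee": [9970.320, 7255.534, 5278.960],
    "Costs": [116171.15, 131492.59, 171315.01],
})

# A generator expression: zip pairs up the two columns row by row,
# and each percentage is only computed when the generator is consumed.
rate_gen = (fee / costs * 100 for fee, costs in zip(df["Fee"], df["Costs"]))

# list() consumes the generator; the resulting list has one value per row,
# so it can be assigned as a new column.
df["rate_from_generator"] = list(rate_gen)

# The vectorized version, for comparison.
df["rate_vectorized"] = df["Fee"] / df["Costs"] * 100

print(df)
```

Both columns hold the same values; the generator version simply does the work one row at a time in Python, which is why the vectorized form is usually preferred.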

Grr answered Sep 07 '25 16:09