Pandas Dataframe - Bin on multiple columns & get statistics on another column

Problem

I have a target variable x and some additional variables A and B. I want to calculate averages (and other statistics) of x when certain conditions for A and B are met. A real world example would be to calculate the average air temperature (x) from a long series of measurements when solar radiation (A) and wind speed (B) fall into certain pre-defined bin ranges.

Potential solutions

I have been able to accomplish this with loops (see example below), but I've learned that I should avoid looping over dataframes. From my research on this site I feel like there is probably a much more elegant / vectorized solution using either pd.cut or np.select, but I frankly couldn't figure out how to do it.

Example

Generate sample data

import pandas as pd
import numpy as np

n = 100
df = pd.DataFrame(
    {
        "x": np.random.randn(n),
        "A": np.random.randn(n)+5,
        "B": np.random.randn(n)+10
    }
)

df.head() output:

    x           A           B
0   -0.585313   6.038620    9.909762
1   0.412323    3.991826    8.836848
2   0.211713    5.019520    9.667349
3   0.710699    5.353677    9.757903
4   0.681418    4.452754    10.647738

Calculate bin averages

# define bin ranges
bins_A = np.arange(3, 8)
bins_B = np.arange(8, 13)

# prepare output lists
A_mins = []
A_maxs = []
B_mins = []
B_maxs = []
x_means = []
x_stds = []
x_counts = []

# loop over bins
for i_A in range(0, len(bins_A)-1):
    A_min = bins_A[i_A]
    A_max = bins_A[i_A+1]
    for i_B in range(0, len(bins_B)-1):
        B_min = bins_B[i_B]
        B_max = bins_B[i_B+1]
        
        # binning conditions for current step
        conditions = np.logical_and.reduce(
            [
                df["A"] > A_min,
                df["A"] < A_max,
                df["B"] > B_min,
                df["B"] < B_max,
            ]
        )
        
        # calculate statistics for x and store values in lists
        x_values = df.loc[conditions, "x"]
        x_means.append(x_values.mean())
        x_stds.append(x_values.std())
        x_counts.append(x_values.count())

        A_mins.append(A_min)
        A_maxs.append(A_max)
        B_mins.append(B_min)
        B_maxs.append(B_max)
        

Store the result in a new dataframe

binned = pd.DataFrame(
    data={
        "A_min": A_mins,
        "A_max": A_maxs,
        "B_min": B_mins,
        "B_max": B_maxs,
        "x_mean": x_means,
        "x_std": x_stds,
        "x_count": x_counts 
        }
)

binned.head() output:

    A_min   A_max   B_min   B_max   x_mean      x_std       x_count
0   3       4       8       9       0.971624    0.790972    2
1   3       4       9       10      0.302795    0.380102    3
2   3       4       10      11      0.447398    1.787659    5
3   3       4       11      12      0.462149    1.195844    2
4   4       5       8       9       0.379431    0.983965    4
asked Feb 21 '20 by Fred S
3 Answers

Approach #1 : Pandas + NumPy (some to none)

We will try to keep this within pandas/NumPy so that we can leverage dataframe or array methods and ufuncs, while vectorizing at their level. This makes it easier to extend the functionality when complex problems are to be solved or statistics are to be generated, as seems to be the case here.

Now, to solve the problem while keeping it close to pandas, we generate intermediate IDs or tags that jointly track A and B against the given bins bins_A and bins_B respectively. One way to do so is to use searchsorted on the two columns separately -

tagsA = np.searchsorted(bins_A,df.A)
tagsB = np.searchsorted(bins_B,df.B)

Now, we are interested in the within-the-bounds cases only, hence masking is needed -

vm = (tagsB>0) & (tagsB<len(bins_B)) & (tagsA>0) & (tagsA<len(bins_A))

Let's apply this mask on the original dataframe -

dfm = df.iloc[vm].copy()  # copy, so adding the tag columns below doesn't raise SettingWithCopyWarning

Add in the tags for the valid rows; these represent the A_min and B_min equivalents and hence will show up in the final output -

dfm['TA'] = bins_A[(tagsA-1)[vm]]
dfm['TB'] = bins_B[(tagsB-1)[vm]]

So, our tagged dataframe is ready, which could then be describe-d to get the common stats after grouping on these two tags -

df_out = dfm.groupby(['TA','TB'])['x'].describe()

Sample run to make things clearer, comparing against the solution posted in the question -

In [46]: np.random.seed(0)
    ...: n = 100
    ...: df = pd.DataFrame(
    ...:     {
    ...:         "x": np.random.randn(n),
    ...:         "A": np.random.randn(n)+5,
    ...:         "B": np.random.randn(n)+10
    ...:     }
    ...: )

In [47]: binned
Out[47]: 
    A_min  A_max  B_min  B_max    x_mean     x_std  x_count
0       3      4      8      9  0.400199  0.719007        5
1       3      4      9     10 -0.268252  0.914784        6
2       3      4     10     11  0.458746  1.499419        5
3       3      4     11     12  0.939782  0.055092        2
4       4      5      8      9  0.238318  1.173704        5
5       4      5      9     10 -0.263020  0.815974        8
6       4      5     10     11 -0.449831  0.682148       12
7       4      5     11     12 -0.273111  1.385483        2
8       5      6      8      9 -0.438074       NaN        1
9       5      6      9     10 -0.009721  1.401260       16
10      5      6     10     11  0.467934  1.221720       11
11      5      6     11     12  0.729922  0.789260        3
12      6      7      8      9 -0.977278       NaN        1
13      6      7      9     10  0.211842  0.825401        7
14      6      7     10     11 -0.097307  0.427639        5
15      6      7     11     12  0.915971  0.195841        2

In [48]: df_out
Out[48]: 
       count      mean       std  ...       50%       75%       max
TA TB                             ...                              
3  8     5.0  0.400199  0.719007  ...  0.302472  0.976639  1.178780
   9     6.0 -0.268252  0.914784  ... -0.001510  0.401796  0.653619
   10    5.0  0.458746  1.499419  ...  0.462782  1.867558  1.895889
   11    2.0  0.939782  0.055092  ...  0.939782  0.959260  0.978738
4  8     5.0  0.238318  1.173704  ... -0.212740  0.154947  2.269755
   9     8.0 -0.263020  0.815974  ... -0.365103  0.449313  0.950088
   10   12.0 -0.449831  0.682148  ... -0.436773 -0.009697  0.761038
   11    2.0 -0.273111  1.385483  ... -0.273111  0.216731  0.706573
5  8     1.0 -0.438074       NaN  ... -0.438074 -0.438074 -0.438074
   9    16.0 -0.009721  1.401260  ...  0.345020  1.284173  1.950775
   10   11.0  0.467934  1.221720  ...  0.156349  1.471263  2.240893
   11    3.0  0.729922  0.789260  ...  1.139401  1.184846  1.230291
6  8     1.0 -0.977278       NaN  ... -0.977278 -0.977278 -0.977278
   9     7.0  0.211842  0.825401  ...  0.121675  0.398750  1.764052
   10    5.0 -0.097307  0.427639  ... -0.103219  0.144044  0.401989
   11    2.0  0.915971  0.195841  ...  0.915971  0.985211  1.054452

So, as mentioned earlier, we have our A_min and B_min in TA and TB, while the relevant statistics are captured in the other headers. Note that this is a multi-index dataframe. If we need to capture the equivalent array data, simply do: df_out.loc[:,['count','mean','std']].values for the stats, and np.vstack(df_out.index) for the bin interval-starts.
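As a small illustration of that extraction (a sketch using the df_out from the sample run above):

# stats as a (16, 3) array, one row per (TA, TB) group
stats = df_out.loc[:, ['count', 'mean', 'std']].values

# bin interval-starts as a (16, 2) array of (A_min, B_min) pairs
starts = np.vstack(df_out.index)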

Alternatively, to capture the equivalent stat data without describe, but using dataframe methods, we can do something like this -

dfmg = dfm.groupby(['TA','TB'])['x']
dfmg.size().unstack().values
dfmg.std().unstack().values
dfmg.mean().unstack().values

Alternative #1 : Using pd.cut

We can also use pd.cut, as suggested in the question, to replace searchsorted with something more compact, since the out-of-bounds values are handled automatically, keeping the basic idea the same -

df['TA'] = pd.cut(df['A'],bins=bins_A, labels=range(len(bins_A)-1))
df['TB'] = pd.cut(df['B'],bins=bins_B, labels=range(len(bins_B)-1))
df_out = df.groupby(['TA','TB'])['x'].describe()

So, this gives us the stats. For A_min and B_min equivalents, simply use the index levels -

A_min = bins_A[df_out.index.get_level_values(0)]
B_min = bins_B[df_out.index.get_level_values(1)]

Or use some meshgrid method -

mA,mB = np.meshgrid(bins_A[:-1],bins_B[:-1])
A_min,B_min = mA.ravel('F'),mB.ravel('F')

Approach #2 : With bincount

We can leverage np.bincount to get all three stat metrics, including the standard deviation, again in a vectorized manner -

lA,lB = len(bins_A),len(bins_B)
n = lA+1              # number of possible searchsorted tags along A (0..lA)

x,A,B = df.x.values,df.A.values,df.B.values

tagsA = np.searchsorted(bins_A,A)
tagsB = np.searchsorted(bins_B,B)

# combined 1D tag per row; tagsB can reach lB, so reserve lB+1 rows
t = tagsB*n + tagsA

L = n*(lB+1)

countT = np.bincount(t, minlength=L)
countT_x = np.bincount(t,x, minlength=L)
avg_all = countT_x/countT
# first and last rows/cols hold the out-of-bounds tags; drop them
count = countT.reshape(-1,n)[1:-1,1:-1].ravel('F')
avg = avg_all.reshape(-1,n)[1:-1,1:-1].ravel('F')

# Using numpy std definition for ddof case
ddof = 1.0 # default one for pandas std
grp_diffs = (x-avg_all[t])**2
std_all = np.sqrt(np.bincount(t,grp_diffs, minlength=L)/(countT-ddof))
stds = std_all.reshape(-1,n)[1:-1,1:-1].ravel('F')
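For reference, a minimal sketch (my addition, not part of the original answer) that assembles these flat arrays back into a dataframe shaped like binned from the question, reusing the meshgrid idea from Alternative #1:

# bin edges laid out in the same outer-A, inner-B order as count/avg/stds
mA0, mB0 = np.meshgrid(bins_A[:-1], bins_B[:-1])
mA1, mB1 = np.meshgrid(bins_A[1:], bins_B[1:])

binned_v = pd.DataFrame({
    "A_min": mA0.ravel('F'), "A_max": mA1.ravel('F'),
    "B_min": mB0.ravel('F'), "B_max": mB1.ravel('F'),
    "x_mean": avg, "x_std": stds, "x_count": count,
})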

Approach #3 : With sorting to leverage reduceat methods -

x,A,B = df.x.values,df.A.values,df.B.values
vm = (A>bins_A[0]) & (A<bins_A[-1]) & (B>bins_B[0]) & (B<bins_B[-1])

xm = x[vm]

tagsA = np.searchsorted(bins_A,A)
tagsB = np.searchsorted(bins_B,B)

tagsAB = tagsB*(tagsA.max()+1) + tagsA
tagsABm = tagsAB[vm]
sidx = tagsABm.argsort()
tagsAB_s = tagsABm[sidx]
xms = xm[sidx]

cut_idx = np.flatnonzero(np.r_[True,tagsAB_s[:-1]!=tagsAB_s[1:],True])
N = (len(bins_A)-1)*(len(bins_B)-1)

count = np.diff(cut_idx)
avg = np.add.reduceat(xms,cut_idx[:-1])/count
stds = np.empty(N)
# note: this assumes every (A,B) bin combination is non-empty,
# so that the number of sorted groups equals N
for ii,(s0,s1) in enumerate(zip(cut_idx[:-1],cut_idx[1:])):
    stds[ii] = np.std(xms[s0:s1], ddof=1)

To get in the same or similar format as the pandas dataframe styled output, we need to reshape. Hence, it would be avg.reshape(-1,len(bins_A)-1).T and so on.
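As a rough sketch of that reshape (my addition; it assumes every bin combination is non-empty, so the flat arrays have exactly N entries):

nA = len(bins_A) - 1
avg_2d = avg.reshape(-1, nA).T      # rows: A-bins, cols: B-bins
count_2d = count.reshape(-1, nA).T
stds_2d = stds.reshape(-1, nA).T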

answered by Divakar

If what you are concerned about is performance, you can keep your for loops with minor changes by using numba.

Here you have a function that does the calculations. The key is that calculate uses numba, so it is really fast. The rest is only for creating a pandas dataframe:

import numpy as np
import pandas as pd
from numba import njit

def calc_numba(df, bins_A, bins_B):
    """ wrapper for the timeit. It only creates a dataframe """

    @njit
    def calculate(A, B, x, bins_A, bins_B):

        size = (len(bins_A) - 1)*(len(bins_B) - 1)
        out = np.empty((size, 7))

        index = 0
        for i_A, A_min in enumerate(bins_A[:-1]):
            A_max = bins_A[i_A + 1]

            for i_B, B_min in enumerate(bins_B[:-1]):
                B_max = bins_B[i_B + 1]

                mfilter = (A_min < A)*(A < A_max)*(B_min < B)*(B < B_max)
                x_values = x[mfilter]

                out[index, :] = [
                    A_min,
                    A_max,
                    B_min,
                    B_max,
                    x_values.mean(),
                    x_values.std(),
                    len(x_values)
                ]

                index += 1

        return out

    columns = ["A_min", "A_max", "B_min", "B_max", "mean", "std", "count"]
    out = calculate(df["A"].values, df["B"].values, df["x"].values, bins_A, bins_B)
    return pd.DataFrame(out, columns=columns)
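A quick usage sketch (assuming the df, bins_A and bins_B from the question are in scope):

binned_nb = calc_numba(df, bins_A, bins_B)
print(binned_nb.head())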

Performance test

Using n = 1_000_000 and the same bins_A and bins_B we get:

%timeit code_question(df, bins_A, bins_B)
15.7 s ± 428 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit calc_numba(df, bins_A, bins_B)
507 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It is around 30 times faster than the code from the question.

It will be really hard to beat the numba performance, since pandas builtin methods use similar enhancements.

answered by villoro


Here is a short solution using only NumPy and pandas. It is certainly not the most efficient way, but I guess it is the most straightforward and easy-to-understand one.

import pandas as pd
import numpy as np


n = 20
df = pd.DataFrame(
    {
        "x": np.random.randn(n),
        "A": np.random.randn(n)+5,
        "B": np.random.randn(n)+10
    }
)

# define bin ranges
bins_A = np.arange(3, 8)
bins_B = np.arange(8, 13)

Up to here I use your example. Then I introduce the lower and upper bin edges using numpy:

A_mins = bins_A[:-1]
A_maxs = bins_A[1:]
B_mins = bins_B[:-1]
B_maxs = bins_B[1:]

Putting this together in the way you were effectively using the nested loops, I limit myself to numpy while still maintaining exactly the structure you would get with the nested loops:

A_mins_list = np.repeat(A_mins, len(B_mins))
A_maxs_list = np.repeat(A_maxs, len(B_mins))
B_mins_list = np.tile(B_mins, len(A_mins))
B_maxs_list = np.tile(B_maxs, len(A_mins))

The new dataframe is initialized with the bin information.

newdf = pd.DataFrame(
    np.array([A_mins_list, A_maxs_list, B_mins_list, B_maxs_list]).T,
    columns=['Amin', 'Amax', 'Bmin', 'Bmax']
)

The xvalues column is the most wicked one here, since I have to make it a numpy array to fit it into the dataframe. Each cell then holds a numpy array and has to be treated as one. Keep that in mind, as some pandas functions might not work on it; in some cases it has to be a numpy function.

newdf['xvalues'] = newdf.apply(
    lambda row: np.array(
        df.x[(row.Amin < df.A) & (row.Amax > df.A) &
             (row.Bmin < df.B) & (row.Bmax > df.B)]
    ),
    axis=1
)

Furthermore, you can do whatever you want with lambda functions. As I said, this is maybe not the most efficient way of doing it, but the code is fairly clear, and as long as you don't need the highest performance (as for dataframes with millions of entries), it is easily extended by

newdf['xmeans'] = newdf.apply(lambda row: row.xvalues.mean(), axis=1)
newdf['stds'] = newdf.apply(lambda row: row.xvalues.std(), axis=1)
newdf['xcounts'] = newdf.apply(lambda row: row.xvalues.size, axis=1)

or whatever you might like.

Using cython, the performance could be improved significantly by avoiding the lambda approach, but I am not used to cython, so I would rather leave that to the experts...

Additionally, please note that some warnings might be raised if you try to take the mean of an empty array or the std of a single value. If desired, those can be suppressed using the warnings package.
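For instance, a minimal sketch with the standard-library warnings module (my suggestion, not from the original answer):

import warnings

with warnings.catch_warnings():
    # numpy raises RuntimeWarning for statistics over empty arrays
    warnings.simplefilter("ignore", category=RuntimeWarning)
    newdf['xmeans'] = newdf.apply(lambda row: row.xvalues.mean(), axis=1)
    newdf['stds'] = newdf.apply(lambda row: row.xvalues.std(), axis=1)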

answered by Lepakk