
Most frequent value using pandas.DataFrame.resample

I am using pandas.DataFrame.resample to resample a grouped Pandas dataframe with a timestamp index.

In one of the columns, I would like to resample by selecting the most frequent value. So far I have only succeeded with NumPy functions such as np.max or np.sum.

import numpy as np
import pandas as pd

#generate test dataframe
data = np.random.randint(0,10,(366,2))
index = pd.date_range(start=pd.Timestamp('1-Dec-2012'), periods=366, freq='D')
test = pd.DataFrame(data, index=index)

#generate group array
group =  np.random.randint(0,2,(366,))

#define how dictionary for resample
how_dict = {0: np.max, 1: np.min}

#perform grouping and resample
test.groupby(group).resample('48 h',how=how_dict)

The previous code works because I have used NumPy functions. However, I am not sure how to resample by the most frequent value. I tried defining a custom function like

def frequent(x):
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]

However, if I now do:

how_dict = {0: np.max, 1: frequent}

I get an empty dataframe...

df = test.groupby(group).resample('48 h',how=how_dict)
df.shape
Francesco asked Apr 06 '16 18:04


1 Answer

Your resample period is too short: when a group has no rows in a given period, your function is called on an empty array and raises a ValueError (counts.argmax() has nothing to pick from), which pandas does not handle gracefully.
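To see the failure in isolation, here is a minimal sketch that calls the question's frequent function directly, once on a non-empty array and once on an empty one. The sample arrays are made up for illustration.

```python
import numpy as np

def frequent(x):
    # Same function as in the question: most frequent value of x
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]

# Non-empty input: 3 occurs twice, so it is returned
print(frequent(np.array([3, 1, 3, 2])))  # 3

# Empty input (what an empty resample bin looks like): argmax raises
try:
    frequent(np.array([], dtype=int))
except ValueError as exc:
    print("empty bin:", exc)
```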

But it works when no group is empty, for example with regular groups:

In [8]: test.groupby(np.arange(366)%2).resample('48h',how=how_dict).head()
Out[8]: 
              0  1
0 2012-12-01  4  8
  2012-12-03  0  3
  2012-12-05  9  5
  2012-12-07  3  4
  2012-12-09  7  3

Or with longer periods:

In [9]: test.groupby(group).resample('122D',how=how_dict)
Out[9]: 
              0  1
0 2012-12-02  9  0
  2013-04-03  9  0
  2013-08-03  9  6
1 2012-12-01  9  3
  2013-04-02  9  7
  2013-08-02  9  1

EDIT

A workaround is to handle the empty case explicitly:

def frequent(x):
    # Return a sentinel for empty bins, where argmax would raise ValueError
    if len(x) == 0:
        return -1
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]

This gives:

In [11]: test.groupby(group).resample('48h',how=how_dict).head()
Out[11]: 
               0  1
0 2012-12-01   5  3
  2012-12-03   3  4
  2012-12-05 NaN -1
  2012-12-07   5  0
  2012-12-09   1  4
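Later pandas releases dropped the how= argument of resample, so the calls above will not run there as written. As a rough sketch of the same idea, the aggregation dictionary can be passed to .agg() instead. The column names "a" and "b", the fixed seed, and the choice of NaN rather than -1 for empty bins are assumptions made for this example, not part of the original question.

```python
import numpy as np
import pandas as pd

def frequent(x):
    # Empty bins get NaN instead of raising inside argmax
    if len(x) == 0:
        return np.nan
    (value, counts) = np.unique(x, return_counts=True)
    return value[counts.argmax()]

# Test data shaped like the question's, with a fixed seed (assumption)
rng = np.random.default_rng(0)
index = pd.date_range(start="2012-12-01", periods=366, freq="D")
test = pd.DataFrame(rng.integers(0, 10, (366, 2)), index=index, columns=["a", "b"])
group = rng.integers(0, 2, 366)

# Max of column "a", most frequent value of column "b", per group and 48h bin
result = test.groupby(group).resample("48h").agg({"a": "max", "b": frequent})
print(result.head())
```

The result is indexed by group and bin start, like the outputs above. Bins in which a group has no rows show NaN in both columns.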
B. M. answered Oct 16 '22 11:10