
How to get the mode for string variable when resampling with pandas

I am trying to resample a pandas data frame with a timestamp index to an hourly frequency. I am interested in obtaining the most frequent value for a column that holds strings. However, the built-in time series resampling methods do not include the mode as one of the defaults (as they do 'mean' and 'count').
I tried to define my own function and pass it in, but it is not working. I've also tried np.bincount, but that does not work because I am working with strings.

Here is how my data looks:

                   station_arrived action     lat1     lon1
date_removed
2012-01-01 13:12:00     56             A     19.4171 -99.16561   
2012-01-01 13:12:00     56             A     19.4271 -99.16361 
2012-01-01 15:41:00     56             A     19.4171 -99.16561 
2012-01-02 08:41:00     56             C     19.4271 -99.16561 
2012-01-02 11:36:00     56             C     19.2171 -99.16561

This is my code so far:

from collections import Counter

def mode(algo):
    # most_common(1) yields [(item, count)]; keep only the item, as a one-element list
    common = [ite for ite, it in Counter(algo).most_common(1)]
    return common

hourlycount2 = travels2012.resample('H', how={'station_arrived': 'count',
                                              'action': mode(travels2012['action']),
                                              'lat1':'count', 'lon1':'count'})

hourlycount2.head()

I see the following error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\generic.py", line 2836, in resample
    return sampler.resample(self).__finalize__(self)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
    rs = self._resample_timestamps()
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 277, in _resample_timestamps
    result = grouped.aggregate(self._agg_method)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2404, in aggregate
    result[col] = colg.aggregate(agg_how)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2076, in aggregate
    ret = self._aggregate_multiple_funcs(func_or_funcs)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2125, in _aggregate_multiple_funcs
    results[name] = self.aggregate(func)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2073, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 486, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'A  '
asado23 asked Oct 02 '14 21:10



1 Answer

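The values in the dict have to be either strings naming a function (e.g. 'count'/'sum'/'max') or functions, which are then called on each group. What you are passing is the result (the value) of mode(travels2012['action']), which is a one-element list holding the most common action. As the traceback shows, pandas treats that list as a list of function names and tries to look the string up as a method on the group, hence the AttributeError.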

So you need to make this a function, which is applied to each group:

In [11]: df.resample('H', how={'station_arrived':'count',
                               'action': lambda x: mode(df['action']),
                                'lat1':'count', 'lon1':'count'})
Out[11]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00    [A]                2     2     2
2012-01-01 14:00:00    [A]                0     0     0
2012-01-01 15:00:00    [A]                1     1     1
2012-01-01 16:00:00    [A]                0     0     0
...

I'm not sure that this is what you want, as it computes the mode of the entire column for every hour. Perhaps you want to take the mode of each group instead:

In [12]: df.resample('H', how={'station_arrived':'count',
                               'action': mode, 'lat1':'count', 'lon1':'count'})
Out[12]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00    [A]                2     2     2
2012-01-01 14:00:00     []                0     0     0
2012-01-01 15:00:00    [A]                1     1     1
2012-01-01 16:00:00     []                0     0     0
...

I would prefer to see the actual value (A) rather than a one-element list, and NaN rather than [].
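A minimal sketch of one way to get that from a Counter-based function like the one in the question: unwrap the single (item, count) pair and fall back to NaN when the group has no rows. The name `counter_mode` is hypothetical, and ties are broken by whichever item Counter saw first.

```python
import math
from collections import Counter


def counter_mode(values):
    """Return the most common item in values, or NaN if values is empty."""
    counts = Counter(values)
    if not counts:
        # An empty hour has nothing to count, so report a missing value.
        return math.nan
    # most_common(1) is a one-element list of (item, count) pairs.
    item, _count = counts.most_common(1)[0]
    return item


print(counter_mode(["A", "A", "C"]))
print(counter_mode([]))
```

Since it takes any iterable, it should be usable as the per-group function in place of mode.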


I think it's worth mentioning the Series mode method, which has the caveat that it always returns a Series (as there may be a tie) and, in the pandas version used here, is empty if no value appears more than once (newer pandas versions return all tied values instead).
You could wrap it as follows (and you can similarly wrap your mode function):

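import numpy as np

def mode_(s):
    # Return the first mode of the group, or NaN when mode() is empty.
    try:
        # .iloc is positional, so an empty result raises IndexError
        return s.mode().iloc[0]
    except IndexError:
        return np.nan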

In [22]: df.resample('H', how={'station_arrived':'count',
                               'action': mode_, 'lat1':'count', 'lon1':'count'})
Out[22]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00      A                2     2     2
2012-01-01 14:00:00    NaN                0     0     0
2012-01-01 15:00:00    NaN                1     1     1
2012-01-01 16:00:00    NaN                0     0     0
...
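Later pandas releases dropped the how= argument of resample in favour of calling .agg() on the resampler. The following is a rough sketch of the same hourly aggregation in that style, not a tested drop-in for the original data: the frame is rebuilt by hand from the five sample rows in the question, `first_mode` is a hypothetical helper name, and the lowercase "h" frequency alias is assumed to be accepted by the installed pandas.

```python
import numpy as np
import pandas as pd


def first_mode(s):
    """Most frequent value in the group, or NaN when mode() returns nothing."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "station_arrived": [56, 56, 56, 56, 56],
        "action": ["A", "A", "A", "C", "C"],
        "lat1": [19.4171, 19.4271, 19.4171, 19.4271, 19.2171],
        "lon1": [-99.16561, -99.16361, -99.16561, -99.16561, -99.16561],
    },
    index=pd.DatetimeIndex(
        [
            "2012-01-01 13:12",
            "2012-01-01 13:12",
            "2012-01-01 15:41",
            "2012-01-02 08:41",
            "2012-01-02 11:36",
        ],
        name="date_removed",
    ),
)

# One aggregation per column: a string names a built-in, a callable runs per hour.
hourly = df.resample("h").agg(
    {
        "station_arrived": "count",
        "action": first_mode,
        "lat1": "count",
        "lon1": "count",
    }
)

print(hourly.head())
```

Hours with no rows still appear in the result, which is why the helper has to cope with an empty group.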
Andy Hayden answered Oct 05 '22 23:10
