
split-apply-combine on pandas timedelta column

Tags: python, pandas

I have a DataFrame with a column of timedeltas (on inspection the dtype is timedelta64[ns], i.e. <m8[ns]), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:

import pandas as pd

import numpy as np

pd.__version__
Out[3]: '0.13.0rc1'

np.__version__
Out[4]: '1.8.0'

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])

data['td'] *= 10000000

data['td'] = pd.Series(data['td'], dtype='<m8[ns]')

data
Out[8]: 
         f1        f2              td
0  0.990140  0.948313 00:00:00.003066
1  0.277125  0.993549 00:00:00.001443
2  0.016427  0.581129 00:00:00.009257
3  0.048662  0.512215 00:00:00.000702
4  0.846301  0.179160 00:00:00.000396
5  0.568323  0.419887 00:00:00.000266
6  0.328182  0.919897 00:00:00.006138
7  0.292882  0.213219 00:00:00.008876
8  0.623332  0.003409 00:00:00.000322
9  0.650436  0.844180 00:00:00.006873

[10 rows x 3 columns]

data.groupby(data.index < 5).mean()
Out[9]: 
             f1        f2
False  0.492631  0.480118
True   0.435731  0.642873

[2 rows x 2 columns]

Or, forcing pandas to try the operation on the 'td' column:

data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
    417         """
    418         try:
--> 419             return self._cython_agg_general('mean')
    420         except GroupByError:
    421             raise

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
    669 
    670         if len(output) == 0:
--> 671             raise DataError('No numeric types to aggregate')
    672 
    673         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

However, taking the mean of the column works fine, so numeric operations should be possible:

data['td'].mean()
Out[11]: 
0   00:00:00.003734
dtype: timedelta64[ns]

Obviously it's easy enough to coerce to float before doing the groupby, but I figured I might as well try to understand what I'm running into.
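One way to see why numeric operations should be possible: a timedelta64[ns] column is stored as integer nanoseconds, so it can be aggregated as integers and converted back. The following is only a minimal sketch under that assumption; the frame `df`, the group key `key`, and the sample values are hypothetical and not the data shown above. It swaps in an integer round trip rather than the float coercion mentioned here.

```python
import pandas as pd

# Hypothetical small frame (not the data from the question).
# The explicit astype pins the column to nanosecond resolution.
df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3, 0.4],
    "td": pd.to_timedelta([1, 3, 10, 20], unit="ms").astype("timedelta64[ns]"),
})

# Group key: the first two rows versus the rest.
key = df.index < 2

# Reinterpret the durations as integer nanoseconds, aggregate per group,
# then convert the aggregated values back to timedeltas.
ns = df["td"].astype("int64")
mean_ns = ns.groupby(key).mean()
mean_td = pd.to_timedelta(mean_ns, unit="ns")

print(mean_td)
```

The group means come back as a timedelta Series indexed by the group key, so the result can be joined to the means of the float columns if needed.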

Edit: See https://github.com/pydata/pandas/issues/5724

asked Dec 17 '13 04:12 by ontologist



1 Answer

It turns out this is a pandas issue: this behavior still needs to be implemented in groupby.py.

In the meantime, please enjoy this workaround that casts to float (units of seconds):

data['td'] = [10**-9 * float(td) for td in data['td']]
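For readers on a more recent pandas, the same cast-to-seconds idea can probably be written without a Python-level loop. This is a hedged sketch, not the approach from the linked issue: it assumes a pandas version that provides the `Series.dt.total_seconds()` accessor, and the frame `df` and the column name `td_seconds` are hypothetical.

```python
import pandas as pd

# Hypothetical small frame (not the data from the question).
df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3, 0.4],
    "td": pd.to_timedelta([1, 3, 10, 20], unit="ms"),
})

# Keep the original timedelta column and add a float column in seconds.
df["td_seconds"] = df["td"].dt.total_seconds()

# Aggregate the float columns per group (first two rows versus the rest).
grouped = df.groupby(df.index < 2)[["f1", "td_seconds"]].mean()

# Optionally convert the aggregated seconds back to timedeltas.
grouped["td"] = pd.to_timedelta(grouped["td_seconds"], unit="s")

print(grouped)
```

Writing the seconds to a separate column, rather than overwriting 'td', keeps the original durations available after the groupby.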
answered Oct 19 '22 19:10 by ontologist