I have a DataFrame with a column of timedeltas (on inspection, the dtype is timedelta64[ns], also shown as <m8[ns]), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:
import pandas as pd
import numpy as np
pd.__version__
Out[3]: '0.13.0rc1'
np.__version__
Out[4]: '1.8.0'
data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
data
Out[8]:
f1 f2 td
0 0.990140 0.948313 00:00:00.003066
1 0.277125 0.993549 00:00:00.001443
2 0.016427 0.581129 00:00:00.009257
3 0.048662 0.512215 00:00:00.000702
4 0.846301 0.179160 00:00:00.000396
5 0.568323 0.419887 00:00:00.000266
6 0.328182 0.919897 00:00:00.006138
7 0.292882 0.213219 00:00:00.008876
8 0.623332 0.003409 00:00:00.000322
9 0.650436 0.844180 00:00:00.006873
[10 rows x 3 columns]
data.groupby(data.index < 5).mean()
Out[9]:
f1 f2
False 0.492631 0.480118
True 0.435731 0.642873
[2 rows x 2 columns]
Or, forcing pandas to try the operation on the 'td' column:
data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()
/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
417 """
418 try:
--> 419 return self._cython_agg_general('mean')
420 except GroupByError:
421 raise
/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
669
670 if len(output) == 0:
--> 671 raise DataError('No numeric types to aggregate')
672
673 return self._wrap_aggregated_output(output, names)
DataError: No numeric types to aggregate
However, taking the mean of the column works fine, so numeric operations should be possible:
data['td'].mean()
Out[11]:
0 00:00:00.003734
dtype: timedelta64[ns]
Obviously it's easy enough to coerce to float before doing the groupby, but I figured I might as well try to understand what I'm running into.
Edit: See https://github.com/pydata/pandas/issues/5724
It turns out this is a pandas issue: this behavior needs to be implemented in groupby.py.
In the meantime, please enjoy this workaround that casts to float (units of seconds):
data['td'] = [10**-9 * float(td) for td in data['td']]
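The same idea can be sketched without overwriting the original column. This is a minimal example, assuming a pandas version newer than the 0.13.0rc1 used in the question, where the `.dt.total_seconds()` accessor and `pd.to_timedelta` are available. The column name `td_seconds` and the seeded random data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the frame in the question.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((10, 3)), columns=['f1', 'f2', 'td'])
nanos = (data['td'] * 10000000).astype('int64')
data['td'] = pd.to_timedelta(nanos, unit='ns')

# Cast the timedelta column to float seconds so groupby treats it as numeric.
data['td_seconds'] = data['td'].dt.total_seconds()

# Aggregate only the numeric columns, grouped as in the question.
means = data.groupby(data.index < 5)[['f1', 'f2', 'td_seconds']].mean()

# Convert the aggregated seconds back to timedeltas if that form is needed.
means['td'] = pd.to_timedelta(means['td_seconds'], unit='s')
print(means)
```

Keeping the float copy in a separate column leaves the original timedeltas intact, at the cost of one extra column in the result.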