Note: The thread below prompted a pull request which was eventually merged into v1.10. This issue is now resolved.
I'm using a subclassed DataFrame so that I can have more convenient access to some transformation methods and metadata particular to my use-case. Most of the DataFrame operations work as expected, in that they return an instance of the subclass, rather than an instance of pandas.DataFrame. However, aggregation operations like DataFrame.groupby and DataFrame.resample seem to mess this up.
Is this a bug, or have I missed something when defining my subclass?
Below is a minimal example, tested on pandas 0.25.1:
import pandas as pd

class MyDataFrame(pd.DataFrame):
    # Tell pandas which class to use when an operation builds a new frame
    @property
    def _constructor(self):
        return MyDataFrame

dates = pd.date_range('2019', freq='D', periods=365)
my_df = MyDataFrame(range(len(dates)), index=dates)

assert isinstance(my_df, MyDataFrame)
# Success!
assert isinstance(my_df.diff(), MyDataFrame)
# Success!
assert isinstance(my_df.sample(10), MyDataFrame)
# Success!
assert isinstance(my_df[:10], MyDataFrame)
# Success!
assert isinstance(my_df.resample("D").sum(), MyDataFrame)
# AssertionError
assert isinstance(my_df.groupby(my_df.index.month).sum(), MyDataFrame)
# AssertionError
Groupby preserves the order of rows within each group. When calling apply, add group keys to index to identify pieces. Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to dataframe. For your first dataframe, you run unequal groupings (or unequal index lengths) coercing a series return which in the "combine" processing does not adequately yield a data frame.
To group a pandas DataFrame, use groupby(). To sort the grouped DataFrame in ascending or descending order, use sort_values(). The size() method returns the number of rows in each group.
The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.
I don't know if it's a "bug" per se, but I agree that it should be changed regardless. If you take a look at some of the source code for groupby-type objects, you'll see a lot of hardcoded return DataFrame(...) and return Series(...).
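To illustrate why a hardcoded return type drops the subclass, here is a minimal sketch. The helper column_totals_hardcoded is a hypothetical toy aggregation written for this example, not pandas source code; it only mimics the return DataFrame(...) pattern described above.

```python
import pandas as pd


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame


def column_totals_hardcoded(obj):
    """Toy aggregation (hypothetical) that always builds a plain DataFrame."""
    totals = {col: [obj[col].sum()] for col in obj.columns}
    # The return type is hardcoded, so the caller's subclass is ignored.
    return pd.DataFrame(totals)


my_df = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = column_totals_hardcoded(my_df)

print(type(result).__name__)  # DataFrame, not MyDataFrame
```

The input is a MyDataFrame, but nothing in the function ever consults its type, so the subclass cannot survive the call.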
As you rightfully pointed out, pandas objects have three constructor properties that are used to build new versions of themselves:
_constructor to create objects of the same type
_constructor_sliced to create a series-like object from a dataframe-like object
_constructor_expanddim to create a dataframe-like object from a series-like object
These can be used instead of the hardcoded types in core/groupby/generic.py, which is easy to do since the groupby objects store the starting NDFrame as the attribute obj.
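As a rough sketch of the idea (not the code in the branch or in pandas itself), the example below defines the three constructor properties, spelled as pandas spells them (_constructor, _constructor_sliced, _constructor_expanddim), on a pair of subclasses. The class ToyGrouped is a hypothetical stand-in for a groupby object: it keeps the starting frame as obj and builds its result through obj._constructor rather than a hardcoded DataFrame.

```python
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Series-like -> DataFrame-like
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # DataFrame-like -> Series-like
        return MySeries


class ToyGrouped:
    """Hypothetical groupby-like object; an illustration, not pandas' implementation."""

    def __init__(self, obj, keys):
        self.obj = obj          # the starting frame, as groupby objects store it
        self.keys = list(keys)  # one group label per row

    def sum(self):
        totals = {}
        for key, values in zip(self.keys, self.obj.to_numpy()):
            totals[key] = totals.get(key, 0) + values
        # Ask the original object how to build the result instead of
        # hardcoding DataFrame(...).
        return self.obj._constructor(
            list(totals.values()),
            index=list(totals),
            columns=self.obj.columns,
        )


df = MyDataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
grouped_sum = ToyGrouped(df, ["x", "y", "x", "y"]).sum()

column = df["a"]                 # built via _constructor_sliced
frame_again = column.to_frame()  # built via _constructor_expanddim

print(type(grouped_sum).__name__, type(column).__name__, type(frame_again).__name__)
```

Because the result is built from self.obj, the same toy class would return a plain DataFrame when given a plain DataFrame, so no special-casing of subclasses is needed.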
A branch with these changes implemented can be found on my fork here: https://github.com/alkasm/pandas/tree/groupby-preserve-subclass