
Pandas groupby, resample, etc. for subclassed DataFrame

Note: The thread below prompted a pull request which was eventually merged into pandas v1.1. This issue is now resolved.

I'm using a subclassed DataFrame so that I can have more convenient access to some transformation methods and metadata particular to my use case. Most DataFrame operations work as expected, in that they return an instance of the subclass rather than an instance of pandas.DataFrame. However, operations that go through DataFrame.groupby and DataFrame.resample return a plain pandas.DataFrame instead.

Is this a bug, or have I missed something when defining my subclass?

Below is a minimal example, tested on pandas 0.25.1:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

dates = pd.date_range('2019', freq='D', periods=365)
my_df = MyDataFrame(range(len(dates)), index=dates)

assert isinstance(my_df, MyDataFrame)
# Success!

assert isinstance(my_df.diff(), MyDataFrame)
# Success!

assert isinstance(my_df.sample(10), MyDataFrame)
# Success!

assert isinstance(my_df[:10], MyDataFrame)
# Success!

assert isinstance(my_df.resample("D").sum(), MyDataFrame)
# AssertionError

assert isinstance(my_df.groupby(my_df.index.month).sum(), MyDataFrame)
# AssertionError
grge asked Sep 04 '19 22:09



1 Answer

I don't know if it's a "bug" per se, but I agree that it should be changed regardless. If you take a look at some of the source code for groupby-type objects, you'll see a lot of hardcoded return DataFrame(...) and return Series(...).

As you rightly pointed out, pandas objects have three constructor properties that are used to build new versions of themselves:

  • _constructor to create objects of the same type
  • _constructor_sliced to create a series-like object from a dataframe-like object
  • _constructor_expanddim to create a dataframe-like object from a series-like object
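For illustration, here is a minimal sketch of a subclass pair that overrides these hooks, using the property names pandas documents for subclassing (_constructor, _constructor_sliced, _constructor_expanddim). MySeries is a hypothetical companion class that is not part of the question.

```python
import pandas as pd


class MySeries(pd.Series):
    # Hypothetical Series subclass, introduced only for this sketch.
    @property
    def _constructor(self):
        # Same-dimension results (e.g. arithmetic) stay MySeries.
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Series -> DataFrame conversions (e.g. to_frame) use this class.
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. column subsets) stay MyDataFrame.
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # DataFrame -> Series results (e.g. selecting one column) use this class.
        return MySeries


my_df = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

same_type = my_df[["a"]]       # goes through _constructor
sliced = my_df["a"]            # goes through _constructor_sliced
expanded = sliced.to_frame()   # goes through _constructor_expanddim

print(type(same_type).__name__, type(sliced).__name__, type(expanded).__name__)
```

Defining only _constructor, as in the question, keeps same-dimension results in the subclass but leaves single-column selections as plain pandas.Series.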

These can be used instead of the hardcoded types in core/groupby/generic.py, which is easy to do since the groupby objects store the starting NDFrame as the attribute obj.
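To make the difference concrete, the toy sketch below contrasts a hardcoded return type with one that defers to the input's _constructor. It is not the pandas groupby source; the helper names (_sums_by_key, grouped_sum_hardcoded, grouped_sum_preserving) are hypothetical and only mimic the pattern described above.

```python
import numpy as np
import pandas as pd


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame


def _sums_by_key(obj, keys):
    # Hypothetical helper: per-key column sums computed with NumPy only.
    keys = np.asarray(keys)
    labels = np.unique(keys)
    values = obj.to_numpy()
    sums = [values[keys == label].sum(axis=0) for label in labels]
    return sums, labels


def grouped_sum_hardcoded(obj, keys):
    # Mimics a hardcoded `return DataFrame(...)`: the subclass is lost.
    sums, labels = _sums_by_key(obj, keys)
    return pd.DataFrame(sums, index=labels, columns=obj.columns)


def grouped_sum_preserving(obj, keys):
    # Defers to the input object's constructor: the subclass is kept.
    sums, labels = _sums_by_key(obj, keys)
    return obj._constructor(sums, index=labels, columns=obj.columns)


my_df = MyDataFrame({"x": [1, 2, 3, 4]})
keys = ["a", "b", "a", "b"]

plain = grouped_sum_hardcoded(my_df, keys)
kept = grouped_sum_preserving(my_df, keys)

print(type(plain).__name__, type(kept).__name__)
```

The only change between the two functions is which class builds the result, which is why the fix in the groupby code is mostly a mechanical substitution.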

A branch with these changes implemented can be found on my fork here: https://github.com/alkasm/pandas/tree/groupby-preserve-subclass

alkasm answered Oct 16 '22 13:10