Pandas groupby, resample, etc for subclassed DataFrame

Tags: python, pandas, dataframe, subclassing

Note: The thread below prompted a pull request which was eventually merged into pandas v1.1.0. This issue is now resolved.

I'm using a subclassed DataFrame so that I can have more convenient access to some transformation methods and metadata particular to my use-case. Most of the DataFrame operations work as expected, in that they return an instance of the subclass, rather than an instance of pandas.DataFrame. However, aggregations performed after DataFrame.groupby and DataFrame.resample return a plain pandas.DataFrame instead of the subclass.

Is this a bug, or have I missed something when defining my subclass?

Below is a minimal example, tested on pandas 0.25.1:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

dates = pd.date_range('2019', freq='D', periods=365)
my_df = MyDataFrame(range(len(dates)), index=dates)

assert isinstance(my_df, MyDataFrame)
# Success!

assert isinstance(my_df.diff(), MyDataFrame)
# Success!

assert isinstance(my_df.sample(10), MyDataFrame)
# Success!

assert isinstance(my_df[:10], MyDataFrame)
# Success!

assert isinstance(my_df.resample("D").sum(), MyDataFrame)
# AssertionError

assert isinstance(my_df.groupby(my_df.index.month).sum(), MyDataFrame)
# AssertionError

asked Sep 04 '19 22:09

grge

1 Answer

I don't know if it's a "bug" per se, but I agree that it should be changed regardless. If you take a look at the source code for the groupby-type objects, you'll see a lot of hardcoded return DataFrame(...) and return Series(...).

As you rightly pointed out, pandas objects have constructor hooks for building new versions of themselves. There are three of them, all defined as properties:

_constructor to create objects of the same type
_constructor_sliced to create a series-like object from a dataframe-like object
_constructor_expanddim to create a dataframe-like object from a series-like object
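
For reference, here is a minimal sketch of how these hooks fit together, following the pattern in the pandas subclassing docs. In pandas they are properties named _constructor, _constructor_sliced and _constructor_expanddim. The class names MySeries and MyDataFrame are only illustrative.

```python
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. slicing a series) stay MySeries.
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Series -> DataFrame results (e.g. to_frame) become MyDataFrame.
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. row slicing) stay MyDataFrame.
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # DataFrame -> Series results (e.g. column selection) become MySeries.
        return MySeries


frame = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
column = frame["a"]            # goes through _constructor_sliced
expanded = column.to_frame()   # goes through _constructor_expanddim
head = frame[:2]               # goes through _constructor
```

Selecting a column, expanding it back to a frame, and slicing rows each go through a different hook, so all three results keep the subclass types.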

These can be used instead of the hardcoded types in core/groupby/generic.py, which is easy to do since the groupby objects store the starting NDFrame as the attribute obj.
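
If you are stuck on a pandas version without the fix, one possible workaround is to re-wrap the aggregated result through the same hook yourself. The sketch below assumes a hypothetical helper, grouped_sum, which is not part of pandas. It only illustrates the idea of asking the original object for its constructor instead of hardcoding DataFrame.

```python
import pandas as pd


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame


def grouped_sum(obj, keys):
    """Hypothetical helper: group, sum, then restore the caller's type."""
    result = obj.groupby(keys).sum()
    # Ask the original object how to build "an object like me" instead of
    # relying on whatever type the groupby machinery returned.
    return obj._constructor(result)


dates = pd.date_range("2019-01-01", freq="D", periods=365)
my_df = MyDataFrame({"value": range(len(dates))}, index=dates)

monthly = grouped_sum(my_df, my_df.index.month)
assert isinstance(monthly, MyDataFrame)
```

On versions where groupby already preserves the subclass, the extra wrap is redundant but harmless.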

A branch with these changes implemented can be found on my fork here: https://github.com/alkasm/pandas/tree/groupby-preserve-subclass

answered Oct 16 '22 13:10

alkasm
