How can I subclass a Pandas DataFrame?

Tags: python, pandas, dataframe, subclassing

Subclassing Pandas classes seems a common need, but I could not find references on the subject. (It seems that Pandas developers are still working on it: Easier subclassing #60.)

There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:

1. Calling standard DataFrame methods on instances of MyDF should produce instances of MyDF.
2. Calling standard DataFrame methods on instances of MyDF should leave all attributes still attached to the output.

(And are there any significant differences for subclassing pandas.Series?)

Code for subclassing pd.DataFrame:

    import numpy as np
    import pandas as pd

    class MyDF(pd.DataFrame):
        # how to subclass pandas DataFrame?
        pass

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))  # <class '__main__.MyDF'>

    # Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
    # should produce instances of MyDF.
    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

    # Requirement 2: Attributes attached to instances of MyDF, when calling standard
    # methods of DataFrame, should still attach to the output.
    mydf.myattr = 1
    mydf_cp1 = MyDF(mydf)
    mydf_cp2 = mydf.copy()
    print(hasattr(mydf_cp1, 'myattr'))  # False
    print(hasattr(mydf_cp2, 'myattr'))  # False


asked Mar 03 '14 19:03

Lei

1 Answer

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.

The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py

As in HYRY's answer, it seems there are two things you're trying to accomplish:

1. When calling methods on an instance of your class, return instances of the correct type (your type). For this, add a _constructor property that returns your type.
2. Attach attributes that carry over to copies of your object. To do this, list the names of these attributes in the special _metadata class attribute.

Here's an example:

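    from pandas import DataFrame

    class SubclassedDataFrame(DataFrame):
        # Names listed in _metadata are carried over to copies
        _metadata = ['added_property']
        added_property = 1  # This will be passed to copies

        @property
        def _constructor(self):
            # pandas uses this to build the results of DataFrame methods
            return SubclassedDataFrame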
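
To check this against the two requirements in the question, here is a minimal sketch that applies the same pattern to the question's MyDF class. The MySeries class and the _constructor_sliced / _constructor_expanddim properties are not part of the example above; they follow the pattern described in the pandas guide for keeping single columns in a custom Series type, and all class names here are illustrative.

```python
import numpy as np
import pandas as pd


class MySeries(pd.Series):
    """Illustrative Series subclass (name is an assumption, not from the answer)."""

    @property
    def _constructor(self):
        # Results of Series methods stay MySeries
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series is expanded into a frame, e.g. to_frame()
        return MyDF


class MyDF(pd.DataFrame):
    # Attribute names listed here are propagated to method results
    _metadata = ["myattr"]

    @property
    def _constructor(self):
        # Results of DataFrame methods stay MyDF
        return MyDF

    @property
    def _constructor_sliced(self):
        # Selecting a single column returns MySeries
        return MySeries


mydf = MyDF(np.arange(12.0).reshape(3, 4), columns=["A", "B", "C", "D"])
mydf.myattr = 1

sub = mydf[["A", "C"]]  # Requirement 1: column selection
cp = mydf.copy()        # Requirement 2: copy
col = mydf["A"]         # single column

print(type(sub).__name__)  # MyDF
print(sub.myattr)          # 1
print(type(cp).__name__)   # MyDF
print(cp.myattr)           # 1
print(type(col).__name__)  # MySeries
```

Because myattr is listed in _metadata, assigning mydf.myattr sets a plain Python attribute rather than being treated as a column, and pandas copies it onto results that it builds through _constructor.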

answered Oct 05 '22 21:10

cjrieds
