
How can I subclass a Pandas DataFrame?

Subclassing Pandas classes seems to be a common need, but I could not find references on the subject. (It seems that the Pandas developers are still working on it: Easier subclassing #60.)

There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:

  1. calling standard DataFrame methods on instances of MyDF should produce instances of MyDF
  2. calling standard DataFrame methods on instances of MyDF should leave all attributes still attached to the output

(And are there any significant differences for subclassing pandas.Series?)

Code for subclassing pd.DataFrame:

    import numpy as np
    import pandas as pd

    class MyDF(pd.DataFrame):
        # how to subclass pandas DataFrame?
        pass

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))  # <class '__main__.MyDF'>

    # Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
    # should produce instances of MyDF.
    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

    # Requirement 2: Attributes attached to instances of MyDF, when calling standard
    # methods of DataFrame, should still attach to the output.
    mydf.myattr = 1
    mydf_cp1 = MyDF(mydf)
    mydf_cp2 = mydf.copy()
    print(hasattr(mydf_cp1, 'myattr'))  # False
    print(hasattr(mydf_cp2, 'myattr'))  # False
asked Mar 03 '14 19:03 by Lei



1 Answer

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.

The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py

As in HYRY's answer, it seems there are two things you're trying to accomplish:

  1. When calling methods on an instance of your class, return instances of the correct type (your type). For this, add a _constructor property that returns your type.
  2. Add attributes that stay attached to copies of your object. To do this, list the names of these attributes in the special _metadata class attribute.

Here's an example:

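    from pandas import DataFrame

    class SubclassedDataFrame(DataFrame):
        # Names listed in _metadata are carried over to results by pandas
        _metadata = ['added_property']
        added_property = 1  # This will be passed to copies

        @property
        def _constructor(self):
            # Methods that build a new frame use this to pick the result type
            return SubclassedDataFrame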
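To connect this back to the two requirements in the question, here is a minimal sketch that applies the same pattern to the question's MyDF class and checks the results. The names MyDF and myattr come from the question; MySeries is a hypothetical companion class added for illustration, wired in through the _constructor_sliced and _constructor_expanddim properties described in the official guide. This assumes a reasonably recent pandas release; behavior in older versions may differ.

```python
import numpy as np
import pandas as pd


class MySeries(pd.Series):
    # Hypothetical companion class, so single columns keep a custom type too
    _metadata = ["myattr"]
    myattr = None  # class-level default, so the attribute always exists

    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series is expanded into a frame, e.g. to_frame()
        return MyDF


class MyDF(pd.DataFrame):
    _metadata = ["myattr"]
    myattr = None  # class-level default, so the attribute always exists

    @property
    def _constructor(self):
        return MyDF

    @property
    def _constructor_sliced(self):
        # Used when a single column or row is extracted from the frame
        return MySeries


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
mydf.myattr = 1

mydf_sub = mydf[["A", "C"]]  # Requirement 1: result type is preserved
mydf_cp = mydf.copy()        # Requirement 2: metadata follows the copy
col = mydf["A"]              # A single column comes back as MySeries

print(type(mydf_sub).__name__, mydf_sub.myattr)
print(type(mydf_cp).__name__, mydf_cp.myattr)
print(type(col).__name__)
```

Note that metadata is carried over by pandas' internal finalization step, which runs for methods such as copy() and column selection. Building a new object directly, as in MyDF(mydf), goes through the constructor instead, so it is safer not to assume that myattr survives that path.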
answered Oct 05 '22 21:10 by cjrieds