Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame Object Inheritance or Object Use?

I am building a library for working with very specific structured data and I am building my infrastructure on top of Pandas. Currently I am writing a bunch of different data containers for different use cases, such as CTMatrix for Country x Time Data etc. to house methods appropriate for all CountryxTime structured data.

I am currently debating between

Option 1: Object Inheritance

class CTMatrix(pd.DataFrame):
    methods etc. here

or Option 2: Object Use

class CTMatrix(object):
    _data = pd.DataFrame

    then use getter, setter methods to control access to _data etc. 

From a software engineering perspective is there an obvious choice here?

My thoughts so far are:

Option 1:

  1. Can use DataFrame methods directly on the CTMatrix Class (like CTmatrix.sort()) without having to support them via methods on the encapsulated _data object in Option #2
  2. Updates and New methods in Pandas are inherited, except for methods that may be overwritten with local class methods

BUT

  1. Complications with some methods such as __init__() and having to pass the attributes up to the superclass super(MyDF, self).__init__(*args, **kw)

Option 2:

  1. More control over the Class and it's behavior
  2. Possibly more resilient to updates in Pandas?

But

  1. Having to use a getter() or non-hidden attribute to use the object like a dataframe such as (CTMatrix.data.sort())

Are there any additional downsides for taking the approach in Option #1?

like image 811
sanguineturtle Avatar asked Jul 01 '14 07:07

sanguineturtle


People also ask

Is pandas DataFrame an object?

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

What type of object is a pandas DataFrame?

Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary data structure of the Pandas.

What does object mean pandas?

An object is a string in pandas so it performs a string operation instead of a mathematical one. If we want to see what all the data types are in a dataframe, use df.dtypes. df.

How do you extend a DataFrame in Python?

append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Columns not in the original dataframes are added as new columns and the new cells are populated with NaN value.


2 Answers

I would avoid subclassing DataFrame, because many of the DataFrame methods will return a new DataFrame and not another instance of your CTMatrix object.

There are a few of open issues on GitHub around this e.g.:

  • https://github.com/pydata/pandas/issues/60

  • https://github.com/pydata/pandas/issues/2485

More generally, this is a question of composition vs inheritance. I would be especially wary of benefit #2. It might seem great now, but unless you are keeping a close eye on updates to Pandas (and it is a fast moving target), you can easily end up with unexpected consequences and your code will end up intertwined with Pandas.

like image 77
Matti John Avatar answered Oct 04 '22 12:10

Matti John


Because of similar issues and Matti John's answer I wrote a _pandas_wrapper class for a project of mine, because I also wanted to inherit from pandas Dataframe.

https://github.com/mcocdawc/chemcoord/blob/bdfc186f54926ef356d0b4830959c51bb92d5583/src/chemcoord/_generic_classes/_pandas_wrapper.py

The only purpose of this class is to give a pandas DataFrame lookalike that is safe to inherit from.

If your project is LGPL licensed you can reuse it without problems.

like image 24
mcocdawc Avatar answered Oct 04 '22 10:10

mcocdawc