
pandas: Composition for chained methods like .resample(), .rolling() etc

I would like to construct an extension of pandas.DataFrame — let's call it SPDF — which could do stuff above and beyond what a simple DataFrame can:

import pandas as pd
import numpy as np

def to_spdf(func):
    """Transform generic output of `func` to SPDF.

    Returns
    -------
    wrapper : callable
    """
    def wrapper(*args, **kwargs):
        res = func(*args, **kwargs)
        return SPDF(res)

    return wrapper

class SPDF:
    """Special-purpose dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
    """

    def __init__(self, df):
        self.df = df

    def __repr__(self):
        return repr(self.df)

    def __getattr__(self, item):
        res = getattr(self.df, item)

        if callable(res):
            res = to_spdf(res)

        return res

if __name__ == "__main__":

    # construct a generic SPDF
    df = pd.DataFrame(np.eye(4))
    an_spdf = SPDF(df)

    # call .diff() to obtain another SPDF
    print(type(an_spdf.diff()))

Right now, DataFrame methods that return another DataFrame, such as .diff() in the MWE above, give me back an SPDF, which is great. However, I would also like chained calls such as .resample('M').last() or .rolling(2).mean() to produce an SPDF at the very end. So far I have failed: .rolling() and the like are callable, so my wrapper to_spdf tries to construct an SPDF from their intermediate output (a Rolling or Resampler object) without 'waiting' for .mean() or whatever the last part of the expression is. Any ideas how to tackle this problem?


asked Jul 11 '18 07:07 by Igor Pozdeev

1 Answer

You should subclass DataFrame properly. For methods that return a new object (such as diff) to give back your subclass rather than a plain DataFrame, the pandas documentation on subclassing says you must override the _constructor property (among other things).

You could do something like the following:

import numpy as np
from pandas import DataFrame

class SPDF(DataFrame):

    @property
    def _constructor(self):
        return SPDF
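
To see what this buys you for the chained calls from the question, a quick check along these lines may help. It is only a sketch: the daily index and the "2D" resampling frequency are made up for illustration, and the types are printed rather than assumed because the result of .rolling() and .resample() chains may depend on your pandas version.

```python
import numpy as np
import pandas as pd


class SPDF(pd.DataFrame):

    @property
    def _constructor(self):
        return SPDF


# Illustrative data: a daily index so that .resample() has something to bin.
idx = pd.date_range("2018-01-01", periods=4, freq="D")
spdf = SPDF(np.eye(4), index=idx)

results = {
    "diff": spdf.diff(),
    "rolling": spdf.rolling(2).mean(),
    "resample": spdf.resample("2D").last(),
}

for name, res in results.items():
    # Print the concrete type instead of assuming it.
    print(name, type(res).__name__)
```

If any of the chained results comes back as a plain DataFrame on your installation, wrapping it once more with SPDF(...) is a simple fallback.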

If you also need custom attributes to survive such methods (like diff), list their names in _metadata. Methods you define on the class need no such treatment, since they are always available:

class SPDF(DataFrame):
    _metadata = ['prop']
    prop = 1

    @property
    def _constructor(self):
        return SPDF

Notice the output is as desired:

df = SPDF(np.eye(4))
print([type(df)])
[<class '__main__.SPDF'>]
new = df.diff()
print([type(new)])
[<class '__main__.SPDF'>]
answered Oct 26 '22 23:10 by modesitt
