How do you specify a Pandas DataFrame schema/structure in a docstring?

I'd like to describe the DataFrame structure my Python function expects, but a verbal description like:

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        column 1 is called 'thing1' and it is of dtype 'i4'
    """
    ...

feels error-prone. Is there a conventional way to describe it? I don't see anything in the Pandas docstring documentation.

asked Mar 03 '19 17:03

jkmacc

1 Answer

Since there is no official standard, my answer is inevitably opinionated.

ANSWER

I suggest using a description based on the repr() of a Series because each dataframe can be described as a collection of series. It should also be based on the pandas docstring guide for developers.

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        Index:
            RangeIndex
        Columns:
            Name: Date, dtype: datetime64[ns]
            Name: Integer, dtype: int64
            Name: Float, dtype: float64
            Name: Object, dtype: object

    """

Example dataframe:

import pandas as pd

data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
        [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])

Output:

    Date        Integer     Float   Object
0   2020-01-01  1           1.1     A
1   2020-01-02  2           2.2     B

GENERAL DEFINITION

<dataframe name>: pandas.DataFrame
    Index:
        <__repr__ of Index>
            <Optional: Description of index data>
    Columns:
        <last line of __repr__ of pd.Series object of first column>
            <Optional: Description of column data>
        ...
        <last line of __repr__ of pd.Series object of last column>
            <Optional: Description of column data>
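
To illustrate where the "last line of __repr__" entries come from, here is a minimal sketch that collects them for every column. The helper name describe_columns is hypothetical, and it assumes unique column names. It keeps only the first row of each column before calling repr(), because pandas adds a "Length: N" field to the last line when a long Series is truncated.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame) -> str:
    """Return a 'Columns:' block built from each column's Series repr.

    Hypothetical helper for illustration; assumes unique column names.
    """
    lines = ["Columns:"]
    for col in df.columns:
        # Keep a single row so the repr is never truncated; a truncated
        # Series repr gains a "Length: N" field on its last line.
        last_line = repr(df[col].iloc[:1]).splitlines()[-1]
        lines.append(f"    {last_line}")
    return "\n".join(lines)


example = pd.DataFrame({
    "Integer": pd.Series([1, 2], dtype="int64"),
    "Float": pd.Series([1.1, 2.2], dtype="float64"),
})
print(describe_columns(example))
```

The printed block can be pasted under the parameter entry in the docstring, with the optional descriptions added by hand.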

EXPLANATION

Standardizing the description of tabular data has been discussed in detail, and standards such as ISO/IEC 11179, the JSON Table Schema and the W3C Tabular Data Model emerged from that discussion. However, they are not a perfect fit for describing a dataframe in a docstring. For example, they make you consider relationships with other tables, which matters for database applications but not for Pandas dataframes.

My proposed repr-based approach has several advantages:

It respects the opinion of the core developers of Pandas. The repr is what they decided we should see about the object.
It is efficient. Let's face it, documentation is difficult. Automation is very simple with this approach. An example can be found below.
It is evolving. If the repr ever changes, a regenerated docstring changes with it.
It is expandable. If you would like to include additional metadata, the dataframe object has many more attributes that you can include.

Example of an automatically generated docstring with additional metadata:

df = input.copy()
df = df.set_index('Date')

# Build the description line by line from the frame's own attributes.
docstring = 'Index:\n'
docstring += f'    {df.index}\n'
docstring += 'Columns:\n'
for col in df.columns:
    docstring += f'    Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'
print(docstring)

Output:

Index:
    DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
Columns:
    Name: Integer, dtype=int64, nullable: False
    Name: Float, dtype=float64, nullable: False
    Name: Object, dtype=object, nullable: False
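
If you want the generated text to end up in the function's docstring without copying it by hand, one option is a small decorator that appends it to __doc__. This is only a sketch under assumptions: describe_frame and with_frame_schema are hypothetical names, not part of pandas, and the template dataframe is one you supply yourself as a representative example of the expected input.

```python
import pandas as pd


def describe_frame(df: pd.DataFrame) -> str:
    """Build an 'Index:' / 'Columns:' description from a template frame."""
    lines = ["Index:", f"    {type(df.index).__name__}", "Columns:"]
    for col in df.columns:
        lines.append(f"    Name: {col}, dtype: {df[col].dtype}")
    return "\n".join(lines)


def with_frame_schema(template: pd.DataFrame):
    """Hypothetical decorator: append the template's description to __doc__."""
    def decorator(func):
        existing = (func.__doc__ or "").rstrip()
        func.__doc__ = existing + "\n\n" + describe_frame(template) + "\n"
        return func
    return decorator


# Template with explicit dtypes so the description is platform independent.
template = pd.DataFrame({
    "Integer": pd.Series([1, 2], dtype="int64"),
    "Float": pd.Series([1.1, 2.2], dtype="float64"),
})


@with_frame_schema(template)
def myfun(frame):
    """Does a thing."""
    return frame


print(myfun.__doc__)
```

The appended block is not indented to match a hand-written numpydoc section, so treat it as a starting point rather than a finished layout.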

answered Oct 24 '22 00:10

above_c_level

Tags: pandas, dataframe, docstring
