
How do you specify a Pandas DataFrame schema/structure in a docstring?

I'd like to describe the DataFrame structure my Python function expects, and a verbal description like:

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        column 1 is called 'thing1' and it is of dtype 'i4'
    """
    ...

feels error-prone. Is there a conventional way to describe it? I don't see anything in the pandas docstring documentation.

jkmacc asked Mar 03 '19 17:03



1 Answer

Since there is no official standard, my answer is inevitably opinionated.


ANSWER

I suggest using a description based on the repr() of a Series because each dataframe can be described as a collection of series. It should also be based on the pandas docstring guide for developers.

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        Index:
            RangeIndex
        Columns:
            Name: Date, dtype: datetime64[ns]
            Name: Integer, dtype: int64
            Name: Float, dtype: float64
            Name: Object, dtype: object

    """

Example dataframe:

import pandas as pd

data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
        [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])

Output:

    Date        Integer     Float   Object
0   2020-01-01  1           1.1     A
1   2020-01-02  2           2.2     B

GENERAL DEFINITION

<dataframe name>: pandas.DataFrame
    Index:
        <__repr__ of Index>
            <Optional: Description of index data>
    Columns:
        <last line of __repr__ of pd.Series object of first column>
            <Optional: Description of column data>
        ...
        <last line of __repr__ of pd.Series object of last column>
            <Optional: Description of column data>
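
The template above can be filled in mechanically. What follows is only a minimal sketch, not an established pandas feature: describe_schema is a hypothetical helper name, and it assumes unique column names and a non-empty dataframe. It takes head(1) of each column because the footer of a long, truncated Series repr also carries a Length entry. Categorical columns end their repr with a "Categories" line instead, so they would need separate handling.

```python
import pandas as pd


def describe_schema(df):
    """Build a schema description from the reprs (hypothetical helper)."""
    # First entry: the repr of the index, as in the general definition.
    lines = ["Index:", f"    {df.index!r}", "Columns:"]
    for col in df.columns:
        # The last line of a Series repr reads "Name: <name>, dtype: <dtype>".
        # head(1) keeps the repr short so no "Length: ..." entry appears.
        footer = repr(df[col].head(1)).splitlines()[-1]
        lines.append(f"    {footer}")
    return "\n".join(lines)


# Illustrative data only; any dataframe with unique column names works.
example = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, 2.2]})
print(describe_schema(example))
```

The printed text can be pasted under the parameter entry in the docstring, and the optional descriptions added by hand.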

EXPLANATION

There is a detailed discussion of how tabular data can be standardized. Standards such as ISO/IEC 11179, the JSON Table Schema, and the W3C Tabular Data Model emerged from it. However, none of them is a perfect fit for describing a dataframe in a docstring. For example, they make you consider relationships with other tables, which matters for database applications but not for pandas dataframes.

My proposed repr-based approach has several advantages:

  • It respects the opinion of the core developers of pandas. The repr is what they decided we should see about the object.
  • It is efficient. Let's face it, documentation is difficult. Automation is very simple with this approach. An example can be found below.
  • It is evolving. If the repr ever changes, the docstring changes with it.
  • It is expandable. If you would like to include additional metadata, the dataframe object has many more attributes that you can include.

Example of an automatically generated docstring with additional metadata:

df = input.copy()
df = df.set_index('Date')
# Build the description from the index repr and per-column metadata.
docstring = 'Index:\n'
docstring += f'    {df.index}\n'
docstring += 'Columns:\n'
for col in df.columns:
    docstring += f'    Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'
print(docstring)

Output:

Index:
    DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
Columns:
    Name: Integer, dtype=int64, nullable: False
    Name: Float, dtype=float64, nullable: False
    Name: Object, dtype=object, nullable: False
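
One possible way to keep a generated description in sync with the code is to inject it into the docstring when the function is defined. This is a rough sketch under my own assumptions rather than a pandas or numpydoc convention: with_schema, schema_lines and the <schema> placeholder are hypothetical names, and it relies on docstrings being present (they are stripped under python -OO).

```python
import pandas as pd


def schema_lines(df):
    """Return the schema description as a list of lines (hypothetical helper)."""
    lines = ["Index:", f"    {type(df.index).__name__}", "Columns:"]
    for col in df.columns:
        lines.append(f"    Name: {col}, dtype: {df[col].dtype}")
    return lines


def with_schema(example, placeholder="<schema>"):
    """Decorator that swaps the placeholder in a docstring for the schema."""
    def decorator(func):
        # Continuation lines get the same 8-space indent as the placeholder.
        text = ("\n" + " " * 8).join(schema_lines(example))
        func.__doc__ = func.__doc__.replace(placeholder, text)
        return func
    return decorator


# Illustrative example frame; in practice this would be a small sample
# of the data the function expects.
example = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, 2.2]})


@with_schema(example)
def myfun(input):
    """Does a thing.

    Parameters
    ----------
    input : pandas.DataFrame
        <schema>
    """
    return input


print(myfun.__doc__)
```

The trade-off is that the schema now appears only at runtime, for example in help(myfun), and is not visible when reading the source file.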
above_c_level answered Oct 24 '22 00:10