I'd like to describe the DataFrame structure my Python function expects, and a verbal description like:
    def myfun(input):
        """ Does a thing.

        Parameters
        ----------
        input : pandas.DataFrame
            column 1 is called 'thing1' and it is of dtype 'i4'
        """
        ...
feels error prone. Is there a conventional way to describe it? I don't see anything in the Pandas docstring documentation.
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Data is aligned in rows and columns, like a spreadsheet, a SQL table, or a dict of Series objects.
Declaring docstrings: docstrings are declared using '''triple single quotes''' or """triple double quotes""" just below the class, method, or function declaration. All functions should have a docstring.
ANSWER
Since there is no official standard, my answer is inevitably opinionated.
I suggest a description based on the repr() of a Series, because every DataFrame can be described as a collection of Series. The description should also follow the pandas docstring guide for developers.
    def myfun(input):
        """ Does a thing.

        Parameters
        ----------
        input : pandas.DataFrame
            Index:
                RangeIndex
            Columns:
                Name: Date, dtype: datetime64[ns]
                Name: Integer, dtype: int64
                Name: Float, dtype: float64
                Name: Object, dtype: object
        """
Example dataframe:
    import pandas as pd

    data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
            [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
    input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])
    print(input)
Output:
            Date  Integer  Float Object
    0 2020-01-01        1    1.1      A
    1 2020-01-02        2    2.2      B
GENERAL DEFINITION
    <dataframe name>: pandas.DataFrame
        Index:
            <__repr__ of Index>
            <Optional: Description of index data>
        Columns:
            <last line of __repr__ of pd.Series object of first column>
                <Optional: Description of column data>
            ...
            <last line of __repr__ of pd.Series object of last column>
                <Optional: Description of column data>
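To make the "last line of `__repr__`" part of the definition concrete, here is a minimal sketch that builds the Index/Columns block from a DataFrame. The helper name `describe_frame` and the sample data are assumptions made for illustration, and the sketch assumes unique column names and columns short enough that pandas does not add a `Length:` entry to the footer.

```python
import pandas as pd


def describe_frame(df: pd.DataFrame) -> str:
    """Hypothetical helper: render the Index/Columns block from reprs."""
    lines = ["Index:", f"    {df.index!r}", "Columns:"]
    for col in df.columns:
        # The last line of a Series repr is its footer,
        # which has the form "Name: <name>, dtype: <dtype>".
        footer = repr(df[col]).splitlines()[-1]
        lines.append(f"    {footer}")
    return "\n".join(lines)


# Sample data invented for this sketch.
example = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, 2.2]})
description = describe_frame(example)
print(description)
```

The optional description lines from the definition still have to be written by hand; the sketch only produces the parts that pandas can report on its own.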
EXPLANATION
There is a detailed discussion of how table data can be standardized. From this discussion, standards such as ISO/IEC 11179, the JSON Table Schema and the W3C Tabular Data Model emerged. However, they are not a perfect fit for describing a DataFrame in a docstring. For example, they ask you to consider relationships with other tables, which matters for database applications but not for pandas DataFrames.
My proposed repr-based approach has several advantages:
Example of an automatically generated docstring with additional metadata:
    df = input.copy()
    df = df.set_index('Date')
    docstring = 'Index:\n'
    docstring = docstring + f' {df.index}\n'
    docstring = docstring + 'Columns:\n'
    for col in df.columns:
        docstring = docstring + f' Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'
    print(docstring)
Output:
    Index:
     DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
    Columns:
     Name: Integer, dtype=int64, nullable: False
     Name: Float, dtype=float64, nullable: False
     Name: Object, dtype=object, nullable: False
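One possible way to use such generated text is to append it to a function's `__doc__`, so that `help()` shows it. The following is only a sketch: `column_lines`, `with_frame_doc` and the `template` frame are hypothetical names invented here, not part of pandas. Note also that `hasnans` only reports whether the template frame currently contains missing values, so the "nullable" flag is a hint rather than a guarantee.

```python
import pandas as pd


def column_lines(df: pd.DataFrame) -> str:
    """Hypothetical helper: one metadata line per column."""
    return "\n".join(
        f"    Name: {name}, dtype={series.dtype}, nullable: {series.hasnans}"
        for name, series in df.items()
    )


def with_frame_doc(template: pd.DataFrame):
    """Hypothetical decorator: append the column block to a docstring."""
    def decorator(func):
        # func.__doc__ is None when docstrings are stripped (python -OO).
        func.__doc__ = (func.__doc__ or "") + "\nColumns:\n" + column_lines(template)
        return func
    return decorator


# Template frame invented for this sketch; the None becomes NaN.
template = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, None]})


@with_frame_doc(template)
def myfun(input):
    """Does a thing."""
    return input


print(myfun.__doc__)
```

The trade-off is that the docstring is then built at import time from a sample frame, so the sample has to be kept representative of what the function really accepts.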