Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Type Annotating Pandas DataFrames

Tags:

python

pandas

If a function or method returns a Pandas DataFrame, how do you document the column names and column types? Is there a way to do this within Python's builtin type annotation or do you just use docstrings?

If you do just use docstrings, how do you format them to be as succinct as possible?

like image 640
Keiron Stoddart Avatar asked Jul 26 '19 19:07

Keiron Stoddart


People also ask

What is Dtype in pandas?

To check the data type in pandas DataFrame we can use the “dtype” attribute. The attribute returns a series with the data type of each column. And the column names of the DataFrame are represented as the index of the resultant series object and the corresponding data types are returned as values of the series object.

What is Dtype object in DataFrame?

DataFrame - dtypes property The dtypes property is used to find the dtypes in the DataFrame. This returns a Series with the data type of each column. The result's index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype.

Is PyArrow faster than pandas?

There's a better way. It's called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn't otherwise interfere with your data analysis pipeline. It's the best of both worlds, as you can still use Pandas for further calculations.

How do I change the type of DataFrame column?

You can change the column type in pandas dataframe using the df. astype() method.


1 Answers

Docstring format

I use the numpy docstring convention as a basis. If a function's input parameter or return parameter is a pandas dataframe with predetermined columns, then I add a reStructuredText-style table with column descriptions to the parameter description. As an example:

def random_dataframe(no_rows):
    """Return dataframe with random data.

    Parameters
    ----------
    no_rows : int
        Desired number of data rows.

    Returns
    -------
    pd.DataFrame
        Dataframe with with randomly selected values. Data columns are as follows:

        ==========  ==============================================================
        rand_int    randomly chosen whole numbers (as `int`)
        rand_float  randomly chosen numbers with decimal parts (as `float`)
        rand_color  randomly chosen colors (as `str`)
        rand_bird   randomly chosen birds (as `str`)
        ==========  ==============================================================

    """
    df = pd.DataFrame({
        "rand_int": np.random.randint(0, 100, no_rows),
        "rand_float": np.random.rand(no_rows),
        "rand_color": np.random.choice(['green', 'red', 'blue', 'yellow'], no_rows),
        "rand_bird": np.random.choice(['kiwi', 'duck', 'owl', 'parrot'], no_rows),
    })

    return df

Bonus: sphinx compatibility

The aforementioned docstring format is compatible with the sphinx autodoc documentation generator. This is how the docstring looks like in HTML documentation that was automatically generated by sphinx (using the nature theme):

sphinx docstring

like image 93
Xukrao Avatar answered Sep 28 '22 03:09

Xukrao