
Static typing/schema of a pandas dataframe

Tags: python, pandas

Is there a way to hint about a pandas DataFrame's schema "statically" so that we can get code completion, static type checking, and just general predictability during coding?

I wouldn't mind duplicating the schema info in code and type annotations for this to work.

So maybe something roughly like mypy comment type annotations:

df = pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1,2,3]})  # pd.schema: ('a': np.dtype(float)), ('B': np.dtype(int))

(or better yet have the schema specified in some external JSON file or such)

Then you can imagine things like df. auto-completing during coding to df.a or df.B, or mypy (and any other static code analyzer) being able to infer the type of df.B[0] and so on.

Although hopeful, I'm guessing this isn't really possible (or desired...). If so, what would be a good standard for writing reusable code that returns pd.DataFrames with specific columns? Imagine a function get_data() -> pd.DataFrame that returns data with columns that are known in advance. How would you make this transparent to a user of the function? Is there anything smarter or more standardized than just spelling it out in the function's docstring?

stav asked Apr 21 '19 21:04




2 Answers

pandera should be what you need.

A data validation library for scientists, engineers, and analysts seeking correctness.

dasons answered Oct 19 '22 15:10


This may be something you already know, but a reliable way to get the auto-completion you are after is to develop code "live" in Jupyter notebooks, which are very commonly used in data science. In your case it might be appropriate to instantiate a version of the DataFrame with the column types you are looking for at the top of the notebook; Jupyter will then provide autocomplete for the columns and types as you code. It has a big advantage over an IDE in terms of knowing what is in scope, because the DataFrame is actually loaded into memory while you are developing.
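
For instance, a cell along these lines at the top of the notebook could serve as that typed "template". This is only a sketch: the column names and dtypes are taken from the question's example, and the template variable is a name invented here.

```python
import pandas as pd

# Empty frame that only carries the expected column names and dtypes
# (names and dtypes assumed from the question's example).
template = pd.DataFrame(
    {
        "a": pd.Series(dtype="float64"),
        "B": pd.Series(dtype="int64"),
    }
)

# Once this cell has run, the live object is available for inspection
# and tab completion in later cells.
print(template.dtypes)
```

Because the object exists in the kernel, completion is driven by the actual columns rather than by any static annotation.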

Per above_c_level's comment, dataenforce looks promising for its connection with pytest (i.e. testing after the code is developed), but unless there are some fancy integrations with your IDE I don't think it will be able to match Jupyter's "live knowledge" of the object.

DaveB answered Oct 19 '22 13:10