Static typing/schema of a pandas dataframe

Tags: pandas

Is there a way to hint about a pandas DataFrame's schema "statically" so that we can get code completion, static type checking, and just general predictability during coding?

I wouldn't mind duplicating the schema info in code and type annotations for this to work.

So maybe something roughly like mypy comment type annotations:

df = pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1,2,3]})  # pd.schema: ('a': np.dtype(float)), ('B': np.dtype(int))

(or better yet have the schema specified in some external JSON file or such)

Then you can imagine things like df. auto-completing during coding to df.a or df.B, or mypy (and any other static code analyzer) being able to infer the type of df.B[0] and such.

Although hopeful, I'm guessing this isn't really possible (or desired...). If so, what would be a good standard for writing reusable code that returns pd.DataFrames with specific columns? Imagine a function get_data() -> pd.DataFrame that returns data with columns known in advance: how would you make this transparent to a user of the function? Is there anything smarter or more standardized than just spelling it out in the function's docstring?


asked Apr 21 '19 21:04

stav

2 Answers

pandera should be what you need.

A data validation library for scientists, engineers, and analysts seeking correctness.


answered Oct 19 '22 15:10

dasons

This may be something you already know, but a reliable way to get the auto-completion you are after is to develop code "live" in Jupyter notebooks, which are very commonly used in data science. In your case, it might be appropriate to instantiate a version of the DataFrame with the types you are looking for at the top of the notebook; Jupyter will then autocomplete the columns and types as you code. It has a big advantage over an IDE in knowing what is in scope, because the DataFrame is actually loaded into memory while you are developing.
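
A minimal sketch of that idea, assuming the two columns from the question ('a' as float, 'B' as int). The name `template` is only illustrative.

```python
import pandas as pd

# Empty "template" frame that carries only the column names and dtypes.
template = pd.DataFrame(
    {
        "a": pd.Series(dtype="float64"),
        "B": pd.Series(dtype="int64"),
    }
)

# Inspect the declared schema.
print(template.dtypes)
```

Once this cell has run, the notebook kernel holds a live object, so tab completion on `template.` should be able to offer the column names.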

Per above_c_level's comment, dataenforce looks promising for its connection with pytest (i.e., testing after the code is developed), but unless there are some fancy integrations with your IDE, I don't think it will be able to match Jupyter's "live knowledge" of the object.

answered Oct 19 '22 13:10

DaveB
