 

Pandas as fast data storage for Flask application

I'm impressed by the speed with which Pandas runs transformations and loads data, and by its ease of use, and I want to leverage these properties (amongst others) to model some largish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.

I'm currently using a Postgres database to store the data, but the import (from CSV files) is slow, tedious and error-prone, and getting the data out of the database and processing it is not much easier. The data is never changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
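For illustration, a minimal sketch of the import-once / load-many workflow this describes (file and key names are placeholders; pandas' HDF5 support requires PyTables to be installed):

    import pandas as pd

    # One-off import: parse the CSV once and persist it as HDF5.
    df = pd.read_csv("data.csv")
    df.to_hdf("datasets.h5", key="readings", mode="w")

    # On the compute nodes / in the Flask app: load it back in one call.
    df = pd.read_hdf("datasets.h5", key="readings")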

The questions are:

(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as DataFrames are (or at least should be) stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?

(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or Werkzeug) has a SimpleCache class, but, internally, it pickles the data and unpickles the cached data on access. I wonder whether this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache), and can concurrent requests (possibly from different processes) access the same cache)?
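For reference, a minimal sketch of how SimpleCache would be used here (at the time it lived in werkzeug.contrib.cache; it has since moved to the cachelib package). The dataset name and file path are placeholders:

    import pandas as pd
    from werkzeug.contrib.cache import SimpleCache  # moved to cachelib in Werkzeug >= 1.0

    cache = SimpleCache()  # in-memory and per-process: each Gunicorn worker gets its own copy

    def get_dataset(name):
        df = cache.get(name)
        if df is None:
            df = pd.read_hdf("datasets.h5", key=name)
            cache.set(name, df, timeout=3600)  # SimpleCache pickles the value internally
        return df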

I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.

asked Jul 09 '14 by orange

People also ask

Is pandas faster than a database?

pandas is faster for the following tasks: groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records), and loading data from disk (5x faster for >10k records, even better for smaller data).
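For illustration, the kind of groupby aggregation this comparison refers to, on a toy frame roughly in the size range mentioned in the question (column names are made up):

    import numpy as np
    import pandas as pd

    # Toy frame with ~200k rows.
    df = pd.DataFrame({
        "group": np.random.randint(0, 100, size=200000),
        "value": np.random.randn(200000),
    })

    # Mean and sum per group, computed in a single pass.
    stats = df.groupby("group")["value"].agg(["mean", "sum"])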

Is pandas apply faster than Iterrows?

By using apply and specifying 1 as the axis, we can run a function on every row of a DataFrame. This solution also uses looping to get the job done, but apply has been optimized better than iterrows, which results in faster runtimes.
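A toy comparison of the two approaches (frame and computation are made up; a fully vectorised variant is included for contrast, since it is usually faster than both):

    import pandas as pd

    df = pd.DataFrame({"a": range(10000), "b": range(10000)})

    # Row-wise loop with iterrows: every row is materialised as a Series.
    total_loop = sum(row["a"] + row["b"] for _, row in df.iterrows())

    # Same computation via apply with axis=1: still per-row, but less overhead.
    total_apply = df.apply(lambda row: row["a"] + row["b"], axis=1).sum()

    # Fully vectorised version, typically far faster than either of the above.
    total_vec = (df["a"] + df["b"]).sum()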

Is pandas DataFrame fast?

Pandas is all-around excellent, but it isn't particularly fast. When you're dealing with many computations and your processing method is slow, the program takes a long time to run.

Is pandas faster than PySpark?

Due to parallel execution on all cores of multiple machines, PySpark runs operations faster than Pandas, hence we often need to convert a Pandas DataFrame to PySpark (Spark with Python) for better performance. This is one of the major differences between Pandas and PySpark DataFrames.
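A minimal sketch of the hand-off this describes, assuming Spark 2.x+ (the app name and data are arbitrary):

    import pandas as pd
    from pyspark.sql import SparkSession

    # SparkSession is the Spark 2.x+ entry point.
    spark = SparkSession.builder.appName("pandas-interop").getOrCreate()

    pdf = pd.DataFrame({"x": [1, 2, 3, 1, 2]})

    # Hand the pandas DataFrame to Spark for distributed processing...
    sdf = spark.createDataFrame(pdf)

    # ...and collect the (small) result back into pandas.
    result = sdf.groupBy("x").count().toPandas()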


1 Answer

Answers to some aspects of what you're asking for:

It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
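For example, a minimal read_sql sketch (the connection string and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Adjust user/password/host/dbname for your setup.
    engine = create_engine("postgresql://user:password@localhost/mydb")

    # Pull a whole table (or any query) straight into a DataFrame.
    df = pd.read_sql("SELECT * FROM measurements", engine)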

If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in future, keep using the SQL server. The nice thing about the server is that, even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.

Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or an HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
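A minimal sketch of that memoization pattern, using the Flask-Cache API (its maintained successor, Flask-Caching, exposes the same interface); dataset names and paths are placeholders:

    import pandas as pd
    from flask import Flask
    from flask_cache import Cache  # successor package: flask_caching, same API

    app = Flask(__name__)
    cache = Cache(app, config={"CACHE_TYPE": "redis"})  # or "memcached"

    @cache.memoize(timeout=600)
    def load_dataset(name):
        # Cached across requests -- and shared across Gunicorn workers,
        # since redis runs as a separate process.
        return pd.read_hdf("datasets.h5", key=name)

Note that the backend stores a pickled copy of the DataFrame, so each cache hit deserialises a fresh copy per worker; given the immutability assumption in the question, that is a feature rather than a problem.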

You could, of course, also create a global variable where you create the Flask app, for example, and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.
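For completeness, the pattern being described would look something like this (a sketch only, with the caveats above; file, key and route names are made up):

    # app.py -- module-level "cache": loaded once per process at import time.
    import pandas as pd
    from flask import Flask

    app = Flask(__name__)

    # Every Gunicorn worker imports this module and loads its own copy;
    # workers don't share memory, so the data is duplicated per process.
    DATASETS = {
        "readings": pd.read_hdf("datasets.h5", key="readings"),
    }

    @app.route("/head/<int:n>")
    def head(n):
        # Read-only access; safe as long as nothing mutates the frame.
        return DATASETS["readings"].head(n).to_html()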

answered Sep 19 '22 by Midnighter