 

Django application having in memory a big Pandas object shared across all requests?

I have developed a Shiny application. When it starts, it loads, ONCE, around 4 GB of data tables. Then, people connecting to the app can use the interface and play with those data tables.

This application is nice but has some limitations. That's why I am looking for another solution.

My idea is to have Pandas and Django working together. This way, I could develop an interface and a RESTful API at the same time. But what I need is for all requests coming to Django to use Pandas data tables that were loaded once. Imagine if 4 GB of memory had to be loaded for every request... It would be horrible.

I have looked everywhere but couldn't find any way of doing this. I found this question: https://stackoverflow.com/questions/28661255/pandas-sharing-same-dataframe-across-the-request but it has no answers.

Why do I need to have the data in RAM? Because I need the performance to return the requested results quickly. I can't ask MariaDB to calculate and maintain that data, for example, as it involves calculations that only R, or a specialized package in Python or another language, can do.

asked Nov 12 '15 by FrelonQuai

2 Answers

I have a similar use case where I want to load (instantiate) a certain object only once and use it in all requests, since it takes some time (seconds) to load and I couldn't afford the lag it would introduce on every request.

I use a feature of Django >= 1.7, the AppConfig.ready() method, to load it only once.

Here is the code:

# apps.py
from django.apps import AppConfig
from sexmachine.detector import Detector

class APIConfig(AppConfig):
    name = 'api'

    def ready(self):
        # Singleton utility.
        # We load it here to avoid multiple instantiations across other
        # modules, which would take too much time.
        print("Loading gender detector...")
        global gender_detector
        gender_detector = Detector()
        print("ok!")

Then when you want to use it:

from api.apps import gender_detector
gender_detector.get_gender('John')

Load your data table in the ready() method and then use it from anywhere. Note that the table will be loaded once per WSGI worker process, so be careful with the total memory footprint here. A sketch adapting this to pandas follows.
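
For the pandas case specifically, a minimal sketch along the same lines might look like this (the app name, file path, and column names are made up for illustration):

# analytics/apps.py  (hypothetical app name)
import pandas as pd
from django.apps import AppConfig

# Module-level cache, filled once per process at startup.
DATAFRAMES = {}

class AnalyticsConfig(AppConfig):
    name = 'analytics'

    def ready(self):
        # Runs once per process, so each WSGI worker pays the
        # multi-gigabyte loading cost a single time.
        DATAFRAMES['main'] = pd.read_parquet('/path/to/big_table.parquet')

Then any view can compute over the already-loaded table:

# analytics/views.py
from django.http import JsonResponse
from analytics.apps import DATAFRAMES

def summary(request):
    df = DATAFRAMES['main']
    # 'value' is an assumed column name; cast to float so the
    # numpy scalar is JSON-serializable.
    return JsonResponse({'rows': len(df), 'mean': float(df['value'].mean())})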

answered Sep 18 '22 by dukebody


I may be misunderstanding the problem, but to me having a 4 GB data table that is readily accessible to users shouldn't be too much of a problem. Is there anything wrong with just loading the data one time upfront, as you described? 4 GB isn't much RAM these days.

Personally, I'd recommend you just use the database system itself instead of loading everything into memory and crunching it with Python. If you set up the data properly, you can issue many thousands of queries in just seconds. Pandas syntax closely mirrors SQL, so most of the code you're using can probably be translated directly. Just recently I had a situation at work where I set up a big join operation over a couple hundred files (~4 GB in total, 600k rows each) using pandas. The total execution time ended up being around 72 hours, and this was an operation that had to be run once an hour. A coworker rewrote the same Python/pandas code as a fairly simple SQL query that finished in 5 minutes instead of 72 hours.

Anyway, I'd recommend looking into storing your pandas DataFrame in an actual database table. Django is built on a database (usually MySQL or Postgres), and pandas even has a method to insert your DataFrame directly into one: DataFrame.to_sql(table_name, connection). From there you can write the Django code so that each response makes a single query to the database, fetches the values, and returns the data in a timely manner.
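
As a rough sketch of that workflow (the connection string, file path, and column names are assumptions, not from the answer):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to your database.
engine = create_engine('postgresql://user:password@localhost/mydb')

# One-time setup: push the DataFrame into a real table.
df = pd.read_csv('/path/to/big_table.csv')
df.to_sql('big_table', engine, if_exists='replace', index=False)

# At request time: a single targeted query instead of scanning
# 4 GB in Python.
result = pd.read_sql(
    'SELECT category, AVG(value) AS mean_value '
    'FROM big_table GROUP BY category',
    engine,
)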

answered Sep 19 '22 by chill_turner