 

Heroku Memory quota vastly exceeded in Django Project using SKLearn

I have deployed a Django application on Heroku with the goal of allowing trusted, known, internal users to upload a CSV file and click "Run"; behind the scenes, the Django app:

  1. loads a saved sklearn pipeline .pkl model (120 MB size, let's say)
  2. reads the user's CSV data using Pandas
  3. calls predict on the model using CSV data as input
  4. outputs file to S3 using Django Storages

This works for small CSV files, but causes Heroku's "Memory quota vastly exceeded" (R15) error if a user uploads a large CSV file... and it makes sense that larger CSV files are going to increase memory consumption.

I'm not sure what to adjust. Has anyone out there experienced a similar scenario when deploying sklearn models, and how did they "solve" it?

Ideas I have are:

  1. Identify memory leaks? I really have no idea where to start on this one. Django DEBUG is set to False.
  2. Change my Celery task to process the uploaded file in chunks instead?
  3. Make a smaller SKLearn pipeline file somehow with joblib (I already use compress=1; see the sketch after this list)?
  4. Increase Heroku dynos? Workers?
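For idea 3, this is the kind of change I have in mind when re-saving the pipeline (a minimal sketch; pipeline stands in for my fitted sklearn pipeline, and the compress level is only an example):

import joblib

# Re-save the fitted pipeline with a higher compression level (0-9).
# This shrinks the .pkl on disk, but the model still expands to roughly
# the same size once loaded into memory, so it may not help much at runtime.
joblib.dump(pipeline, "models/pipeline.pkl", compress=3)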

My Django models.py looks like this:

from django.db import models
from django.urls import reverse


class MLModel(models.Model):
    name = models.CharField(max_length=80)
    file = models.FileField(upload_to="models/")
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.name


class Upload(models.Model):
    name = models.CharField(max_length=100)
    mlmodel = models.ForeignKey(MLModel, on_delete=models.CASCADE)
    file = models.FileField(upload_to='data/')

    def __str__(self):
        return self.name

    def get_absolute_url(self):
        return reverse('edit', kwargs={'pk': self.pk})

My Celery task looks like this:

from io import StringIO
import joblib
import pandas as pd
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from .models import Upload  # assuming Upload lives in this app's models.py (shown above)


@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    # Load the saved sklearn pipeline directly from its storage backend
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    data = pd.read_csv(instance.file)
    data['Predicted'] = model.predict(data)

    buffer = StringIO()
    data.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))

Heroku logs:

2018-04-12T06:12:53.592922+00:00 app[worker.1]: [2018-04-12 06:12:53,592: INFO/MainProcess] Received task: predictions.tasks.piparoo[f1ca09e1-6bba-4115-8989-04bb32d4f08e]
2018-04-12T06:12:53.737378+00:00 heroku[router]: at=info method=GET path="/predict/" host=tdmpredict.herokuapp.com request_id=ffad9785-5cb6-4e3c-a87c-94cbca47d109 fwd="24.16.35.31" dyno=web.1 connect=0ms service=33ms status=200 bytes=6347 protocol=https
2018-04-12T06:13:08.054486+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2018-04-12T06:13:08.054399+00:00 heroku[worker.1]: Process running mem=572M(111.9%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Error R15 (Memory quota vastly exceeded)
2018-04-12T06:13:28.026765+00:00 heroku[worker.1]: Process running mem=1075M(210.1%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Stopping process with SIGKILL
2018-04-12T06:13:28.187650+00:00 heroku[worker.1]: Process exited with status 137
2018-04-12T06:13:28.306221+00:00 heroku[worker.1]: State changed from up to crashed
asked Apr 12 '18 by Jarad


1 Answer

The solution that resolved my problem (in a common-sense way):

Instead of reading the user's CSV file into memory all at once, I process it in chunks using Pandas' chunksize parameter and then concatenate the list of dataframes into one at the end. I also delete the model (120 MB) in an attempt to free up that memory for future processes.

My Celery task now looks like this:

# (imports are the same as in the original task above)
@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))

    final = []
    # Read and predict the CSV in 5,000-row chunks instead of loading it all at once
    for chunk in pd.read_csv(instance.file, chunksize=5000):
        chunk['Predicted'] = model.predict(chunk)
        final.append(chunk)

    # Drop the ~120 MB model before assembling the output
    del model
    final = pd.concat(final)

    buffer = StringIO()
    final.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))
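
If memory were still tight, a further variant (not something I needed for my fix, just a sketch) would be to write each predicted chunk straight into the CSV buffer instead of concatenating all the chunks at the end, so only one dataframe chunk is held at a time:

buffer = StringIO()
for i, chunk in enumerate(pd.read_csv(instance.file, chunksize=5000)):
    chunk['Predicted'] = model.predict(chunk)
    # Only the first chunk writes the header so the output stays one valid CSV
    chunk.to_csv(buffer, index=False, header=(i == 0))
del model
content = buffer.getvalue().encode('utf-8')
default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))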
answered Oct 19 '22 by Jarad