I have deployed a Django application on Heroku with the goal of allowing trusted, known, internal users to upload a CSV file, click "Run", and behind the scenes, the Django app:

1. loads a .pkl model (120 MB in size, let's say)
2. calls predict on the model using the CSV data as input

This works for small CSV files, but causes Memory quota vastly exceeded if a user uploads a large CSV file... and it makes sense that larger CSV files are going to increase memory consumption.
I'm not sure where to adjust. I'm wondering if someone out there has experienced a similar scenario when deploying sklearn models and how they "solved" it. The only idea I have so far is making sure DEBUG is set to False, which it already is.
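For reference, on Heroku DEBUG typically ends up driven by a config var, roughly like this (a simplified sketch; the DJANGO_DEBUG variable name is illustrative, not my exact settings):

# settings.py (sketch)
import os

DEBUG = os.environ.get('DJANGO_DEBUG', 'False') == 'True'  # stays False on Heroku unless explicitly overridden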
My Django models.py looks like this:
from django.db import models
from django.urls import reverse


class MLModel(models.Model):
    name = models.CharField(max_length=80)
    file = models.FileField(upload_to="models/")
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.name


class Upload(models.Model):
    name = models.CharField(max_length=100)
    mlmodel = models.ForeignKey(MLModel, on_delete=models.CASCADE)
    file = models.FileField(upload_to='data/')

    def __str__(self):
        return self.name

    def get_absolute_url(self):
        return reverse('edit', kwargs={'pk': self.pk})
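For completeness, the "Run" button ends up queuing the Celery task along these lines (a simplified sketch; the form and view names are illustrative rather than my exact code):

# views.py (illustrative sketch)
from django.shortcuts import redirect, render

from .forms import UploadForm   # hypothetical ModelForm over Upload
from .tasks import piparoo


def run_prediction(request):
    if request.method == 'POST':
        form = UploadForm(request.POST, request.FILES)
        if form.is_valid():
            upload = form.save()
            piparoo.delay(upload.id)  # heavy lifting happens on the worker dyno
            return redirect(upload.get_absolute_url())
    else:
        form = UploadForm()
    return render(request, 'predict.html', {'form': form})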
My celery task looks like this:
from io import StringIO

import joblib
import pandas as pd
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

from .models import Upload

@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    # ~120 MB model loaded into memory, plus the full CSV below
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    data = pd.read_csv(instance.file)
    data['Predicted'] = model.predict(data)
    buffer = StringIO()
    data.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))
Heroku logs:
2018-04-12T06:12:53.592922+00:00 app[worker.1]: [2018-04-12 06:12:53,592: INFO/MainProcess] Received task: predictions.tasks.piparoo[f1ca09e1-6bba-4115-8989-04bb32d4f08e]
2018-04-12T06:12:53.737378+00:00 heroku[router]: at=info method=GET path="/predict/" host=tdmpredict.herokuapp.com request_id=ffad9785-5cb6-4e3c-a87c-94cbca47d109 fwd="24.16.35.31" dyno=web.1 connect=0ms service=33ms status=200 bytes=6347 protocol=https
2018-04-12T06:13:08.054486+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2018-04-12T06:13:08.054399+00:00 heroku[worker.1]: Process running mem=572M(111.9%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Error R15 (Memory quota vastly exceeded)
2018-04-12T06:13:28.026765+00:00 heroku[worker.1]: Process running mem=1075M(210.1%)
2018-04-12T06:13:28.026973+00:00 heroku[worker.1]: Stopping process with SIGKILL
2018-04-12T06:13:28.187650+00:00 heroku[worker.1]: Process exited with status 137
2018-04-12T06:13:28.306221+00:00 heroku[worker.1]: State changed from up to crashed
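To narrow down where the memory climbs inside the task itself (rather than only seeing the R14/R15 messages in the logs), something like this can be dropped into the task body (a sketch using only the standard library; log_memory is just a helper name I made up):

import resource

def log_memory(tag):
    # ru_maxrss is reported in kilobytes on Linux dynos
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('{}: peak RSS ~{} MB'.format(tag, peak_kb // 1024))

Calling it right after joblib.load and again after model.predict shows which step pushes the worker over the quota.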
Solution that resolved my problem (in a common-sense way).
Instead of reading the user's CSV file into memory all at once, process it in chunks using Pandas' chunksize parameter, then concatenate the list of DataFrames into one at the end. I also delete the model (120 MB) in an attempt to free up that memory for future processes.
My celery task now looks like this:
@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    final = []
    # Predict in 5,000-row chunks so the whole CSV is never held in memory at once.
    for chunk in pd.read_csv(instance.file, chunksize=5000):
        chunk['Predicted'] = model.predict(chunk)
        final.append(chunk)
    # Drop the 120 MB model reference so its memory can be reclaimed.
    del model
    final = pd.concat(final)
    buffer = StringIO()
    final.to_csv(buffer, index=False)
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))
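If memory were still tight, a further tweak would be to skip the pd.concat entirely and append each chunk to the CSV buffer as soon as it is predicted (a sketch of the same task with that change; I have not benchmarked it against my real data):

@shared_task
def piparoo(id):
    instance = Upload.objects.get(id=id)
    model = joblib.load(instance.mlmodel.file.storage.open(instance.mlmodel.file.name))
    buffer = StringIO()
    first = True
    for chunk in pd.read_csv(instance.file, chunksize=5000):
        chunk['Predicted'] = model.predict(chunk)
        # Stream this chunk straight into the output CSV; write the header only once.
        chunk.to_csv(buffer, index=False, header=first)
        first = False
    del model
    content = buffer.getvalue().encode('utf-8')
    default_storage.save('output/results_{}.csv'.format(id), ContentFile(content))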