 

How to pass large chunks of data to Celery

I am using a Celery worker to get results from my machine learning models.

I am sending big numpy arrays (a few megabytes) from the client to a Celery task and back.

Currently I serialize the numpy arrays as base64 on the client. When I store and fetch the data directly in Redis (from the client or from the Celery worker), the system is much faster than when I let Celery pass the arguments itself (the base64-encoded numpy data).

I would like to use Celery (with the Redis broker) to pass the args/numpy arrays as well, instead of talking to Redis directly from the client. Do you know where the problem could be? How can I configure Celery so that passing the data from client to broker to worker and back to the client is more efficient?

        # flatten the batch, base64-encode it, and send the whole payload through the broker
        serialized = np.asarray(images).reshape((number_of_records, size)).ravel().tostring()
        serialized = base64.b64encode(serialized)
        #self.redis.set(key, serialized)

        print('calling celery processor')
        result = self.celery.send_task('process', args=[number_of_records, serialized], kwargs={})
        returncode, result = result.get(timeout=1000, interval=0.1)

vs. (this is faster, using the Redis storage directly):

        # store the payload in Redis ourselves and pass only the key through the broker
        serialized = np.asarray(images).reshape((number_of_records, size)).ravel().tostring()
        serialized = base64.b64encode(serialized)
        self.redis.set(key, serialized)

        print('calling celery processor')
        result = self.celery.send_task('process', args=[number_of_records, key], kwargs={})
        returncode, result = result.get(timeout=1000, interval=0.1)

        resultc = self.redis.get(key)  # read the result back directly from Redis
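
For reference, a worker-side counterpart of this key-passing variant might look like the sketch below. The task name 'process', the Redis connection details, the uint8 dtype, and writing the result back under the same key are all assumptions made for illustration, not my actual code:

    # hypothetical worker side of the "pass only the key" approach
    import base64

    import numpy as np
    import redis
    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0',
                 backend='redis://localhost:6379/0')
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    @app.task(name='process')
    def process(number_of_records, key):
        raw = base64.b64decode(r.get(key))           # fetch the payload straight from Redis
        images = np.frombuffer(raw, dtype=np.uint8)  # dtype/shape are assumptions
        images = images.reshape((number_of_records, -1))
        result = images.mean(axis=1)                 # stand-in for the real model call
        r.set(key, base64.b64encode(result.tobytes()))  # write the result back under the same key
        return 0, key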

Any tips on Celery performance, serialization, configuration settings, ...? I would like to keep this system fast and simple. Should I really use Redis directly, as in the second example?

Cospel asked Jan 31 '17


1 Answer

Celery uses JSON or pickle (cPickle) to serialize messages. So what might be happening is that you are serializing twice: first to base64, which inflates the payload by about a third, and then to either JSON or pickle.
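
To see why this matters, here is a rough sketch comparing payload sizes; the 1000x1000 float64 array is an arbitrary example, not your data:

    # rough size comparison of raw bytes vs. base64 vs. base64 wrapped in JSON
    import base64
    import json

    import numpy as np

    arr = np.random.rand(1000, 1000)   # ~8 MB of raw float64 data
    raw = arr.tobytes()
    b64 = base64.b64encode(raw)
    wrapped = json.dumps({'data': b64.decode('ascii')})  # roughly what ends up in the message body

    print(len(raw))      # 8000000 bytes
    print(len(b64))      # 10666668 bytes, +33% from base64
    print(len(wrapped))  # slightly larger again after JSON wrapping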

Have you tried skipping the base64 encoding completely and just letting Celery handle it?

You can tell Celery to use pickle (which handles binary data far more efficiently) instead of JSON (the default) with this configuration:

    # allow pickled content (Celery accepts only JSON by default) and switch both serializers
    app.conf.accept_content = ['pickle']
    app.conf.task_serializer = 'pickle'
    app.conf.result_serializer = 'pickle'
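
With pickle accepted on both the client and the worker, you could then skip base64 entirely and pass the numpy array (or its raw bytes) as a plain task argument. A minimal sketch, where the task name, broker URL, and the dummy computation are assumptions:

    # minimal sketch assuming pickle is enabled as above and a worker is running
    import numpy as np
    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0',
                 backend='redis://localhost:6379/0')
    app.conf.accept_content = ['pickle']
    app.conf.task_serializer = 'pickle'
    app.conf.result_serializer = 'pickle'

    @app.task(name='process')
    def process(images):
        # pickle carries the ndarray directly, dtype and shape included
        return images.mean(axis=1)

    # client side
    images = np.random.rand(100, 1024)   # stand-in for the real data
    result = app.send_task('process', args=[images])
    print(result.get(timeout=1000, interval=0.1))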
geniass answered Nov 01 '22