
How to serialize binary files to use with a Celery task

I recently integrated Celery (django-celery, to be more specific) into one of my applications. The application has the following model.

from django.db import models


class UserUploadedFile(models.Model):
    original_file = models.FileField(upload_to='/uploads/')
    txt = models.FileField(upload_to='/uploads/')
    pdf = models.FileField(upload_to='/uploads/')
    doc = models.FileField(upload_to='/uploads/')

    def convert_to_others(self):
        # Code to convert the original file to other formats
        pass

Once a user uploads a file, I want to convert the original file to txt, pdf and doc formats. Calling the convert_to_others method is fairly expensive, so I plan to run it asynchronously using Celery. I wrote a simple Celery task as follows.

@celery.task(default_retry_delay=bdev.settings.TASK_RETRY_DELAY)
def convert_ufile(file, request):
    """ 
    This task method would call a UserUploadedFile object's convert_to_others
    method to do the file conversions.

    The best way to call this task would be doing it asynchronously
    using apply_async method.
    """
    try:
        file.convert_to_others()
    except Exception as err:
        # If the task fails log the exception and retry in 30 secs
        log.LoggingMiddleware.log_exception(request, err)
        convert_ufile.retry(exc=err)
    return True

and then called the task as follows:

ufile = get_object_or_404(models.UserUploadedFile, pk=id)
tasks.convert_ufile.apply_async(args=[ufile, request])

Now when the apply_async method is called it raises the following exception:

PicklingError: Can't pickle <type 'cStringIO.StringO'>: attribute lookup cStringIO.StringO failed

I think this is because Celery uses the pickle library by default to serialize task arguments, and pickle is not able to serialize the binary file.

Question

Are there any other serializers that can serialize a binary file on their own? If not, how can I serialize a binary file using the default pickle serializer?

Amyth asked Jan 02 '13 07:01


1 Answer

You are correct: Celery is trying to pickle data that pickle does not support. Even if you found a way to serialize the data you want to send to the Celery task, I wouldn't do it.
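As a rough illustration of the failure, here is a minimal sketch that uses only the standard library. It assumes Python 3 (the question's traceback comes from Python 2's cStringIO, so the exact exception type and message differ), and the Upload class is a hypothetical stand-in, not the asker's model. It shows that an object holding an open file handle cannot be pickled, while its integer id can.

```python
import os
import pickle
import tempfile


class Upload:
    """Hypothetical stand-in for an object that wraps an open file handle."""

    def __init__(self, pk, handle):
        self.pk = pk
        self.handle = handle


# Create an empty temporary file so there is a real handle to work with.
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "rb") as handle:
    upload = Upload(pk=42, handle=handle)
    try:
        # pickle walks the instance attributes and fails on the file handle.
        pickle.dumps(upload)
        pickled_object = True
    except (TypeError, pickle.PicklingError):
        pickled_object = False

    # A plain integer id survives the round trip without any trouble.
    pk_round_trip = pickle.loads(pickle.dumps(upload.pk))

os.remove(path)
```

Here pickled_object ends up False and pk_round_trip holds the original id, which is why the id is the safer thing to put in the message.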

It is always a good idea to send as little data as possible to Celery tasks, so in your case I would pass only the id of the UserUploadedFile instance. The task can then fetch the object by id and call convert_to_others().

Please also note that the object could change its state (or even be deleted) before the task is executed, so it is much safer to fetch the object inside the Celery task than to send a full copy of it.

To sum up, sending only the instance id and refetching the object in the task gives you a few things:

  • You send less data to your queue.
  • You do not have to deal with data inconsistency issues.
  • It's actually possible in your case. :)

The only 'drawback' is that you need to perform an extra, inexpensive SELECT query to refetch the data, which overall looks like a good deal compared to the issues above, doesn't it?
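The pattern described above can be sketched roughly as follows. This is a standard-library simulation, not the real task: FAKE_TABLE and FakeUploadedFile are hypothetical stand-ins for the database table and the UserUploadedFile model, and pickle stands in for the message serializer. In the actual project the function would keep its @celery.task decorator and the lookup would be something like UserUploadedFile.objects.get(pk=ufile_id).

```python
import pickle

# Hypothetical stand-in for the database table behind UserUploadedFile.
FAKE_TABLE = {}


class FakeUploadedFile:
    """Hypothetical stand-in for the UserUploadedFile model."""

    def __init__(self, pk):
        self.pk = pk
        self.converted = False

    def convert_to_others(self):
        # Placeholder for the expensive conversion work.
        self.converted = True


def convert_ufile(ufile_id):
    """Task body: receives only the primary key and refetches the row."""
    ufile = FAKE_TABLE.get(ufile_id)
    if ufile is None:
        # The row was deleted between enqueueing and execution.
        return False
    ufile.convert_to_others()
    return True


# The message payload is just a list with one integer, which serializes fine.
FAKE_TABLE[1] = FakeUploadedFile(pk=1)
payload = pickle.dumps([1])
args = pickle.loads(payload)

result = convert_ufile(*args)   # row exists, conversion runs
missing = convert_ufile(999)    # row is gone, task returns early
```

With this shape, the call in the view would become something like tasks.convert_ufile.apply_async(args=[ufile.pk]). The request object would also be dropped from the arguments, since it is probably not picklable either; any details needed for logging would have to be passed as plain values.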

dzida answered Sep 19 '22 11:09