
In Python Celery, how do I persist objects across consecutive worker calls?

Tags: python, celery

I'm using Celery to automate some screen scraping. Each task uses Selenium to open a Chrome webdriver, manipulate the page, save some data, and then move on to the next page in the queue. The problem is that the webdriver is started and torn down for every task in the queue, which is very time-consuming and resource-intensive.

How do I persist objects across calls? I've read a bit about connection pooling in Celery, but it's not clear to me how it works. Where do I create the webdriver: in the tasks file or in the main queueing file? If the latter, how do the workers know which webdriver to use?

Example:

scrape.py:

from tasks import scrape

for row in rows:  # rows: the iterable of page records to scrape
  scrape.delay(str(row['product_id']), str(row['pg_code']))

tasks.py:

@app.task  # assumes a Celery app instance named `app`
def scrape(product_id, pg_code):
  # do some stuff
  ...

jwoww asked Nov 05 '13 02:11





1 Answer

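A task is not instantiated for every call. Each worker process registers it once as a single global instance, so you can cache the web driver as an attribute of the task object and reuse it across calls. The documentation specifically suggests this approach for expensive resources: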

http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation

joshua answered Oct 13 '22 00:10
