I'm using Celery to automate some screen scraping. I'm using Selenium to open up a Chrome webdriver, manipulate the page, save some data, and then move on to the next page in the queue. The problem is that it builds up and breaks down the web driver for every task in the queue, which is very time consuming and resource intensive.
How do I persist objects across calls? I've read some things about connection pooling in Celery, but it's not clear to me how exactly this works - where do I build up the webdriver - in the tasks file or in the main queueing file? If the latter, how do the workers know which webdriver to use?
Example:

scrape.py:

    for row in rows:
        scrape.delay(str(row['product_id']), str(row['pg_code']))

tasks.py:

    @app.task
    def scrape(product_id, pg_code):
        # do some stuff
The Celery worker then waits for tasks to arrive on the queue before executing them. This demonstrates how Celery uses Redis to distribute tasks across multiple workers and to manage the task queue.
Queues created by Celery are persistent by default. This means that the broker will write messages to disk to ensure that the tasks will be executed even if the broker is restarted.
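If that durability isn't needed (for example, if the scrape jobs are cheap to re-queue), the default can be overridden. A minimal sketch, assuming Celery's newer lowercase setting names (older versions spell it CELERY_DEFAULT_DELIVERY_MODE) and an illustrative RabbitMQ broker URL:

    from celery import Celery

    # Broker URL is illustrative; with an AMQP broker such as RabbitMQ the
    # delivery mode below controls whether messages are written to disk.
    app = Celery('tasks', broker='amqp://guest@localhost//')

    # 'persistent' (the default) survives a broker restart;
    # 'transient' keeps messages in memory only, trading durability for speed.
    app.conf.task_default_delivery_mode = 'transient'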
If you look at the Celery docs on tasks, you'll see that to call a task synchronously you use the apply() method as opposed to the apply_async() method. The docs also note that: "If the CELERY_ALWAYS_EAGER setting is set, it will be replaced by a local apply() call instead."
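For example, with the scrape task above (the argument values are made up for illustration):

    # apply_async() sends the task to the broker; a worker runs it later.
    scrape.apply_async(args=('12345', 'PG1'))   # same as scrape.delay('12345', 'PG1')

    # apply() runs the task synchronously in the current process and blocks.
    result = scrape.apply(args=('12345', 'PG1'))
    print(result.get())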
As for --concurrency: Celery by default uses multiprocessing to perform concurrent execution of tasks. The number of worker processes/threads can be changed using the --concurrency argument, and defaults to the number of available CPUs if not set.
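For example, assuming the tasks module above, a worker could be started with a fixed pool size instead of one process per CPU:

    # Four prefork (multiprocessing) child processes handle tasks in parallel.
    celery -A tasks worker --loglevel=info --concurrency=4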
Since each worker instantiates the task only once (it behaves like a singleton), you can cache the webdriver on the task object. The documentation specifically suggests this approach:
http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation
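A minimal sketch of that pattern (the broker URL, the example.com URL, and the ScrapeTask name are assumptions for illustration; Chrome and chromedriver must be installed): the driver is created lazily, cached on the task instance, and reused for every task that the worker process executes.

    from celery import Celery, Task
    from selenium import webdriver

    app = Celery('tasks', broker='redis://localhost:6379/0')

    class ScrapeTask(Task):
        _driver = None

        @property
        def driver(self):
            # Built on first use, then cached for the lifetime of the
            # worker process instead of once per task.
            if self._driver is None:
                self._driver = webdriver.Chrome()
            return self._driver

    @app.task(base=ScrapeTask, bind=True)
    def scrape(self, product_id, pg_code):
        # Reuse the cached driver to load and scrape the page.
        self.driver.get('http://example.com/%s/%s' % (product_id, pg_code))
        # do some stuff

With the prefork pool, each child process keeps its own cached driver, so --concurrency also caps how many Chrome instances are open at once. The driver is never quit here, so in practice you would hook a worker shutdown signal (e.g. worker_process_shutdown) to call driver.quit().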