I'm trying to make a Scrapy scraper work on Cloud Run. The main idea is that every 20 minutes a Cloud Scheduler cron should trigger the web scraper and get data from different sites. All sites have the same structure, so I would like to use the same code and parallelize the execution of the scraping job, doing something like scrapy crawl scraper -a site=www.site1.com and scrapy crawl scraper -a site=www.site2.com.
I have already deployed a version of the scraper, but it can only run scrapy crawl scraper. How can I make the site in that command change at execution time?
Also, should I be using a Cloud Run job or a service?
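For context, the spider picks up the site roughly along these lines (a simplified sketch, not my real code; the class name and selectors are placeholders):

```python
import scrapy


class SiteSpider(scrapy.Spider):
    # scrapy crawl scraper -a site=www.site1.com passes site to __init__
    name = "scraper"

    def __init__(self, site=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [f"https://{site}"]

    def parse(self, response):
        # all sites share the same structure, so one parse method covers them
        yield {"url": response.url, "title": response.css("title::text").get()}
```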
There is a trick described in the Cloud Run jobs documentation: the CLOUD_RUN_TASK_INDEX environment variable. That variable indicates the index of the task within the execution. For each index, pick one line in your file of websites (the line number equal to the env var value). That way, you can leverage Cloud Run jobs and their built-in parallelism.
The main tradeoff here is that the list of websites to scrape is static.
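A minimal sketch of that lookup, assuming the site list is baked into the image as a sites.txt file (one hostname per line) and the job's container entrypoint runs this script:

```python
import os
import subprocess


def main():
    # Cloud Run jobs set CLOUD_RUN_TASK_INDEX to 0..N-1 across the N tasks of an execution
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))

    with open("sites.txt") as f:
        sites = [line.strip() for line in f if line.strip()]

    # each task scrapes the site whose line number matches its task index
    site = sites[task_index]
    subprocess.run(["scrapy", "crawl", "scraper", "-a", f"site={site}"], check=True)


if __name__ == "__main__":
    main()
```

Create the job with a task count equal to the number of lines in the file (--tasks) and whatever --parallelism you want; Cloud Scheduler then only has to trigger one execution every 20 minutes.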
You can pass in overrides. For example, when triggering a job execution from the gcloud CLI or the client SDKs, you can pass an args override containing the alternative arguments to be passed to your script.
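For instance, a sketch using the google-cloud-run Python client, assuming the container's entrypoint is the scrapy executable (project, region and job names are placeholders; double-check the override field names against the current client docs):

```python
from google.cloud import run_v2


def run_for_site(project: str, region: str, job: str, site: str):
    client = run_v2.JobsClient()
    request = run_v2.RunJobRequest(
        name=f"projects/{project}/locations/{region}/jobs/{job}",
        overrides=run_v2.RunJobRequest.Overrides(
            container_overrides=[
                run_v2.RunJobRequest.Overrides.ContainerOverride(
                    # replaces the container's args for this execution only
                    args=["crawl", "scraper", "-a", f"site={site}"],
                )
            ]
        ),
    )
    return client.run_job(request=request)


for site in ["www.site1.com", "www.site2.com"]:
    run_for_site("my-project", "europe-west1", "scraper-job", site)
```

The same overrides can also go in the HTTP body that Cloud Scheduler sends when it calls the job's run endpoint directly, which avoids baking the site list into the image.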