
wget with sleep for friendly crawl

How do I download from a list of URLs with a pause between each download?

I have a list of URLs in url.txt, e.g.

http://manuals.info.apple.com/cs_CZ/Apple_TV_2nd_gen_Setup_Guide_cz.pdf
http://manuals.info.apple.com/cs_CZ/apple_tv_3rd_gen_setup_cz.pdf
http://manuals.info.apple.com/cs_CZ/imac_late2012_quickstart_cz.pdf
http://manuals.info.apple.com/cs_CZ/ipad_4th-gen-ipad-mini_info_cz.pdf
http://manuals.info.apple.com/cs_CZ/iPad_iOS4_Important_Product_Info_CZ.pdf
http://manuals.info.apple.com/cs_CZ/iPad_iOS4_Uzivatelska_prirucka.pdf
http://manuals.info.apple.com/cs_CZ/ipad_ios5_uzivatelska_prirucka.pdf
http://manuals.info.apple.com/cs_CZ/ipad_ios6_user_guide_cz.pdf
http://manuals.info.apple.com/cs_CZ/ipad_uzivatelska_prirucka.pdf

I tried wget -i url.txt, but it stops after a while because the server detects the unfriendly crawling.

How do I put a pause between each URL?

And how would I do the same with scrapy?

asked Sep 18 '14 by alvas

2 Answers

wget

wget --wait=10 --random-wait --input-file=url.txt
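If you need more control than wget's flags give you (logging, custom file names, skipping failures), the same throttling can be sketched in plain Python. This is a minimal sketch: `polite_download` and `random_delay` are hypothetical names, and the 0.5x to 1.5x spread mirrors what --random-wait does to the base --wait value.

```python
import random
import time
import urllib.request
from pathlib import Path

def random_delay(base: float) -> float:
    """Randomized pause, 0.5x to 1.5x the base delay (like --random-wait)."""
    return base * random.uniform(0.5, 1.5)

def polite_download(list_file: str, base_delay: float = 10.0) -> None:
    """Download each URL in list_file, sleeping between requests."""
    for url in Path(list_file).read_text().split():
        filename = url.rsplit("/", 1)[-1]  # save under the remote file name
        urllib.request.urlretrieve(url, filename)
        time.sleep(random_delay(base_delay))
```

Run it as `polite_download("url.txt")`; the randomized delay makes the request pattern look less like a fixed-interval bot.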

scrapy

scrapy crawl yourbot -s DOWNLOAD_DELAY=10 -s RANDOMIZE_DOWNLOAD_DELAY=1
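Instead of passing -s overrides on every run, the same two settings can live in the project's settings.py. A minimal sketch, assuming a standard Scrapy project layout (DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are real Scrapy settings; the concurrency limit is an optional extra):

```python
# settings.py -- equivalent to the -s flags above
DOWNLOAD_DELAY = 10               # base seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the actual delay, 0.5x to 1.5x the base
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # optional: one request at a time per domain
```

With these in place, a plain `scrapy crawl yourbot` picks up the delay automatically.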
answered Nov 12 '22 by kev

You can add a delay between each request with the -w or --wait option:

     -w seconds or --wait=seconds
answered Nov 12 '22 by Tasawer Nawaz