 

What is the best way to download a very large number of pages from a list of URLs?

I have >100,000 URLs (different domains) in a list that I want to download and save in a database for further processing and tinkering.

Would it be wise to use Scrapy instead of Python's multiprocessing/multithreading? If so, how do I write a standalone script to do this?

Also, feel free to suggest other awesome approaches that come to your mind.

asked Nov 03 '22 by Anuvrat Parashar

1 Answer

Scrapy does not seem relevant here, since you already know exactly which URLs to fetch (there is no crawling involved).

The easiest approach that comes to mind would be to use Requests. However, querying each URL sequentially and blocking while waiting for each response would be inefficient, so you could consider GRequests to send batches of requests asynchronously. For example:
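Here is a minimal sketch with GRequests, assuming the URLs sit one per line in a file named urls.txt; the file name and the store_page helper are placeholder assumptions, not anything from the question.

```python
import grequests


def store_page(url, html):
    """Placeholder: replace with your actual database insert."""
    print(f"fetched {url}: {len(html)} bytes")


# Assumption: urls.txt holds one URL per line.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Build unsent request objects; nothing is fetched until map() runs them.
reqs = (grequests.get(u, timeout=10) for u in urls)

# Send up to 50 requests concurrently; failed requests come back as None.
for resp in grequests.map(reqs, size=50):
    if resp is not None and resp.ok:
        store_page(resp.url, resp.text)
```

With >100,000 URLs, grequests.map collects every response before returning, so grequests.imap (which yields responses as they complete) may be a better fit if you want to store pages as they arrive.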

answered Nov 10 '22 by icecrime