I have a list of more than 100,000 URLs (different domains) that I want to download and save in a database for further processing and tinkering.
Would it be wise to use Scrapy instead of Python's multiprocessing / multithreading? If yes, how do I write a standalone script to do the same?
Also, feel free to suggest other awesome approaches that come to your mind.
Scrapy does not seem relevant here if you already know exactly which URLs to fetch (there is no crawling involved).
The easiest way that comes to mind would be to use Requests.
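A minimal sequential sketch with Requests might look like this; `urls` and `save_to_db` are placeholders for your own URL list and database layer:

```python
import requests

def save_to_db(url, body):
    ...  # hypothetical helper: insert the response body into your database

def fetch_all(urls):
    with requests.Session() as session:  # reuse connections where possible
        for url in urls:
            try:
                response = session.get(url, timeout=10)
                response.raise_for_status()
                save_to_db(url, response.text)
            except requests.RequestException as exc:
                print(f"failed: {url} ({exc})")
```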
However, querying each URL sequentially and blocking while waiting for each answer wouldn't be efficient, so you could consider GRequests to send batches of requests asynchronously.
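For example, a sketch with GRequests (assumes `pip install grequests`); the pool size and the `save_to_db` helper are illustrative assumptions, not part of the library:

```python
import grequests

def save_to_db(url, body):
    ...  # hypothetical helper: insert the response body into your database

def fetch_concurrently(urls, pool_size=100):
    pending = (grequests.get(u, timeout=10) for u in urls)
    # grequests.map runs up to `size` requests concurrently via gevent;
    # failed requests come back as None
    for response in grequests.map(pending, size=pool_size):
        if response is not None and response.ok:
            save_to_db(response.url, response.text)
```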