Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallel download using Curl command line utility

I want to download some pages from a website and I did it successfully using curl but I was wondering if somehow curl downloads multiple pages at a time just like most of the download managers do, it will speed up things a little bit. Is it possible to do it in curl command line utility?

The current command I am using is

curl 'http://www...../?page=[1-10]' 2>&1 > 1.html

Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.

Also, is it possible for curl to write output of each URL to separate file say URL.html, where URL is the actual URL of the page under process.

like image 275
Ravi Gupta Avatar asked Dec 26 '11 08:12

Ravi Gupta


2 Answers

My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.

The one-liner I would use is, simply:

$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'

This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.

You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar in a for loop in a shell, but end up doing process management, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.

Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:

parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}
like image 80
ndronen Avatar answered Oct 15 '22 08:10

ndronen


Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.

curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).

You could use something like the following

urls="http://example.com/?page1.html http://example.com?page2.html" # add more URLs here

for url in $urls; do
   # run the curl job in the background so we can start another job
   # and disable the progress bar (-s)
   echo "fetching $url"
   curl $url -O -s &
done
wait #wait for all background jobs to terminate
like image 31
nimrodm Avatar answered Oct 15 '22 09:10

nimrodm