I want to download some pages from a website, and I did that successfully using curl. But I was wondering whether curl can somehow download multiple pages at a time, just like most download managers do; it would speed things up a bit. Is this possible with the curl command-line utility?
The current command I am using is

curl 'http://www...../?page=[1-10]' 2>&1 > 1.html

Here I am downloading pages 1 to 10 and storing them in a single file named 1.html.
Also, is it possible for curl to write the output of each URL to a separate file, say URL.html, where URL is the actual URL of the page being processed?
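(As an aside on the second part: curl's own URL globbing can number the output files via the #1 placeholder in -o, so each page goes to its own file; the host below is a placeholder.)

# #1 expands to the current value of the [1-10] range, giving
# page_1.html through page_10.html (placeholder host)
curl -s 'http://example.com/?page=[1-10]' -o 'page_#1.html'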
My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.
The one-liner I would use is, simply:
$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s "$url"'
This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running at all times, each one handling a single argument, until all of the input arguments have been processed.

You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar in a for loop in a shell, but you end up doing the process management yourself, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
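A small variation on the same idea can also cover the separate-file part of the original question; the urls.txt input file and the file-naming scheme below are just assumptions for the sketch:

# Read URLs from a (hypothetical) urls.txt, fetch up to 4 at a time, and
# save each page under a name derived from its URL (/, : and ? become _)
xargs -n1 -P4 sh -c 'curl -s -o "$(printf "%s" "$1" | tr "/:?" "___").html" "$1"' _ < urls.txt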
Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:

parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}
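If you also want each page saved under an explicit name rather than relying on -O, a small variation (placeholder host again) passes the page number to -o:

# {} is replaced with the page number, both in the URL and in the output name
parallel --jobs 2 'curl -s -o page{}.html "http://example.com/?page={}"' ::: {1..10}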
Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.

curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).
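For instance (placeholder URL), -O derives the local file name from the last component of the URL's path:

# Saves the response to ./page1.html in the current directory
curl -O -s http://example.com/page1.html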
You could use something like the following:

urls="http://example.com/?page1.html http://example.com/?page2.html" # add more URLs here

for url in $urls; do
    # run the curl job in the background so we can start another job,
    # and disable the progress bar (-s)
    echo "fetching $url"
    curl "$url" -O -s &
done
wait # wait for all background jobs to terminate
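If you want to cap how many downloads run at once without pulling in xargs or GNU Parallel, one rough variation is to wait after each batch of background jobs; a minimal sketch, assuming bash and placeholder URLs:

# Placeholder URLs; each has a real file name component so -O works as-is
urls="http://example.com/page1.html http://example.com/page2.html"
max_jobs=2   # how many curl processes to allow at a time
count=0
for url in $urls; do
    curl -s -O "$url" &
    count=$((count + 1))
    # once a batch is full, wait for it to drain before starting more
    if [ "$count" -ge "$max_jobs" ]; then
        wait
        count=0
    fi
done
wait # wait for any remaining background jobs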