 

Downloading multiple files in parallel in R

Tags: file, r, download

I am trying to download 460,000 files from an ftp server (which I got from the TRMM archive data). I made a list of all the files and separated them into different jobs, but can anyone help me with how to run those jobs at the same time in R? Here is an example of what I have tried:

my.list <- readLines("1998-2010.txt") # the ftp address of each file, one per line
name <- basename(my.list)             # destination file names
job1 <- for (i in 1:1000) {
            download.file(my.list[i], name[i], mode = "wb")
        }
job2 <- for (i in 1001:2000) {
            download.file(my.list[i], name[i], mode = "wb")
        }
job3 <- for (i in 2001:3000) {
            download.file(my.list[i], name[i], mode = "wb")
        }

Now I'm stuck on how to run all of the jobs at the same time.

I'd appreciate your help.

asked by Dipangkar

2 Answers

Don't do that. Really, don't. It won't be any faster, because the limiting factor is going to be the network speed. You'll just end up with a large number of even slower downloads, then the server will give up and throw you off, and you'll be left with a large number of half-downloaded files.

Downloading multiple files at once will also increase the disk load, since your PC will be trying to save a large number of files at the same time.

Here's another solution.

Use R (or some other tool; it's one line of awk starting from your list) to write an HTML file that just looks like this:

<a href="ftp://example.com/path/file-1.dat">file-1.dat</a>
<a href="ftp://example.com/path/file-2.dat">file-2.dat</a>

and so on. Now open this file in your web browser and use a download manager (e.g. DownThemAll for Firefox) to download all the links. With DownThemAll you can specify how many simultaneous downloads to run, how many times to retry failed downloads, and so on.
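As a minimal sketch of the R side, assuming "1998-2010.txt" holds one ftp URL per line (the output file name downloads.html is hypothetical):

urls <- readLines("1998-2010.txt")
# Turn each URL into an anchor tag, using the last path component as the link text
links <- sprintf('<a href="%s">%s</a>', urls, basename(urls))
writeLines(links, "downloads.html")

Opening downloads.html in Firefox then gives DownThemAll a single page with all 460,000 links on it.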

answered by Spacedman


A good option is to use mclapply or parLapply from the built-in parallel package. You then write a function that accepts a list of files to download:

library(parallel)

# Download every file in a sublist, naming each after the
# last path component of its URL
download_list <- function(file_list) {
    lapply(file_list, function(url) {
        download.file(url, destfile = basename(url), mode = "wb")
    })
}

list_of_file_lists <- list(my.list[1:1000], my.list[1001:2000])  # etc.
mclapply(list_of_file_lists, download_list)

I think it is wise to first split the big list of files into a set of sublists, because a process is spawned for each entry in the list fed to mclapply. If this list is big, and the processing time per item in the list is small, the overhead of parallelisation will probably make the downloading slower instead of faster.
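A hedged sketch of that splitting step, reusing my.list and download_list from above (the chunk size of 1000 is an arbitrary assumption):

chunk_size <- 1000
# split() groups the URLs by chunk index, giving a list of character vectors
list_of_file_lists <- split(my.list, ceiling(seq_along(my.list) / chunk_size))
mclapply(list_of_file_lists, download_list)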

Do note that mclapply only works on Linux (and other Unix-like systems); parLapply should also work fine under Windows.
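For Windows, a minimal sketch using parLapply instead (the worker count of 4 is an arbitrary assumption):

cl <- makeCluster(4)  # spawn 4 worker processes
# download_list is passed directly to parLapply, so it is shipped to the workers
parLapply(cl, list_of_file_lists, download_list)
stopCluster(cl)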

answered by Paul Hiemstra


