 

How can I detect newly added files in a remote folder over the web in R?

How can I write an R script, running on a server, that detects whenever a new CSV file (or a file of another specific format) is added to a remote folder on the web, and downloads it automatically?

Example folder: https://ftp.ncbi.nlm.nih.gov/pub/pmc/

As soon as a new CSV file is added to this folder, I want to download it right away so that I can process it locally.

1 Answer

I know the OP was looking for an "event listener" to monitor for changes on the file server, but some message has to be sent from the remote server to notify your computer of the change. If you have no control over the file server, the only way to get it to send you a message is to send it a request first. This means the only general "event listener" available is one that works by intermittently polling the server.

Depending on how frequently you poll, this should work perfectly well as an event listener. As an analogy, many species of bats hunt by sending out intermittent pulses of ultrasound and listening for the response. This is a form of intermittent polling that works well enough to keep them alive.

This does mean some software has to be running in the background on your own computer. Your two options are to use a scheduler (such as cron or the Windows Task Scheduler) to run the R script intermittently, or to run an R script in the background that loops with a pause between polls. A sketch of the scheduling option follows.
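For example, a small wrapper script could call the mirroring function defined further down and be triggered by cron every few minutes. The file names and the 10-minute interval here are illustrative assumptions, not part of the original answer:

# poll_mirror.R -- assumes local_mirror() (defined below) lives in local_mirror.R
source("local_mirror.R")                              # load the mirroring function
local_mirror("https://ftp.ncbi.nlm.nih.gov/pub/pmc/") # check for and fetch new files

# Example crontab entry (Linux/macOS) to run the script every 10 minutes:
#   */10 * * * * Rscript /path/to/poll_mirror.R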

It appears from the comments that the OP only wants to download files newly added to the server, not to copy the existing files when the program is first run. That means keeping a local record of the ftp directory's contents from the last check, comparing it with the directory's current contents, downloading any files that are new, and then updating the record. The core of that comparison is sketched just below.
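A minimal sketch of that comparison step, using made-up file names in place of the real listings:

# Listing recorded on the previous run vs. listing scraped from the server now
previous <- c("oa_file_list.csv", "filelist_2022.csv")
current  <- c("oa_file_list.csv", "filelist_2022.csv", "filelist_2023.csv")

new_files <- setdiff(current, previous)  # files added since the last check
new_files                                # "filelist_2023.csv" -> download these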

Here's a function that does just that. The first time you run it, it will create a new local directory named after the hosting url and a .csv file with a listing of the directory at that point. Subsequent calls to the function after this will compare the contents of the local and remote directories and download any new files:

local_mirror <- function(url, root_dir = path.expand("~/"), silent = FALSE)
{
  # Ensure the root directory ends with a trailing slash
  if(substring(root_dir, nchar(root_dir), nchar(root_dir)) != "/")
    root_dir <- paste0(root_dir, "/")

  # Scrape all hyperlinks from the remote directory listing
  content <- rvest::html_nodes(xml2::read_html(url), "a")
  links <- rvest::html_attr(content, "href")

  # Keep only plain file names (drop links to subdirectories and parent paths)
  links <- grep("/", links, invert = TRUE, value = TRUE)

  # Build a local mirror path that echoes the remote host and path
  rel_path <- strsplit(url, "//")[[1]][2]
  mirror_path <- paste0(root_dir, rel_path)

  # First run: create the local directory tree and record the current listing
  # without downloading anything, so only *future* additions are fetched
  if(!dir.exists(mirror_path))
  {
    build_path <- root_dir
    for(i in strsplit(rel_path, "/")[[1]])
    {
      build_path <- paste0(build_path, i, "/")
      dir.create(build_path)
    }
    write.csv(links, paste0(mirror_path, ".mirrordat.csv"))
  }

  # Read the listing recorded on the previous run
  records <- read.csv(paste0(mirror_path, ".mirrordat.csv"), stringsAsFactors = FALSE)
  current_files <- records$x

  # Download every file that is on the server but not yet in the local record
  n_updated <- 0
  if(!silent) cat("Updating files - please wait")
  for(i in seq_along(links))
  {
    if(!(links[i] %in% current_files))
    {
      download.file(paste0(url, links[i]), paste0(mirror_path, links[i]))
      n_updated <- n_updated + 1
    }
  }
  if(!silent) message(paste("Downloaded", n_updated, "files"))

  # Update the record so the next call only sees newer additions
  write.csv(links, paste0(mirror_path, ".mirrordat.csv"))
}

To run the function in your case, you would just run:

local_mirror("https://ftp.ncbi.nlm.nih.gov/pub/pmc/")

and to run it as a constant "event monitor" in the background, you would place it inside a looping function like this:

listen_for_changes <- function(url, poll_every = 5, silent = TRUE)
{
  # Poll indefinitely, pausing poll_every seconds between checks
  repeat
  {
    local_mirror(url, silent = silent)
    Sys.sleep(poll_every)
  }
}

You would then run this with:

listen_for_changes("https://ftp.ncbi.nlm.nih.gov/pub/pmc/")
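If you would rather keep your interactive session free while the listener runs, one option (my own suggestion using the callr package, not something from the original answer) is to launch it in a separate background R process; the source file name here is an assumption:

# Run the listener in a background R process (requires the callr package)
bg <- callr::r_bg(function() {
  source("local_mirror.R")  # assumed script defining local_mirror() and listen_for_changes()
  listen_for_changes("https://ftp.ncbi.nlm.nih.gov/pub/pmc/")
})

bg$is_alive()  # TRUE while the poller is still running
# bg$kill()    # stop it when no longer needed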