
How do I create a progress bar for data loading in R?

Is it possible to create a progress bar for data loaded into R using load()?

For a data analysis project, large matrices are being loaded into R from .RData files, which take several minutes to load. I would like a progress bar to monitor how much longer it will be before the data is loaded. R already has nice progress bar functionality built in, but load() has no hooks for monitoring how much data has been read. If I can't use load() directly, is there an indirect way I can create such a progress bar? Perhaps by loading the .RData file in chunks and putting them together in R. Does anyone have any thoughts or suggestions on this?

asked May 28 '11 by Nixuz


2 Answers

I came up with the following solution, which will work for file sizes less than 2^32 - 1 bytes.

The R object needs to be serialized and saved to a file, as done by the following code.

saveObj <- function(object, file.name){
    outfile <- file(file.name, "wb")
    serialize(object, outfile)
    close(outfile)
}

Then we read the binary data in chunks, keeping track of how much is read and updating the progress bar accordingly.

loadObj <- function(file.name){
    library(foreach)
    library(iterators)  # provides icount()
    filesize <- file.info(file.name)$size
    chunksize <- ceiling(filesize / 100)
    pb <- txtProgressBar(min = 0, max = 100, style = 3)
    infile <- file(file.name, "rb")
    data <- foreach(it = icount(100), .combine = c) %do% {
        setTxtProgressBar(pb, it)
        readBin(infile, "raw", chunksize)
    }
    close(infile)
    close(pb)
    return(unserialize(data))
}

The code can be run as follows:

> a <- 1:100000000
> saveObj(a, "temp.RData")
> b <- loadObj("temp.RData")
  |======================================================================| 100%
> all.equal(b, a)
[1] TRUE

If we benchmark the progress bar method against reading the file in a single chunk we see the progress bar method is slightly slower, but not enough to worry about.

> system.time(unserialize(readBin(infile, "raw", file.info("temp.RData")$size)))
   user  system elapsed
  2.710   0.340   3.062
> system.time(b <- loadObj("temp.RData"))
  |======================================================================| 100%
   user  system elapsed
  3.750   0.400   4.154

So while the above method works, I feel it is of limited use because of the file size restriction. Progress bars are only useful for large files, which are precisely the ones that take a long time to read in.

It would be great if someone could come up with something better than this solution!

answered Oct 21 '22 by Nixuz


Might I instead suggest speeding up the load (and save) times so that a progress bar isn't needed? If reading one matrix is "fast", you could then potentially report progress between each read matrix (if you have many).
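A minimal sketch of that per-matrix progress idea, assuming a one-object-per-file layout (the helper name and file layout are illustrative, not from the answer):

```r
# Hypothetical sketch: advance a progress bar after each .RData file is
# loaded, assuming each file contains exactly one object.
loadMatrices <- function(file.names) {
  pb <- txtProgressBar(min = 0, max = length(file.names), style = 3)
  result <- vector("list", length(file.names))
  for (i in seq_along(file.names)) {
    env <- new.env()
    load(file.names[i], envir = env)        # load into a scratch environment
    result[[i]] <- get(ls(env)[1], envir = env)
    setTxtProgressBar(pb, i)                # progress ticks once per file
  }
  close(pb)
  result
}
```

This only reports progress at file granularity, so it helps when the data is split across many files rather than one huge one.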

Here are some measurements. Simply setting compress=FALSE doubles the load speed. But with a simple custom matrix serializer, loading is almost 20x faster.

x <- matrix(runif(1e7), 1e5) # Matrix with 100k rows and 100 columns

system.time( save('x', file='c:/foo.bin') ) # 13.26 seconds
system.time( load(file='c:/foo.bin') ) # 2.03 seconds

system.time( save('x', file='c:/foo.bin', compress=FALSE) ) # 0.86 seconds
system.time( load(file='c:/foo.bin') ) # 0.92 seconds

system.time( saveMatrix(x, 'c:/foo.bin') ) # 0.70 seconds
system.time( y <- loadMatrix('c:/foo.bin') ) # 0.11 seconds !!!
identical(x,y)

Where saveMatrix/loadMatrix are defined as follows. They don't currently handle dimnames and other attributes, but that could easily be added.

saveMatrix <- function(m, fileName) {
    con <- file(fileName, 'wb')
    on.exit(close(con))
    writeBin(dim(m), con)
    writeBin(typeof(m), con)
    writeBin(c(m), con)
}

loadMatrix <- function(fileName) {
    con <- file(fileName, 'rb')
    on.exit(close(con))
    d <- readBin(con, 'integer', 2)
    type <- readBin(con, 'character', 1)
    structure(readBin(con, type, prod(d)), dim=d)
}
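As a sketch of the dimnames extension mentioned above (the function names saveMatrix2/loadMatrix2 are assumptions for illustration), the dimnames could be written after the data as a serialized raw vector and reattached on load:

```r
# Hypothetical variants that also store dimnames. The dimnames (which may
# be NULL) are converted to a raw vector with serialize() and written after
# the matrix data, preceded by their byte length.
saveMatrix2 <- function(m, fileName) {
    con <- file(fileName, 'wb')
    on.exit(close(con))
    writeBin(dim(m), con)
    writeBin(typeof(m), con)
    writeBin(c(m), con)
    dn <- serialize(dimnames(m), NULL)  # raw vector, handles NULL too
    writeBin(length(dn), con)
    writeBin(dn, con)
}

loadMatrix2 <- function(fileName) {
    con <- file(fileName, 'rb')
    on.exit(close(con))
    d <- readBin(con, 'integer', 2)
    type <- readBin(con, 'character', 1)
    m <- structure(readBin(con, type, prod(d)), dim = d)
    n <- readBin(con, 'integer', 1)
    dimnames(m) <- unserialize(readBin(con, 'raw', n))
    m
}
```

The length prefix lets the reader know how many bytes of serialized dimnames follow, at the cost of a few extra bytes per file.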
answered Oct 21 '22 by Tommy