Fastest way to read in 100,000 .dat.gz files

I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read in each file and immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just want to speed up the bottleneck, which is unzipping and reading in the data.

Each dataset consists of 366 rows and 17 columns. Here is a reproducible example of what I am doing so far:

Building reproducible data:

require(data.table)

# Make dir
system("mkdir practice")

# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day=0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n=366))]
  write.table(dt, paste0("./practice/",file.nm), row.names=FALSE, sep="\t", quote=FALSE)
  system(paste0("gzip ./practice/", file.nm))    
}

And here is the code that applies it:

# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))

And here is my most efficient code so far to read in the data:

# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz)) # Unzip file (note: this deletes the .gz original)
  path.dat <- gsub("\\.gz$", "", path.gz) # Strip the extension (anchored so only a trailing ".gz" matches)
  fread(path.dat)
}

# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list

I have bundled this up in a function and applied it in parallel, but it is still much too slow for what I need.
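In case it helps, here is roughly what the parallel version looks like (just a sketch; mclapply forks, so it assumes a Unix-like system, and the core count is arbitrary):

# Sketch of the parallel apply (Unix-like systems only, since mclapply forks)
require(parallel)
dat.files <- list.files(path="./practice", full.names=TRUE)
dat.list <- rbindlist(mclapply(dat.files, read_Fast, mc.cores=4), fill=TRUE) # 4 cores is arbitrary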

I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am unsure. From the few times I have run this, I have noticed that unzipping usually takes about two-thirds of the function time.
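One idea I have not benchmarked yet: newer versions of data.table let fread decompress on the fly, either by passing a shell pipeline via the cmd argument, or by pointing fread at the .gz file directly (recent releases, with the R.utils package installed). Either way the separate gzip -d step disappears and the originals stay intact. A sketch (read_fast2 is just an illustrative name):

# Sketch: stream the decompression into fread instead of unzipping in place
read_fast2 <- function(path.gz) {
  fread(cmd=paste("gzip -dc", shQuote(path.gz))) # -dc decompresses to stdout, leaving the .gz file alone
}
dat.list <- rbindlist(lapply(dat.files, read_fast2), fill=TRUE)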

asked Mar 03 '16 by Mike.Gahan

1 Answer

I'm sort of surprised that this actually worked. Hopefully it works for your case too. I'm quite curious to know how the speed compares to reading the compressed data from disk directly in R (albeit with a penalty for non-vectorization) instead.

# Grab the column names from the first header line of the concatenated stream
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
# Decompress everything in one pass, dropping the repeated header rows
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
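And here is roughly what I have in mind for the pure-R comparison (a sketch; gzfile() is base R and decompresses the connection transparently, but the per-file loop is the non-vectorized part):

# Sketch: read each .dat.gz through a gzfile() connection, entirely in R
dat.files <- list.files("./practice", pattern="\\.dat\\.gz$", full.names=TRUE)
tbl2 <- rbindlist(lapply(dat.files, function(f) read.delim(gzfile(f))))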
answered Sep 20 '22 by Clayton Stanley