I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?
I agree with @Sixiang.Hu's answer, R's unzip() won't work reliably with files greater than 4GB.
To get at how did you solve it?: I've tried a few different tricks with it, and in my experience the result of anything using R's built-ins is (almost) invariably an incorrect identification of the end-of-file (EOF) marker before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis, and to deal with it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you're doing with system(unzip()), but gives you a bit more flexibility in its behavior, and allows you to check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
if (.file_cache == TRUE) {
print("decompression skipped")
} else {
# Set working directory for decompression
# simplifies unzip directory location behavior
wd <- getwd()
setwd(directory)
# Run decompression
decompression <-
system2("unzip",
args = c("-o", # include override flag
file),
stdout = TRUE)
# uncomment to delete archive once decompressed
# file.remove(file)
# Reset working directory
setwd(wd); rm(wd)
# Test for success criteria
# change the search depending on
# your implementation
if (grepl("Warning message", tail(decompression, 1))) {
print(decompression)
}
}
}
Notes:
The function does a few things, which I like and recommend:
system2
over system because the documentation says "system2 is a more portable and flexible interface than system" directory
and file
arguments, and moves the working directory to the directory
argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
.file_cache
argument which allows you to skip decompression
if
+ grepl
check at the end looks for warnings in the stdout, and prints the stdout if it finds that expressionChecking ?unzip
, found the following comment in Note
:
It does have some support for bzip2 compression and > 2GB zip files (but not >= 4GB files pre-compression contained in a zip file: like many builds of unzip it may truncate these, in R's case with a warning if possible).
You can try to unzip it outside of R (using 7-Zip for example).
To add to the list of possible solutions, in case you have Java (JDK) available on your machine, you can wrap jar xf
into an R function similar to utils::unzip()
in interface, a very simple example:
unzipLarge <- function(zipfile, exdir = getwd()) {
oldWd <- getwd()
on.exit(setwd(oldWd))
setwd(exdir)
system2("jar", args = c("xf", zipfile))
}
And then use:
unzipLarge("year2015.zip", exdir = "csv_folder")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With