
Error in reading files in R

Tags: r

I'm a newcomer to the R community. While coding my first programs, I've run into a silly problem. When I try to read an RDS file with the following code:

tweets <- readRDS("RDataMining-Tweets-20160212.rds")

I get the following error:

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file 'RDataMining-Tweets-20160212.rds', probable reason 'No such file or directory'

What's the problem here?

Eilia asked Nov 29 '22 06:11


1 Answer

Since we don't have access to your file, it's difficult to know for sure, so let me give you some examples of what different types of files might give you.

First, some files:

# Compression settings accepted by saveRDS() and save()
ctypes <- list(FALSE, 'gzip', 'bzip2', 'xz')
saverds_names <- sprintf('saveRDS_%s.rds', ctypes)
save_names <- sprintf('save_%s.rda', ctypes)
# Write mtcars once per compression type; 'ign' only captures the return values
ign <- mapply(function(fn,ct) saveRDS(mtcars, file=fn, compress=ct),
              saverds_names, ctypes)
ign <- mapply(function(fn,ct) save(mtcars, file=fn, compress=ct),
              save_names, ctypes)

A common (unix-y) utility is file, which uses file signatures to determine the probable file type. (On Windows, it is usually installed with Rtools. If Sys.which("file") is empty, look around wherever you have Rtools installed for something like c:\Rtools\bin\file.exe.)

Sys.which('file')
#                        file 
# "c:\\Rtools\\bin\\file.exe" 
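The lookup described above can be sketched as a small helper. This is only an illustration: find_file_utility is a hypothetical name, and the fallback path is an assumption based on the Rtools location shown above, so adjust it to your own installation.

```r
# Hypothetical helper: locate the 'file' utility, falling back to candidate
# paths (assumed locations; adjust to wherever Rtools lives on your machine).
find_file_utility <- function(candidates = c("c:/Rtools/bin/file.exe")) {
  exe <- unname(Sys.which("file"))
  if (nzchar(exe)) {
    return(exe)
  }
  hits <- candidates[file.exists(candidates)]
  if (length(hits) > 0) hits[[1]] else ""
}

file_exe <- find_file_utility()
target <- "RDataMining-Tweets-20160212.rds"  # file name from the question

if (nzchar(file_exe) && file.exists(target)) {
  # shQuote() protects paths that contain spaces.
  print(system2(file_exe, shQuote(target), stdout = TRUE))
} else {
  message("Either the 'file' utility or the target file was not found.")
}
```

If the message branch fires, it is worth checking which of the two is missing before going any further.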

With this, let's see what file thinks these files are likely to be:

str(lapply(saverds_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
#  $ : chr "saveRDS_FALSE.rds: data"
#  $ : chr "saveRDS_gzip.rds: gzip compressed data, from HPFS filesystem (OS/2, NT)"
#  $ : chr "saveRDS_bzip2.rds: bzip2 compressed data, block size = 900k"
#  $ : chr "saveRDS_xz.rds: XZ compressed data"
str(lapply(save_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
#  $ : chr "save_FALSE.rda: data"
#  $ : chr "save_gzip.rda: gzip compressed data, from HPFS filesystem (OS/2, NT)"
#  $ : chr "save_bzip2.rda: bzip2 compressed data, block size = 900k"
#  $ : chr "save_xz.rda: XZ compressed data"

That helps a little. If your file does not return one of these four strings, then you are likely looking at a corrupted file (or a mis-named one, i.e., not really the .rds format that we are expecting).

If it does return one of them, though, know that readRDS (the first four) and load (the last four) automatically determine which compression was used, so a compression mismatch is not the problem. If the read still fails, the file is most likely corrupt (or holds some other form of compressed data; again, likely mis-named).
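If the file utility is not available, a similar check can be approximated in R by looking at the first few bytes of the file. The sketch below is an assumption-laden illustration, not how readRDS itself does it: guess_compression is a hypothetical helper, and it only knows the standard leading signatures of gzip (1f 8b), bzip2 ("BZh") and xz (fd 37 7a 58 5a 00).

```r
# Hypothetical helper: guess a file's compression from its leading bytes.
guess_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(sig) {
    length(magic) >= length(sig) &&
      identical(magic[seq_along(sig)], as.raw(sig))
  }
  if (starts_with(c(0x1f, 0x8b))) {
    "gzip"
  } else if (starts_with(c(0x42, 0x5a, 0x68))) {
    "bzip2"
  } else if (starts_with(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00))) {
    "xz"
  } else {
    "none or unknown"
  }
}

# Try it on a freshly written temporary file.
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp, compress = "xz")
print(guess_compression(tmp))
```

A result of "none or unknown" covers both uncompressed R data and anything that is not R data at all, so it narrows things down rather than settling them.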

In contrast, some other file types return these:

system2("file", "blank.accdb")
# blank.accdb: raw G3 data, byte-padded
system2("file", "Book1.xlsx")
# Book1.xlsx: Zip archive data, at least v2.0 to extract
system2("file", "Book1.xls")
# Book1.xls: OLE 2 Compound Document
system2("file", "j.fra.R") # code for this answer
# j.fra.R: ASCII text, with CRLF line terminators

(The "with CRLF" part is a Windows-y thing. *sigh*) The last one will also be the case for CSV and similar text-based tabular files.

@divibisan's suggestion that the file could be corrupt is the most likely culprit, in my mind, but a corrupt file might give different output. A complete file reads cleanly:

file.size(saverds_names[1])
# [1] 3798
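# readBin() reads only one element by default, so request the whole file with n=
head(readRDS(rawConnection(readBin(saverds_names[1], what=raw(1),
                                   n=file.size(saverds_names[1])))), n=2)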
#               mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Incomplete data from truncated files looks different, though: I truncated the files (externally with dd), and the error I received was "Error in readRDS: error reading from connection\n".
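The truncation experiment can also be reproduced entirely in R, without dd. This is a minimal sketch under my own assumptions (a temporary gzip-compressed file cut to half its length); the exact error text may differ from the one quoted above depending on the R version and where the cut falls.

```r
# Write a small gzip-compressed RDS file to a temporary location.
path <- tempfile(fileext = ".rds")
saveRDS(mtcars, path, compress = "gzip")

# Copy only the first half of its bytes to simulate an interrupted transfer.
bytes <- readBin(path, what = "raw", n = file.size(path))
truncated <- tempfile(fileext = ".rds")
writeBin(bytes[seq_len(length(bytes) %/% 2)], truncated)

# Capture the error message instead of letting readRDS() stop the script.
result <- tryCatch(
  readRDS(truncated),
  error = function(e) paste("read failed:", conditionMessage(e))
)
print(result)
```

Wrapping the read in tryCatch is a design choice for scripts that load many files: one damaged file gets reported and the rest still load.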

Looking at the source for R, that error string is only present in R_gzread, suggesting that R thinks the file is compressed with "gzip" (which is the default, perhaps because it could not positively identify any other obvious compression method).

This isn't much of an answer, but it might give you some appreciation for what could be wrong. The bottom line, unfortunately, is that it is highly unlikely that you will be able to recover any data from a corrupted file.

r2evans answered Dec 12 '22 15:12