Read a zipped .csv file in R

Tags: r, csv

I have been trying hard to solve this, but I cannot get my head around how to read zipped .csv files in R. I could first unzip the files and then read them, but since the amount of unzipped data is around 22GB, I guess it is more practical to handle zipped files.

I basically have many .csv files, which I ZIPPED ONE BY ONE into single .7z files. Every file is named like: file1.csv, file2.csv, etc., which zipped became respectively: file1.csv.7z, file2.csv.7z, etc.

If I use the following command:

data <- read.table(unz("substn-20100101.csv.7z", "substn-20100101.csv"), nrows=10, header=T, quote="\"", sep=",")

I get the message:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") : cannot open zip file 'substn-20100101.7z'

Any help would be much appreciated, thank you in advance.

bosspe asked Feb 17 '14 13:02


2 Answers

First of all, if your problem is RAM (you said the unzipped data is around 22 GB), reading compressed files won't solve it: read.table, for example, still loads the whole file into memory. If you are using these files for some kind of modeling, I advise you to look at the ff and bigmemory packages.

Another solution is Revolution R, which has an academic licence that you can use for free. Revolution R provides big-data capabilities, and you can manage these files easily with packages like RevoScaleR.

Yet another solution is Postgres + MADlib + PivotalR. After ingesting the data into Postgres, use the PivotalR package to access it and build models with the MADlib library, directly from the R console.

BUT, if you are planning something that can be done on chunks of data, a summary for example, you can use the iterators package. Here is a use case to show how this can be done. Get the Airlines data for 1988 and follow this code:

> install.packages('iterators')
> library(iterators)
> con <- bzfile('1988.csv.bz2', 'r')

OK, now you have a connection to your file. Let's create an iterator:

> it <- ireadLines(con, n=1) ## read just one line from the connection (n=1)

Just to test:

> nextElem(it)

and you will see something like:

[1] "1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI,273,NA,NA,0,NA,0,NA,NA,NA,NA,NA"

> nextElem(it) 

and you will see the next line, and so on. Note that you are reading one line at a time, so you are not loading the whole file into RAM.

If you want to read line by line till the end of the file you can use

> tryCatch(expr=nextElem(it), error=function(e) return(FALSE))

for example. When the file ends, it returns a logical FALSE.
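To make the chunk idea concrete, here is a minimal sketch of a running summary computed over a bzip2-compressed CSV without loading it all at once. It is a base-R variant of the same idea: it calls readLines() on the open connection instead of using the iterators package. The file contents, the column names (id, delay) and the chunk size of 4 are made up for illustration; a real file would need a much larger chunk size.

```r
# Build a small bzip2-compressed CSV so the sketch is self-contained.
# The data and column names are invented for illustration.
path <- tempfile(fileext = ".csv.bz2")
out <- bzfile(path, "w")
writeLines(c("id,delay", paste(1:10, seq(5, 50, by = 5), sep = ",")), out)
close(out)

# Open the compressed file for reading and consume the header line first.
con <- bzfile(path, "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

n_rows <- 0
delay_sum <- 0
repeat {
  # Read the next chunk of lines; an empty result means end of file.
  lines <- readLines(con, n = 4)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  # Update the running totals; only this chunk is held in memory.
  n_rows <- n_rows + nrow(chunk)
  delay_sum <- delay_sum + sum(chunk$delay)
}
close(con)

mean_delay <- delay_sum / n_rows
print(mean_delay)
```

Only the running totals survive between iterations, so memory use depends on the chunk size rather than on the file size.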

Flavio Barros answered Nov 03 '22 01:11


If I understand the question correctly, at least on Windows OS, you could use 7-Zip Command-Line.

For the sake of simplicity, put 7za.exe in your R working directory (along with your 7zip files) and create a .bat file with the following text in it:

7za e *.7z -y

...then in R you run the following code:

my_batch <- "your_bat_file_name.bat"
shell.exec(shQuote(paste(my_batch), type = "cmd"))

Then you just read.table()... It works for me.
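If you would rather not unpack everything to disk, a possible variation is to let 7-Zip write the extracted file to standard output (its -so switch) and read that through an R pipe() connection. This is only a sketch under assumptions: the helper name sevenzip_stream_cmd is hypothetical, the 7za binary is assumed to be reachable from R, and file1.csv.7z is the example name from the question. I have not tried it on the asker's files.

```r
# Hypothetical helper: builds a 7-Zip command that streams the archive
# contents to stdout ("e" = extract, "-so" = write to stdout).
# Assumes a 7-Zip command-line binary named "7za" is reachable from R.
sevenzip_stream_cmd <- function(archive, exe = "7za") {
  # Quote for cmd.exe on Windows, for a POSIX shell elsewhere.
  quote_type <- if (.Platform$OS.type == "windows") "cmd" else "sh"
  paste(shQuote(exe, type = quote_type), "e", "-so",
        shQuote(archive, type = quote_type))
}

archive <- "file1.csv.7z"  # example name taken from the question
cmd <- sevenzip_stream_cmd(archive)

# Only attempt the read when both the binary and the archive are present.
if (nzchar(Sys.which("7za")) && file.exists(archive)) {
  # read.csv opens the pipe, reads the first rows, and closes it again.
  first_rows <- read.csv(pipe(cmd), nrows = 10)
  print(first_rows)
}
```

This still loads whatever you read into memory, so it only helps with disk space, not with RAM.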

Miha Trošt answered Nov 03 '22 00:11