Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to break a large CSV data file into individual data files using R?

Tags:

r

csv

I have a CSV file the first row of which contains the variables names and the rest of the rows contains the data. What's a good way to break it up into files each containing just one variable in R? Is this solution going to be robust? E.g. what if the input file is 100G in size?

The input files looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 (or however many variables) files var1.csv, var2.csv, var3.csv so that files resemble File1

var1
1
2
...

File2

var2?
2
5
...

File3

var3
hello
yay

I got a solution in Python (How to break a large CSV data file into individual data files?) but I wonder if R can do the same thing? Essential the Python code reads the csv file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file all at once and this can slow the whole process down. Plus it can't read a 100G file and process it as R attempts to read the whole file into memory. I can't find a command in R that let's you read a csv file line by line. Please help. Thanks!!

like image 531
xiaodai Avatar asked Jul 31 '10 02:07

xiaodai


People also ask

How do I open a CSV file with more than 1 million rows?

So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.


1 Answers

You can scan and then write to a file(s) one line at a time.

i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
       length(x) > 1}) {
  write(x[1], "file1.csv", sep = ",", append = T)
  write(x[2], "file2.csv", sep = ",", append = T)
  write(x[3], "file3.csv", sep = ",", append = T)
  i <- i + 1
}

edit!! I am using the above data, copied over 1000 times. I've done a comparison of speed when we have the file connection open at all times.

ver1 <- function() {
  i <- 0
  while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], "file1.csv", sep = ",", append = T)
    write(x[2], "file2.csv", sep = ",", append = T)
    write(x[3], "file3.csv", sep = ",", append = T)
    i <- i + 1
  }
}

system.time(ver1()) # w/ close to 3K lines of data, 3 columns
##    user  system elapsed 
##   2.809   0.417   3.629 

ver2 <- function() {
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], file = f1, sep = ",", append = T, ncol = 1)
    write(x[2], file = f2, sep = ",", append = T, ncol = 1)
    write(x[3], file = f3, sep = ",", append = T, ncol = 1)
  } 
  closeAllConnections()
}

system.time(ver2())
##   user  system elapsed 
##   0.257   0.098   0.409 
like image 162
apeescape Avatar answered Oct 07 '22 13:10

apeescape