I have a CSV file the first row of which contains the variables names and the rest of the rows contains the data. What's a good way to break it up into files each containing just one variable in R? Is this solution going to be robust? E.g. what if the input file is 100G in size?
The input files looks like
var1,var2,var3
1,2,hello
2,5,yay
...
I want to create 3 (or however many variables) files var1.csv, var2.csv, var3.csv so that files resemble File1
var1
1
2
...
File2
var2?
2
5
...
File3
var3
hello
yay
I got a solution in Python (How to break a large CSV data file into individual data files?) but I wonder if R can do the same thing? Essential the Python code reads the csv file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file all at once and this can slow the whole process down. Plus it can't read a 100G file and process it as R attempts to read the whole file into memory. I can't find a command in R that let's you read a csv file line by line. Please help. Thanks!!
So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.
You can scan
and then write
to a file(s) one line at a time.
i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], "file1.csv", sep = ",", append = T)
write(x[2], "file2.csv", sep = ",", append = T)
write(x[3], "file3.csv", sep = ",", append = T)
i <- i + 1
}
edit!! I am using the above data, copied over 1000 times. I've done a comparison of speed when we have the file connection open at all times.
ver1 <- function() {
i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], "file1.csv", sep = ",", append = T)
write(x[2], "file2.csv", sep = ",", append = T)
write(x[3], "file3.csv", sep = ",", append = T)
i <- i + 1
}
}
system.time(ver1()) # w/ close to 3K lines of data, 3 columns
## user system elapsed
## 2.809 0.417 3.629
ver2 <- function() {
f <- file("file.csv", "r")
f1 <- file("file1.csv", "w")
f2 <- file("file2.csv", "w")
f3 <- file("file3.csv", "w")
while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], file = f1, sep = ",", append = T, ncol = 1)
write(x[2], file = f2, sep = ",", append = T, ncol = 1)
write(x[3], file = f3, sep = ",", append = T, ncol = 1)
}
closeAllConnections()
}
system.time(ver2())
## user system elapsed
## 0.257 0.098 0.409
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With