Imagine that I have a huge matrix (toy example):
dat <- matrix(1:100,ncol = 10)
colnames(dat) <- paste0("X",1:ncol(dat))
dat
      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
 [1,]  1 11 21 31 41 51 61 71 81  91
 [2,]  2 12 22 32 42 52 62 72 82  92
 [3,]  3 13 23 33 43 53 63 73 83  93
 [4,]  4 14 24 34 44 54 64 74 84  94
 [5,]  5 15 25 35 45 55 65 75 85  95
 [6,]  6 16 26 36 46 56 66 76 86  96
 [7,]  7 17 27 37 47 57 67 77 87  97
 [8,]  8 18 28 38 48 58 68 78 88  98
 [9,]  9 19 29 39 49 59 69 79 89  99
[10,] 10 20 30 40 50 60 70 80 90 100
This matrix is split into groups of columns and written to several files:
path <- paste0(getwd(),"/example/")
dir.create(path = path)
column_id <- matrix(1:10, ncol = 2, byrow = TRUE)
for (i in 1:nrow(column_id)) write.csv(dat[, column_id[i, ]], file = paste0(path, i, ".csv"), row.names = FALSE)
Files in the folder:
dir(path)
[1] "1.csv" "2.csv" "3.csv" "4.csv" "5.csv"
Read the files:
paths <- paste0(path,dir(path))
lapply(paths, \(Path) head(read.csv(Path), 2))
[[1]]
  X1 X2
1  1 11
2  2 12

[[2]]
  X3 X4
1 21 31
2 22 32

[[3]]
  X5 X6
1 41 51
2 42 52

[[4]]
  X7 X8
1 61 71
2 62 72

[[5]]
  X9 X10
1 81  91
2 82  92
The question is: how can I combine all these files column by column in R, without loading them completely into memory (because the files are very large), so that the complete combined file ends up on disk?
Appending columns makes sense in a columnar data format such as SQL or Arrow. The suggestion in the comments to use SQL is a good one, and there is an answer using RSQLite which does this.
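For reference, here is a minimal sketch of what the SQLite route can look like, assuming the DBI and RSQLite packages are installed. The database file name combined.sqlite and the table names t1..t5 are made up for illustration, and the read.csv() calls below still read each piece whole, so for really large files you would append each file to its table in chunks instead:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), paste0(path, "combined.sqlite"))

# One table per input file (for very large files, append each file in
# chunks with append = TRUE rather than reading it whole)
for (i in seq_along(paths)) {
  dbWriteTable(con, paste0("t", i), read.csv(paths[i]), overwrite = TRUE)
}

# Stitch the tables together column-wise on their implicit rowid;
# the combined table lives on disk inside the database file
dbExecute(con, "
  CREATE TABLE combined AS
  SELECT * FROM t1
  JOIN t2 ON t1.rowid = t2.rowid
  JOIN t3 ON t1.rowid = t3.rowid
  JOIN t4 ON t1.rowid = t4.rowid
  JOIN t5 ON t1.rowid = t5.rowid")

dbGetQuery(con, "SELECT * FROM combined LIMIT 2")
dbDisconnect(con)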
In a row-based format such as CSV, appending a column is essentially writing to the middle of a plain-text file. This means loading the file from (at best) somewhere on the first line and re-writing everything after that point.
The best way to write a large plain-text file without loading the entire contents into RAM is one line at a time. In your case, that means reading the corresponding line from each input file, using cbind() on the pieces to make a complete row, and appending that row to the output file. This would simply be:
outfile <- "./example/big_out_file.csv"
N_ROWS <- 11

for (i in seq(0, N_ROWS - 1)) {
  line <- do.call(
    cbind,
    lapply(paths, \(f) read.table(f, sep = ",", skip = i, nrows = 1))
  )
  write.table(line, outfile, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}
The key is setting append = TRUE in write.table(), so that each line is appended to the existing text file (or the file is created where it does not yet exist).
I set N_ROWS to 11 here as we know the input had 10 rows plus a header. If you do not know how many lines there are, you can check without reading the file into memory by using fpeek::peek_count_lines() on the first chunk, i.e.:
N_ROWS <- fpeek::peek_count_lines(paths[1])
Or, if you prefer, there are other approaches in the answers to this question.
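For instance, here is a base-R sketch (the count_lines() helper name is my own, not from those answers) that counts lines by streaming the file through a connection in blocks, so the file is never held in memory all at once:
count_lines <- function(f, block = 10000L) {
  con <- file(f, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    # readLines() on an open connection continues where it left off
    lines <- readLines(con, n = block)
    if (length(lines) == 0L) break
    n <- n + length(lines)
  }
  n
}
N_ROWS <- count_lines(paths[1])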
This approach has the advantage of never loading any of the input files into memory in full, which you indicated in the comments was your preference. It reads one line of each input file at a time. This means many more read operations, so it will be much slower than reading each file once, but that is the trade-off. Depending on your memory constraints, you may be able to speed things up by reading more than one line at a time; if you do, make sure to read and write the header separately to avoid the following rows being coerced to character, as sketched below.
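Here is a sketch of such a chunked variant, reusing paths and N_ROWS from above; the file name big_out_file_chunked.csv and the chunk size CHUNK are arbitrary choices for illustration:
outfile_chunked <- "./example/big_out_file_chunked.csv"
CHUNK <- 4

# Write the header once, taken from the first line of each input file
header <- do.call(cbind, lapply(paths, \(f) read.table(f, sep = ",", nrows = 1)))
write.table(header, outfile_chunked, col.names = FALSE, row.names = FALSE, sep = ",")

# Then append up to CHUNK data rows at a time, always skipping the header line
offset <- 0
while (offset < N_ROWS - 1) {
  n <- min(CHUNK, N_ROWS - 1 - offset)
  block <- do.call(
    cbind,
    lapply(paths, \(f) read.table(f, sep = ",", skip = offset + 1, nrows = n))
  )
  write.table(block, outfile_chunked, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
  offset <- offset + n
}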
We can check that, when read back in, the output is the same as dat, the original matrix:
m <- as.matrix(read.csv(outfile))
head(m, 2)
#      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,]  1 11 21 31 41 51 61 71 81  91
# [2,]  2 12 22 32 42 52 62 72 82  92
identical(dat, m)
# [1] TRUE