Imagine that I have a huge matrix (toy example):
dat <- matrix(1:100,ncol = 10)
colnames(dat) <- paste0("X",1:ncol(dat))
dat
      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
 [1,]  1 11 21 31 41 51 61 71 81  91
 [2,]  2 12 22 32 42 52 62 72 82  92
 [3,]  3 13 23 33 43 53 63 73 83  93
 [4,]  4 14 24 34 44 54 64 74 84  94
 [5,]  5 15 25 35 45 55 65 75 85  95
 [6,]  6 16 26 36 46 56 66 76 86  96
 [7,]  7 17 27 37 47 57 67 77 87  97
 [8,]  8 18 28 38 48 58 68 78 88  98
 [9,]  9 19 29 39 49 59 69 79 89  99
[10,] 10 20 30 40 50 60 70 80 90 100
This matrix is split into groups of columns and written to several files:
path <- paste0(getwd(),"/example/")
dir.create(path = path)
column_id <- matrix(1:10, ncol = 2, byrow = TRUE)
for (i in 1:nrow(column_id)) write.csv(dat[, column_id[i, ]], file = paste0(path, i, ".csv"), row.names = FALSE)
Files in the folder:
dir(path)
[1] "1.csv" "2.csv" "3.csv" "4.csv" "5.csv"
Read the files:
paths <- paste0(path,dir(path))
lapply(paths, \(Path) head(read.csv(Path), 2))
[[1]]
  X1 X2
1  1 11
2  2 12

[[2]]
  X3 X4
1 21 31
2 22 32

[[3]]
  X5 X6
1 41 51
2 42 52

[[4]]
  X7 X8
1 61 71
2 62 72

[[5]]
  X9 X10
1 81  91
2 82  92
The question is: how can I combine all these files column by column in R, without loading them completely into memory (because the files are very large), so that the complete combined file ends up on disk?
Appending columns makes sense in a columnar data format such as SQL or Arrow. The suggestion in the comments to use SQL is a good one, and there is an answer using RSQLite which does this.
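For reference, here is a minimal sketch of what the SQLite route can look like, assuming the DBI and RSQLite packages are installed. The database file name combined.sqlite and the table names t1..t5 are made up for illustration, and the read.csv() calls below still read each piece whole, so for really large files you would append each file to its table in chunks instead:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), paste0(path, "combined.sqlite"))

# One table per input file (for very large files, append each file in
# chunks with append = TRUE rather than reading it whole)
for (i in seq_along(paths)) {
  dbWriteTable(con, paste0("t", i), read.csv(paths[i]), overwrite = TRUE)
}

# Stitch the tables together column-wise on their implicit rowid;
# the combined table lives on disk inside the database file
dbExecute(con, "
  CREATE TABLE combined AS
  SELECT * FROM t1
  JOIN t2 ON t1.rowid = t2.rowid
  JOIN t3 ON t1.rowid = t3.rowid
  JOIN t4 ON t1.rowid = t4.rowid
  JOIN t5 ON t1.rowid = t5.rowid")

dbGetQuery(con, "SELECT * FROM combined LIMIT 2")
dbDisconnect(con)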
In a row-based format such as CSV, appending a column is essentially writing to the middle of a plain-text file. This means loading the file from (at best) somewhere on the first line and re-writing everything after that point.
The best way to write a large plain-text file without loading the entire contents into RAM is one line at a time. In your case, that means reading the corresponding line from each input file, using cbind() on the pieces to make a complete row, and appending that row to the output file. This would simply be:
outfile <- "./example/big_out_file.csv"
N_ROWS <- 11

for (i in seq(0, N_ROWS - 1)) {
  line <- do.call(
    cbind,
    lapply(paths, \(f) read.table(f, sep = ",", skip = i, nrows = 1))
  )
  write.table(line, outfile, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
}
The key is setting append = TRUE in write.table(), so that each line is appended to the existing text file (or the file is created where it does not yet exist).
I set N_ROWS to 11 here as we know the input had 10 rows plus a header. If you do not know how many lines there are, you can check without reading the file into memory by using fpeek::peek_count_lines() on the first chunk, i.e.:
N_ROWS <- fpeek::peek_count_lines(paths[1])
Or, if you prefer, there are other approaches in the answers to this question.
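For instance, here is a base-R sketch (the count_lines() helper name is my own, not from those answers) that counts lines by streaming the file through a connection in blocks, so the file is never held in memory all at once:
count_lines <- function(f, block = 10000L) {
  con <- file(f, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    # readLines() on an open connection continues where it left off
    lines <- readLines(con, n = block)
    if (length(lines) == 0L) break
    n <- n + length(lines)
  }
  n
}
N_ROWS <- count_lines(paths[1])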
This approach has the advantage of never loading any of the input files into memory in full, which you indicated in the comments was your preference. It reads one line of each input file at a time. This means many more read operations, so it will be much slower than reading each file once, but that is the trade-off. Depending on your memory constraints, you may be able to speed things up by reading more than one line at a time; if you do, make sure to read and write the header separately to avoid the following rows being coerced to character, as sketched below.
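Here is a sketch of such a chunked variant, reusing paths and N_ROWS from above; the file name big_out_file_chunked.csv and the chunk size CHUNK are arbitrary choices for illustration:
outfile_chunked <- "./example/big_out_file_chunked.csv"
CHUNK <- 4

# Write the header once, taken from the first line of each input file
header <- do.call(cbind, lapply(paths, \(f) read.table(f, sep = ",", nrows = 1)))
write.table(header, outfile_chunked, col.names = FALSE, row.names = FALSE, sep = ",")

# Then append up to CHUNK data rows at a time, always skipping the header line
offset <- 0
while (offset < N_ROWS - 1) {
  n <- min(CHUNK, N_ROWS - 1 - offset)
  block <- do.call(
    cbind,
    lapply(paths, \(f) read.table(f, sep = ",", skip = offset + 1, nrows = n))
  )
  write.table(block, outfile_chunked, col.names = FALSE, append = TRUE, sep = ",", row.names = FALSE)
  offset <- offset + n
}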
We can check that, when read back in, the output is the same as dat, the original matrix:
m <- as.matrix(read.csv(outfile))
head(m, 2)
#      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,]  1 11 21 31 41 51 61 71 81  91
# [2,]  2 12 22 32 42 52 62 72 82  92
identical(dat, m)
# [1] TRUE