I have 15 million CSV files, each with two columns (integer and float), and between 5 and 500 rows. Each file looks something like:
3453,0.034
31,0.031
567,0.456
...
Currently, I am iterating over all the files, and using read.csv()
to import each file into a big list. Here's a simplified version:
allFileNames = Sys.glob(sprintf("%s/*/*/results/*/*", dir))

s$scores = list()

for (i in 1:length(allFileNames)){
    if ((i %% 1000) == 0){
        cat(sprintf("%d of %d\n", i, length(allFileNames)))
    }

    fileName = allFileNames[i]
    approachID = getApproachID(fileName)
    bugID = getBugID(fileName)

    size = file.info(fileName)$size
    if (!is.na(size) && size > 0){ # make sure file exists and is not empty
        tmp = read.csv(fileName, header=F, colClasses=c("integer", "numeric"))
        colnames(tmp) = c("fileCode", "score")
        s$scores[[approachID]][[bugID]] = tmp
    } else {
        # File does not exist, or is empty.
        s$scores[[approachID]][[bugID]] = matrix(-1, ncol=2, nrow=1)
    }
}
Later in my code, I go back through each matrix in the list, and calculate some metrics.
After starting this import process, it looks like it will take on the order of 3 to 5 days to complete. Is there a faster way to do this?
EDIT: I added more details about my code.
Using the readr package: you can consider this as a third option for loading multiple CSV files into R. This method uses the read_csv() function from the readr package; readr is a third-party package, so you first need to install it with install.packages('readr'). For files beyond 100 MB in size, fread() and read_csv() can be expected to be around 5 times faster than read.csv(). vroom is a bit slower than fread() for pure numeric data (the data.table implementation is very fast), but because vroom is multi-threaded it is a bit quicker than readr and read.csv(). Also keep in mind that R objects live entirely in memory: even on 64-bit systems you cannot index objects with huge numbers of rows and columns (the roughly 2-billion vector index limit), so in practice you hit a file size limit around 2-4 GB.
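As a rough illustration (not from the question, which only uses read.csv()), reading the files with data.table::fread() and stacking them could look something like the sketch below. It assumes the same allFileNames vector and the fileCode/score column names used in the question:

    library(data.table)

    # Sketch: read each small file with fread() and stack everything into one
    # data.table, keeping the index of the source file in an "file" id column.
    readOne = function(f) {
        fread(f, header = FALSE, col.names = c("fileCode", "score"),
              colClasses = c("integer", "numeric"))
    }
    allScores = rbindlist(lapply(allFileNames, readOne), idcol = "file")

With millions of tiny files the per-call overhead may still dominate, so this mainly illustrates the API rather than guaranteeing a speed-up.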
I'm not clear on your goal, but if you're trying to read all of these files into a single R data structure, then I see two major performance concerns:

1. Per-file access and parsing overhead: every read.csv() call has a roughly constant cost, which adds up over millions of files.
2. Growing your combined data structure with each new file you read, which gets progressively more expensive as the structure grows.

So do some quick profiling and see how long the reads are taking. If they're slowing down progressively as you read in more files, then let's focus on problem #2. If they're consistently slow, then let's worry about problem #1.

Regarding solutions, I'd say you could start with two things (see the sketch below):

1. Replace read.csv() with the lower-level scan() or readLines() functions, which do less work per file.
2. Preallocate your combined data structure so it doesn't have to be regrown on every iteration.

Add some more details about your code (what does the list you're using look like?) and we may be able to be more helpful.
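To make the two suggestions concrete, here is a minimal sketch (mine, not part of the answer) that preallocates a flat, file-indexed list and fills it with scan(). It keeps the -1 sentinel from the question but drops the nested approachID/bugID structure just to stay short; mapping back to those IDs is left out:

    # Preallocate a list of the right length instead of growing a nested list.
    scores = vector("list", length(allFileNames))

    for (i in seq_along(allFileNames)) {
        fileName = allFileNames[i]
        size = file.info(fileName)$size
        if (!is.na(size) && size > 0) {
            # scan() returns a two-element list: integer codes and numeric scores.
            tmp = scan(fileName, what = list(fileCode = 0L, score = 0.0),
                       sep = ",", quiet = TRUE)
            scores[[i]] = matrix(c(tmp$fileCode, tmp$score), ncol = 2)
        } else {
            scores[[i]] = matrix(-1, ncol = 2, nrow = 1)   # sentinel, as in the question
        }
    }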
Using scan (as Joshua states in a comment) could be faster (3-4 times):

    scan(fileName, what=list(0L,0.0), sep=",", dec=".", quiet=TRUE)

The main difference is that scan returns a list with two elements, while read.csv returns a data.frame.
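For example (a small illustration, not from the original answer), the scan() result can be converted into the same shape the read.csv() call in the question produces:

    tmp = scan(fileName, what = list(0L, 0.0), sep = ",", dec = ".", quiet = TRUE)
    # scan() gives list(integer vector, numeric vector); name the columns to
    # match the read.csv() version in the question.
    tmp = data.frame(fileCode = tmp[[1]], score = tmp[[2]])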