read.csv is extremely slow in reading csv files with large numbers of columns


I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column, and all fields contain integer values between 0 and 10. When I try to load this file with read.csv it is extremely slow, even when I add the parameter nrows=100. Is there a way to speed up read.csv, or some other function I can use instead to load the file into memory as a matrix or data.frame?

Thanks in advance.

rninja asked Sep 07 '11

1 Answer

If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:

‘read.table’ is not the right tool for reading large matrices, especially those with many columns: it is designed to read _data frames_ which may have columns of very different classes. Use ‘scan’ instead for matrices.

Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed or memory consumption is a concern, setting the colClasses argument is a huge help.
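A minimal sketch of both approaches, assuming the file really is all integers with a single header row; the file name example.csv and the 8000/40000 dimensions are taken from the question.

    # scan() approach: skip = 1 drops the header, what = integer() avoids type guessing
    vals <- scan("example.csv", what = integer(), sep = ",", skip = 1)

    # scan() returns a flat vector, so reshape it; byrow = TRUE because the
    # file is stored row by row
    m <- matrix(vals, ncol = 8000, byrow = TRUE)

    # read the header line separately if column names are needed
    colnames(m) <- scan("example.csv", what = character(), sep = ",", nlines = 1)

    # read.csv approach: pre-declaring colClasses (and nrows) skips per-column
    # class detection, which is costly with 8000 columns
    df <- read.csv("example.csv", colClasses = rep("integer", 8000), nrows = 40000)

Note that scan() gives you a matrix directly, which is also a better fit than a data.frame when every column has the same type.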

Joshua Ulrich answered Oct 02 '22