 

Reading in large text files in R

Tags:

r

I want to read in a large .ido file that has just under 110,000,000 rows and 8 columns. The columns are made up of 2 integer columns and 6 logical columns, and the file uses "|" as the delimiter. I tried using read.big.matrix and it took forever. I also tried dumpDf and it ran out of RAM. I tried ff, which I heard was a good package, but I am struggling with errors. I would like to do some analysis on this table if I can read it in somehow. If anyone has any suggestions, that would be great. Kind Regards, Lorcan

asked Aug 02 '12 by Lorcan Treanor


2 Answers

Thank you for all your suggestions. I managed to figure out why I kept getting errors when reading the file. I'll share the answer and what went wrong so no one makes the same mistake.

First of all, the data that was given to me contained some errors, so I was doomed to fail from the start. I was unaware of this until a colleague came across the problem in another piece of software. A column that was supposed to contain only integers had some letters in it, so when read.table.ffdf from the ff package tried to read in the data set it produced errors. In any case, I was then given another sample of data, 16,000,000 rows and 8 columns with correct entries, and it worked perfectly. The code that I ran is as follows and took about 30 seconds to read the file:

setwd("D:/data test")
library(ff)

# read the pipe-delimited file into an ffdf (file-backed data frame)
ffdf1 <- read.table.ffdf(file = "test.ido", header = TRUE, sep = "|")
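
A quick way to check for this kind of problem before importing is to verify that every line has the expected number of fields and that the integer columns really contain only digits. The following is a rough sketch of such a check, assuming the same pipe-delimited layout with a header row as above (it is an illustration, not the exact check that was used):

# Every line of the pipe-delimited file should have exactly 8 fields;
# any other count points to a malformed row.
table(count.fields("test.ido", sep = "|"))

# Read a small sample as character and look for non-numeric entries in a
# column that should contain only integers (here the first column).
sample_rows <- read.table("test.ido", sep = "|", header = TRUE,
                          nrows = 10000, colClasses = "character")
sum(!grepl("^-?[0-9]+$", sample_rows[[1]]))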

Thank you all for your time, and if you have any questions about the answer, feel free to ask and I will do my best to help.

answered Oct 16 '22 by Lorcan Treanor


Do you really need all the data for your analysis? Maybe you could aggregate your dataset (say, from minute values to daily averages). This aggregation only needs to be done once, and can hopefully be done in chunks. That way you do not need to load all your data into memory at once.

Reading in chunks can be done using scan; the important arguments are skip and n. Alternatively, put your data into a database and extract the chunks from there. You could even use the functions from the plyr package to process the chunks in parallel; see this blog post of mine for an example.
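
As a rough sketch of the chunked approach (the file name, the 2-integer plus 6-logical column layout, and the assumption that the logical columns are stored as TRUE/FALSE are all taken or inferred from the question), each chunk can be read with scan and aggregated before the next one is read. Here nlines is used to cap the number of lines per chunk, alongside skip:

path       <- "test.ido"
chunk_rows <- 1e6                              # lines to read per chunk
cols       <- list(integer(), integer(), logical(), logical(),
                   logical(), logical(), logical(), logical())

total <- 0; n_total <- 0
skip  <- 1                                     # skip the header line
repeat {
  chunk <- scan(path, what = cols, sep = "|",
                skip = skip, nlines = chunk_rows, quiet = TRUE)
  n_read <- length(chunk[[1]])
  if (n_read == 0) break                       # no lines left
  # aggregate the chunk, e.g. keep a running sum of the first column
  # (coerced to numeric to avoid integer overflow)
  total   <- total + sum(as.numeric(chunk[[1]]))
  n_total <- n_total + n_read
  skip    <- skip + n_read
}
total / n_total                                # overall mean of column 1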

answered Oct 16 '22 by Paul Hiemstra