 

How to read large dataset in R [duplicate]

Tags: r, large-data

Possible Duplicate:
Quickly reading very large tables as dataframes in R

Hi,

While trying to read a large dataset in R, the console displayed the following errors:

data <- read.csv("UserDailyStats.csv", sep = ",", header = TRUE,
                 na.strings = "-", stringsAsFactors = FALSE)
data <- data[complete.cases(data), ]
dataset <- data.frame(user_id                 = as.character(data[, 1]),
                      event_date              = as.character(data[, 2]),
                      day_of_week             = as.factor(data[, 3]),
                      distinct_events_a_count = as.numeric(as.character(data[, 4])),
                      total_events_a_count    = as.numeric(as.character(data[, 5])),
                      events_a_duration       = as.numeric(as.character(data[, 6])),
                      distinct_events_b_count = as.numeric(as.character(data[, 7])),
                      total_events_b          = as.numeric(as.character(data[, 8])),
                      events_b_duration       = as.numeric(as.character(data[, 9])))
Error: cannot allocate vector of size 94.3 Mb
In addition: Warning messages:
1: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[,  :
  NAs introduced by coercion
2: In data.frame(user_msisdn = as.character(data[, 1]), calls_date = as.character(data[,  :
  NAs introduced by coercion
3: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)
4: In class(value) <- "data.frame" :
  Reached total allocation of 3583Mb: see help(memory.size)

Does anyone know how to read large datasets? The size of UserDailyStats.csv is approximately 2 GB.

asked Dec 07 '22 by Niko Gamulin


2 Answers

Sure:

  1. Get a bigger computer, in particular more RAM.
  2. Run a 64-bit OS; see point 1 about more RAM now that you can use it.
  3. Read only the columns you need.
  4. Read fewer rows.
  5. Read the data in binary rather than re-parsing 2 GB of text, which is mighty inefficient (points 3 to 5 are sketched below).
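
For instance, a minimal sketch of points 3 to 5 against the asker's file (the colClasses vector assumes the nine-column layout shown in the question; which columns to drop and the .rds filename are illustrative):

cols <- c("character", "character", "factor", rep("numeric", 6))

# 3. Read only the columns you need: "NULL" in colClasses skips a
#    column entirely, so it never occupies memory (dropping the last
#    three columns here is purely illustrative).
cols_subset <- cols
cols_subset[7:9] <- "NULL"
data <- read.csv("UserDailyStats.csv", na.strings = "-",
                 stringsAsFactors = FALSE, colClasses = cols_subset)

# 4. Read fewer rows: nrows caps how many rows are read in one go.
sample_rows <- read.csv("UserDailyStats.csv", na.strings = "-",
                        stringsAsFactors = FALSE, colClasses = cols,
                        nrows = 10000)

# 5. Read in binary: parse the text once, save the parsed object, and
#    reload the binary version on later runs instead of re-parsing 2 GB.
saveRDS(data, "UserDailyStats.rds")
data <- readRDS("UserDailyStats.rds")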

There is also a manual for this at the R site: the R Data Import/Export manual.

answered Dec 21 '22 by Dirk Eddelbuettel


You could try specifying the data types in the read.csv call using colClasses.

data <- read.csv("UserDailyStats.csv", sep = ",", header = TRUE,
                 na.strings = "-", stringsAsFactors = FALSE,
                 colClasses = c("character", "character", "factor",
                                rep("numeric", 6)))

Even so, with a dataset of this size it may still be problematic, and there won't be much memory left for any analysis you want to do. Adding RAM and using 64-bit computing would give you more flexibility.
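
If you want to check how much headroom remains after the load, a quick sketch (the numbers will of course depend on the actual file):

print(object.size(data), units = "Mb")   # memory used by the loaded data frame
gc()                                     # free unused memory and report totals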

answered Dec 21 '22 by James