 

Manipulation of Large Files in R

Tags: sql, r, data.table

I have 15 files of data, each around 4.5GB. Each file is a month's worth of data for around 17,000 customers; together, the data covers 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each representing a month, I have 17,000 files, one per customer, each holding all of that customer's data. I wrote a script to do this:

library(data.table)

# the variable 'files' is a vector of locations of the 15 month files
exists = NULL  # this vector keeps track of customers who have a file created for them
for (w in 1:15) {  # for each of the 15 month files
  month = fread(files[w], select = c(2, 3, 6, 16))  # read in the columns I want
  custlist = unique(month$CustomerID)  # all customers in this month file
  for (i in seq_along(custlist)) {  # for each customer in this month file
    curcust = custlist[i]  # the current customer
    newchunk = subset(month, CustomerID == curcust)  # all the data for this customer
    filename = sprintf("cust%s", curcust)  # the filename for this customer
    if (curcust %in% exists) {  # if a file has been created, open it, add to it, and write it back
      custfile = fread(sprintf("C:/custFiles/%s.csv", filename))  # read in existing file
      custfile$V1 = NULL  # remove the extra row-name column that write.csv added
      custfile = rbind(custfile, newchunk)  # combine read-in data with our new data
      write.csv(custfile, file = sprintf("C:/custFiles/%s.csv", filename))
    } else {  # if it has not been created, write newchunk to a csv
      write.csv(newchunk, file = sprintf("C:/custFiles/%s.csv", filename))
      exists = c(exists, curcust)  # add customer to list of existing files
    }
  }
}

The script works (at least, I'm pretty sure). The problem is that it is incredibly slow; at the rate I'm going, it will take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.

asked Apr 12 '15 by Ore M

2 Answers

Like @Dominic Comtois, I would also recommend using SQL.
R can handle fairly big data (there is a nice benchmark of 2 billion rows that beats Python), but because R runs mostly in memory, you need a good machine to make it work. Still, your case doesn't require loading more than one 4.5GB file at a time, so it should be well doable on a personal computer; see the second approach for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from there. If you don't know SQL, you may want to use some simple database. The simplest way from R is RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency, because the RSQLite package ships with the database engine embedded.

library(RSQLite)
library(data.table)
conn <- dbConnect(SQLite(), dbname = "mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for",monthfile,"loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# with bigger result sets I recommend converting to data.table by reference, as below
setDT(df)
dbDisconnect(conn)

That's all. You get to use SQL without much of the overhead usually associated with databases.
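To get from there to one file per customer, here is a minimal sketch under the same assumptions (the table and column names come from the example above; fwrite needs data.table 1.9.8+, otherwise use write.csv):

library(DBI)
library(RSQLite)
library(data.table)
conn <- dbConnect(SQLite(), dbname = "mydbfile.db")
# the distinct customer ids drive one query and one output file per customer
ids <- dbGetQuery(conn, "select distinct customerid from mytablename")$customerid
for (id in ids) {
  cust <- dbGetQuery(conn, "select * from mytablename where customerid = ?",
                     params = list(id))
  fwrite(cust, file = sprintf("cust%s.csv", id))
}
dbDisconnect(conn)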

If you prefer to go with the approach from your post, I think you can speed it up dramatically by doing the csv writes by group while aggregating in data.table.

library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  fread(monthfile)[, {
    fn <- paste0(CustomerID, ".csv")
    # write.csv ignores append, so use write.table and only write the header once
    write.table(.SD, file = fn, sep = ",", row.names = FALSE,
                append = file.exists(fn), col.names = !file.exists(fn))
  }, by = CustomerID]
  cat("data for",monthfile,"written to csv\n")
}

So you utilize the fast unique from data.table and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.

library(data.table)
data.table(a = 1:4, b = 5:6)[, write.csv(.SD, file = paste0(b, ".csv")), by = b]

Update 2016-12-05:
Starting with data.table 1.9.8, you can replace write.csv/write.table with fwrite; see the example in this answer.
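Since that linked example is not reproduced here, a minimal sketch of the fwrite variant (same placeholder file names as above; fwrite honours append, unlike write.csv):

library(data.table)
monthfiles <- c("month1", "month2") # ...
for (monthfile in monthfiles) {
  # one group per customer; append after the first write so the months accumulate
  fread(monthfile)[, {
    fn <- paste0(CustomerID, ".csv")
    new <- !file.exists(fn)
    fwrite(.SD, file = fn, append = !new, col.names = new)
  }, by = CustomerID]
}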

answered Sep 16 '22 by jangorecki


I think you already have your answer. But to reinforce it, see the official documentation,

R Data Import/Export,

which states:

In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is to suggest that rather than duplicating functionality in R we can make another system do the work! (For example Therneau & Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use package survival in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.

So clearly, storage of massive data is not R's primary strength, yet it provides interfaces to several tools specialized for it. In my own work, the lightweight SQLite solution is enough, though that is partly a matter of preference. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.

You should find SQLite's documentation pretty smooth to follow. If you have enough programming experience, doing a tutorial or two should get you going quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common and basic queries, such as CREATE TABLE and SELECT ... WHERE, will likely meet all your needs.
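For illustration, a rough sketch of those basics run through RSQLite; the table name is borrowed from the first answer and the columns are placeholders for the four you actually load:

library(DBI)
library(RSQLite)
conn <- dbConnect(SQLite(), dbname = "mydbfile.db")
# create the table once, then load and query as needed
dbExecute(conn, "CREATE TABLE IF NOT EXISTS mytablename (
                   CustomerID INTEGER, SomeValue REAL)")
onecust <- dbGetQuery(conn, "SELECT * FROM mytablename WHERE CustomerID = 42")
dbDisconnect(conn)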

Edit

Another advantage of using a DBMS that I didn't mention is views: they make other organizations of the data easily accessible, so to speak. By creating views, you can go back to the "by month" view of the data without rewriting any table or duplicating any data.
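As a sketch of the idea (the view name and the month column are hypothetical, assuming the month files were loaded into one table with a month marker):

library(DBI)
library(RSQLite)
conn <- dbConnect(SQLite(), dbname = "mydbfile.db")
# a view is just a stored query: no table is rewritten, no data duplicated
dbExecute(conn, "CREATE VIEW IF NOT EXISTS month1 AS
                 SELECT * FROM mytablename WHERE month = 1")
m1 <- dbGetQuery(conn, "SELECT * FROM month1")  # reads like a table
dbDisconnect(conn)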

answered Sep 18 '22 by Dominic Comtois