I have 15 files of data, each around 4.5 GB. Each file is a month's worth of data for around 17,000 customers. All together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each representing a month, I have 17,000 files, one per customer, containing all of that customer's data. I wrote a script to do this:
library(data.table)

# the variable 'files' is a vector of paths to the 15 month files
exists <- NULL  # keeps track of customers who already have a file created for them
for (w in 1:15) {                                    # for each of the 15 month files
  month <- fread(files[w], select = c(2, 3, 6, 16))  # read in only the columns I want
  custlist <- unique(month$CustomerID)               # all customers in this month file
  for (i in 1:length(custlist)) {                    # for each customer in this month file
    curcust <- custlist[i]                           # the current customer
    newchunk <- subset(month, CustomerID == curcust) # all of this month's data for this customer
    filename <- sprintf("cust%s", curcust)           # the filename for this customer
    if (curcust %in% exists) {
      # a file has already been created for this customer: read it, add to it, and write it back
      custfile <- fread(sprintf("C:/custFiles/%s.csv", filename))   # read in the existing file
      custfile$V1 <- NULL                            # remove the extra row-name column write.csv added
      custfile <- rbind(custfile, newchunk)          # combine the existing data with the new data
      write.csv(custfile, file = sprintf("C:/custFiles/%s.csv", filename))
    } else {
      # no file yet for this customer: write newchunk to a new csv
      write.csv(newchunk, file = sprintf("C:/custFiles/%s.csv", filename))
      exists <- c(exists, curcust)                   # add customer to the list of existing files
    }
  }
}
The script works (at least, I'm pretty sure). The problem is that it is incredibly slow. At the rate I'm going, it will take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.
Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data (there is a nice benchmark of 2 billion rows that beats Python), but because R runs mostly in memory you need a good machine to make it work. Still, your case doesn't need to load more than one 4.5 GB file at once, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from the database.
If you don't know SQL, you may want to use some simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency, because the RSQLite package ships with the database engine embedded.
library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
monthfiles <- c("month1", "month2") # ...
# write data
for (monthfile in monthfiles) {
  dbWriteTable(conn, "mytablename", fread(monthfile), append = TRUE)
  cat("data for", monthfile, "loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data I would recommend doing the below
setDT(df)
dbDisconnect(conn)
That's all. You get to use SQL without really having to deal with much of the overhead usually related to databases.
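To get from the database back to one file per customer, you can loop over the distinct IDs and write each query result out. A minimal sketch, assuming the column is named CustomerID as in your script and the table name from the snippet above (fwrite needs data.table 1.9.8+; write.csv works too, just slower):
library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
# all distinct customers in the table (column name assumed to be CustomerID)
custids <- dbGetQuery(conn, "select distinct CustomerID from mytablename")$CustomerID
for (custid in custids) {
  onecust <- dbGetQuery(conn, sprintf("select * from mytablename where CustomerID = %s", custid))
  fwrite(onecust, file = sprintf("C:/custFiles/cust%s.csv", custid))
}
dbDisconnect(conn)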
If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing the write.csv by group while aggregating in data.table.
library(data.table)
monthfiles <- c("month1", "month2") # ...
# write data, one csv per customer, grouping by CustomerID
for (monthfile in monthfiles) {
  # note: write.csv ignores 'append'; with data.table 1.9.8+ use fwrite(..., append = TRUE)
  # so that later months are appended rather than overwriting (see update below)
  fread(monthfile)[, write.csv(.SD, file = paste0(CustomerID, ".csv"), append = TRUE), by = CustomerID]
  cat("data for", monthfile, "written to csv\n")
}
So you utilize data.table's fast grouping (which takes care of the unique customer IDs) and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.
library(data.table)
# writes 5.csv and 6.csv, one file per group of b
data.table(a = 1:4, b = 5:6)[, write.csv(.SD, file = paste0(b, ".csv")), b]
Update 2016-12-05: starting from data.table 1.9.8+ you can replace write.csv with fwrite, example in this answer.
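For reference, the loop above rewritten with fwrite might look like the following sketch (assuming data.table 1.9.8+; unlike write.csv, fwrite honours append = TRUE, so later months are appended to each customer's file instead of overwriting it):
library(data.table)
monthfiles <- c("month1", "month2") # ...
for (monthfile in monthfiles) {
  # one file per customer; rows from later months are appended
  fread(monthfile)[, fwrite(.SD, file = paste0(CustomerID, ".csv"), append = TRUE), by = CustomerID]
  cat("data for", monthfile, "written to csv\n")
}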
I think you already have your answer. But to reinforce it, see the official documentation, R Data Import/Export, which states:
In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is to suggest that rather than duplicating functionality in R we can make another system do the work! (For example Therneau & Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use package survival in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.
So clearly, storage and manipulation of massive data is not R's primary strength, yet it provides interfaces to several tools specialized for this. In my own work, the lightweight SQLite solution is enough, even if it is, to some extent, a matter of preference. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.
You should find SQLite's documentation pretty smooth to follow. If you have enough programming experience, doing a tutorial or two should get you going pretty quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common & basic queries such as CREATE TABLE, SELECT ... WHERE will likely meet all your needs.
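To illustrate how little SQL the task actually needs, here is a minimal sketch (the table and column names are made up; the index on the customer column is what keeps the per-customer SELECTs fast):
library(DBI)
library(RSQLite)
conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
# an explicit schema is optional: dbWriteTable() will create the table for you if it does not exist
dbExecute(conn, "CREATE TABLE IF NOT EXISTS customers (CustomerID INTEGER, colA TEXT, colB REAL, colC REAL)")
dbExecute(conn, "CREATE INDEX IF NOT EXISTS idx_customers_id ON customers (CustomerID)")
# the only query pattern this task really needs
onecust <- dbGetQuery(conn, "SELECT * FROM customers WHERE CustomerID = 42")
dbDisconnect(conn)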
Edit
Another advantage of using a DBMS that I didn't mention is that you can have views, which make other ways of organizing the data easily accessible, so to speak. By creating views, you can go back to the "by month" view of the data without having to rewrite any table or duplicate any data.
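For instance, a view that restores the by-month organization can be created once and then queried like a table. A sketch, assuming the table has a Month column (the table, column, and view names are illustrative):
library(DBI)
library(RSQLite)
conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
# a view is just a stored query; no data is copied or duplicated
dbExecute(conn, "CREATE VIEW IF NOT EXISTS month_01 AS SELECT * FROM mytablename WHERE Month = 1")
january <- dbGetQuery(conn, "SELECT * FROM month_01")
dbDisconnect(conn)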