
Data analysis using R/python and SSDs

Does anyone have any experience using R or Python with data stored on solid-state drives? If you are doing mostly reads, in theory this should significantly improve the load times of large datasets. I want to find out if this is true, and whether it is worth investing in SSDs to improve the I/O rates of data-intensive applications.

asked Nov 24 '10 by signalseeker

1 Answer

My 2 cents: an SSD only pays off if your applications are stored on it, not your data. And even then, only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling: I can tell you without doing it that almost all of the reading time goes to processing, not to reading from the disk.
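A minimal sketch of that kind of check (the file name is a placeholder for your own data): split the raw read from the parsing, and you will usually see the parsing dominate.

> # Raw read: essentially pure disk I/O
> system.time(txt <- readLines("mydata.txt"))
> # Parsing only, from memory, with no disk access involved
> system.time(df <- read.table(textConnection(txt), header = TRUE))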

It pays off far more to think about the format of your data than about where it's stored. A speedup in reading your data can be obtained by using the right applications and the right format, like using R's internal format instead of fumbling around with text files. Make that an exclamation mark: never keep fumbling around with text files. Go binary if speed is what you need.
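As a sketch of what going binary looks like in R (the object and file names here are illustrations, not the asker's data):

> saveRDS(mydata, file = "mydata.rds")   # R's binary format, single object
> mydata <- readRDS("mydata.rds")        # fast load, no text parsing
> save(mydata, file = "mydata.RData")    # same idea for several objects at once
> load("mydata.RData")                   # restores 'mydata' into the workspace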

Due to that overhead, it generally doesn't make a difference whether you read your data from an SSD or a normal disk. I have both, and use the normal disk for all my data. I juggle big datasets sometimes, and have never experienced a problem with it. Of course, if I have to go really heavy, I just work on our servers.

So it might make a difference when we're talking gigs and gigs of data, but even then I doubt very much that disk access is the limiting factor. Unless you're continuously reading and writing to the disk, but then I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSDs, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.
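If you're weighing extra memory against an SSD, a quick sanity check (a sketch; 'mydata' stands in for your own object) is to see whether the dataset fits in RAM in the first place:

> # In-memory footprint of the loaded object
> print(object.size(mydata), units = "MB")

If it fits comfortably, read it once and keep it in the workspace; repeated analyses then never touch the disk at all.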

A timing experiment using a bogus data frame, reading and writing in text format vs. binary format, on an SSD vs. a normal disk. (The binary timings come out as zero because they fall below the timer's resolution.)

> tt <- 100
> longtext <- paste(rep("dqsdgfmqslkfdjiehsmlsdfkjqsefr",1000),collapse="")
> test <- data.frame(
+     X1=rep(letters,tt),
+     X2=rep(1:26,tt),
+     X3=rep(longtext,26*tt)
+ )

> SSD <- "C:/Temp" # My ssd disk with my 2 operating systems on it.
> normal <- "F:/Temp" # My normal disk, I use for data

> # Write text 
> system.time(write.table(test,file=paste(SSD,"test.txt",sep="/")))
   user  system elapsed 
   5.66    0.50    6.24 

> system.time(write.table(test,file=paste(normal,"test.txt",sep="/")))
   user  system elapsed 
   5.68    0.39    6.08 

> # Write binary
> system.time(save(test,file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(save(test,file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> # Read text 
> system.time(read.table(file=paste(SSD,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.57    0.05    8.61 

> system.time(read.table(file=paste(normal,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.53    0.09    8.63 

> # Read binary
> system.time(load(file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(load(file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 
answered Sep 17 '22 by Joris Meys