
Performing calculations by subsets of data in R

Tags: foreach, r, subset

I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here:

> summary(companydataRETS)
     PERMNO           RET           
 Min.   :10000   Min.   :-0.971698  
 1st Qu.:32716   1st Qu.:-0.011905  
 Median :61735   Median : 0.000000  
 Mean   :56788   Mean   : 0.000799  
 3rd Qu.:80280   3rd Qu.: 0.010989  
 Max.   :93436   Max.   :19.000000  

My solution so far was to create a variable with all possible company numbers

compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"]

I then use a foreach loop with parallel computing that calls my function get.rho(), which in turn performs the desired calculations:

rhos <- foreach (i=1:length(compns), .combine=rbind) %dopar% 
      get.rho(subset(companydataRETS[,"RET"],companydataRETS$PERMNO == compns[i]))

I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish.

I am new to R, so I imagine my code structure can be improved and that there is a better (quicker, less computationally intensive) way to perform this same task (perhaps using apply or with, neither of which I understand). Any suggestions?

Vivi asked Oct 08 '22 10:10


1 Answer

As suggested by Joran, I looked into the library data.table. The modifications to the code are

library(data.table) 
companydataRETS <- data.table(companydataRETS)
setkey(companydataRETS,PERMNO)

rhos <- foreach (i=1:length(compns), .combine=rbind) %do% 
      get.rho(companydataRETS[J(compns[i])]$RET)

I ran the code once as I originally had it (using subset) and once using data.table, with the variable compns comprising only 30 of the 28659 companies in the dataset. Here are the outputs of system.time() for the two versions:

Using subset:

   user  system elapsed
 43.925  12.413  56.337

Using data.table:

   user  system elapsed
  0.229   0.047   0.276

(For some reason, using %do% instead of %dopar% made the original code run faster. The system.time() shown for subset is the one using %do%, the faster of the two in this case.)

I had left the original code running overnight and it hadn't finished after 5 hours, so I gave up and killed it. With this small modification I had my results in less than 5 minutes (I think about 3 mins)!
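One likely reason for the gap: each subset() call compares the whole PERMNO column against a single company number, so the full data set is scanned once per company, whereas a keyed data.table lookup uses a binary search on the sorted key. The sketch below illustrates the "scan once" idea in base R on toy data. It is not the code used above: the toy data frame and the body of get.rho() (here a lag-1 autocorrelation) are assumptions made purely for illustration, since the real get.rho() is not shown.

```r
# Hypothetical stand-in for get.rho(): lag-1 autocorrelation of a return series.
# The real get.rho() is not shown in this post.
get.rho <- function(ret) {
  n <- length(ret)
  if (n < 3) return(NA_real_)
  cor(ret[-1], ret[-n])
}

# Toy data with the same column names as companydataRETS (values are made up).
set.seed(1)
toy <- data.frame(
  PERMNO = rep(c(10000L, 10001L, 10002L), each = 50),
  RET    = rnorm(150, sd = 0.02)
)

# Slow pattern: one full scan of PERMNO per company.
ids  <- sort(unique(toy$PERMNO))
slow <- sapply(ids, function(p) get.rho(toy$RET[toy$PERMNO == p]))

# Faster pattern: split the returns by company once, then apply the function.
by_company <- split(toy$RET, toy$PERMNO)
fast <- vapply(by_company, get.rho, numeric(1))

# Both approaches give the same numbers.
stopifnot(isTRUE(all.equal(unname(fast), slow)))
```

On 72 million rows, split() still has to hold every group in memory, so the data.table route may remain the better choice; this only shows why repeated subsetting scales badly.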

EDIT

There is an even easier way to do it with data.table, without foreach at all: replace the last line of the code above with

rhos <- companydataRETS[ , get.rho(RET), by=PERMNO]
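For anyone who wants to try the grouped form end to end, here is a minimal self-contained sketch on toy data. The toy table and the body of get.rho() (a lag-1 autocorrelation) are hypothetical placeholders, since the real function is not shown; only the `by=PERMNO` pattern mirrors the line above.

```r
library(data.table)

# Hypothetical stand-in for get.rho(): lag-1 autocorrelation of a return series.
get.rho <- function(ret) {
  n <- length(ret)
  if (n < 3) return(NA_real_)
  cor(ret[-1], ret[-n])
}

# Toy data with the same column names as companydataRETS (values are made up).
set.seed(1)
toy <- data.table(
  PERMNO = rep(c(10000L, 10001L, 10002L), each = 50),
  RET    = rnorm(150, sd = 0.02)
)
setkey(toy, PERMNO)

# One call computes get.rho() per company; .(rho = ...) names the result column.
rhos <- toy[, .(rho = get.rho(RET)), by = PERMNO]
print(rhos)
```

The result has one row per PERMNO. Without the `.(rho = ...)` wrapper, as in the one-liner above, data.table names the result column V1.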
Vivi answered Oct 12 '22 21:10