Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dealing with repetitive tasks in R

I often find myself having to perform repetitive tasks in R. It gets extremely frustrating having to constantly run the same function on one or more data structures over and over again.

For example, let's say I have three separate data frames in R, and I want to delete the rows in each data frame which possess a missing value. With three data frames, it's not all that difficult to run na.omit() on each of the df's, but it can get extremely inefficient when one has one hundred similar data structures which require the same action.

df1 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
             variable=c(2004,2004,2004,2004,2004,2004), value=c(35,20,20,50,30,NA))

df2 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
            variable=c(2005,2005,2005,2005,2005,2005), value=c(55,350,40,90,99,NA))

df3 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
           variable=c(2006,2006,2006,2006,2006,2006), value=c(300,200,200,500,300,NA))

tot04 <- na.omit(df1)
tot05 <- na.omit(df2)
tot06 <- na.omit(df3)

What are some general guidelines for dealing with repetitive tasks in R?

Yes, I recognise that the answer to this question is specific to the task that one faces, but I'm just asking about general things that a user should consider when they have a repetitive task.

like image 742
ATMathew Avatar asked May 12 '11 02:05

ATMathew


3 Answers

As a general guideline, if you have several objects that you want to apply the same operations to, you should collect them into one data structure. Then you can use loops, [sl]apply, etc to do the operations in one go. In this case, instead of having separate data frames df1, df2, etc, you could put them into a list of data frames and then run na.omit on all of them:

dflist <- list(df1, df2, <...>)
dflist <- lapply(dflist, na.omit)
like image 161
Hong Ooi Avatar answered Nov 14 '22 09:11

Hong Ooi


Besides @Hong Ooi answer I suggest looking into packages plyr and reshape. In your case following might be useful:

df1$name <- "var1"
df2$name <- "var2" 
df3$name <- "var3"
df <- rbind(df1,df2,df3)
df <- na.omit(df)

##Get various means:
> ddply(df,~name,summarise,AvgName=mean(value))
  name AvgName
  1 var1    31.0
  2 var2   126.8
  3 var3   300.0

> ddply(df,~Region,summarise,AvgRegion=mean(value)) 
     Region AvgRegion
1    Africa 190.00000
2      Asia 130.00000
3    Europe  86.66667
4 N.America 213.33333
5 S.America 143.00000


> ddply(df,~variable,summarise,AvgVar=mean(value))
  variable AvgVar
1     2004   31.0
2     2005  126.8
3     2006  300.0

##Transform the data.frame into another format   
> cast(Region+variable~name,data=df)
      Region variable var1 var2 var3
1     Africa     2004   20   NA   NA
2     Africa     2005   NA  350   NA
3     Africa     2006   NA   NA  200
4       Asia     2004   35   NA   NA
5       Asia     2005   NA   55   NA
6       Asia     2006   NA   NA  300
7     Europe     2004   20   NA   NA
8     Europe     2005   NA   40   NA
9     Europe     2006   NA   NA  200
10 N.America     2004   50   NA   NA
11 N.America     2005   NA   90   NA
12 N.America     2006   NA   NA  500
13 S.America     2004   30   NA   NA
14 S.America     2005   NA   99   NA
15 S.America     2006   NA   NA  300
like image 29
mpiktas Avatar answered Nov 14 '22 08:11

mpiktas


If the names are similar you could iterate over them using the pattern argument to ls:

for (i in ls(pattern="df")){
  assign(paste("t",i,sep=""),na.omit(get(i)))
}

However, a more "R" way of doing it seems to be to use separate environment and eapply:

# setup environment
env <- new.env()

# copy dataframes across (using common pattern)
for (i in ls(pattern="df")){
  asssign(i,get(i),envir=env)
  }

# apply function on environment
eapply(env,na.omit)

Which yields:

$df3
     Region variable value
1      Asia     2006   300
2    Africa     2006   200
3    Europe     2006   200
4 N.America     2006   500
5 S.America     2006   300

$df2
     Region variable value
1      Asia     2005    55
2    Africa     2005   350
3    Europe     2005    40
4 N.America     2005    90
5 S.America     2005    99

$df1
     Region variable value
1      Asia     2004    35
2    Africa     2004    20
3    Europe     2004    20
4 N.America     2004    50
5 S.America     2004    30

Unfortunately, this is one huge list so getting this out as seperate objects is a little tricky. Something on the lines of:

lapply(eapply(env,na.omit),function(x) assign(paste("t",substitute(x),sep=""),x,envir=.GlobalEnv))

should work, but the substitute is not picking out the list element names properly.

like image 41
James Avatar answered Nov 14 '22 10:11

James