
How can I improve this R function?

Tags:

r

I am new to R. I created the function below to calculate the mean of a pollutant across the data contained in 332 CSV files. I am seeking advice on how I could improve this code. It takes about 38 seconds to run, which makes me think it is not very efficient.

pollutantmean <- function(directory, pollutant, id = 1:332) {
        files_list <- list.files(directory, full.names = TRUE) # list all files in the directory
        dat <- data.frame() # create an empty data frame
        for (i in id) {
                dat <- rbind(dat, read.csv(files_list[i])) # append each monitor's data
        }
        good <- complete.cases(dat) # flag rows that have no NA in any column
        mean(dat[good, pollutant]) # calculate the mean over the complete rows
} # run time ~37 sec - NEED TO OPTIMISE THE CODE
RiskyB asked Jun 04 '26 12:06



1 Answer

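Instead of creating an empty data.frame and calling rbind on it in every iteration of a for loop, you can store all the data.frames in a list and combine them in one step. Growing a data.frame inside a loop copies the accumulated data on each iteration, which is why the original is slow. You can also use the na.rm argument of mean to ignore NA values. Note that na.rm = TRUE only drops NAs in the selected column, whereas complete.cases drops every row that has an NA in any column, so the two versions can return different means.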

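pollutantmean <- function(directory, pollutant, id = 1:332)
{
    # keep only the files for the requested monitor ids
    files_list = list.files(directory, full.names = TRUE)[id]
    # read every file into a list, then bind all rows in a single call
    df         = do.call(rbind, lapply(files_list, read.csv))

    # mean of the requested column, ignoring NA values in that column
    mean(df[[pollutant]], na.rm=TRUE)
}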
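
To try the function without the original 332 files, here is a minimal sketch that writes three tiny CSV files to a temporary directory and runs both versions on them. The file names, the column names (sulfate, nitrate, ID) and the values are made up for illustration and are not taken from the question's data; pollutantmean_loop is just a hypothetical name for the question's loop-based version. On this data it also shows how the two ways of handling NA differ.

```r
# Hypothetical demo data: three small CSV files in a fresh temporary directory.
demo_dir <- tempfile("pollutant_demo")
dir.create(demo_dir)

for (i in 1:3) {
    monitor <- data.frame(
        sulfate = c(1, NA, 3) * i,   # one NA per file in sulfate
        nitrate = c(NA, 2, 4) * i,   # one NA per file in nitrate, on a different row
        ID      = i
    )
    write.csv(monitor, file.path(demo_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

# Version from the answer: read into a list, bind once, drop NAs in one column.
pollutantmean <- function(directory, pollutant, id = 1:332) {
    files_list <- list.files(directory, full.names = TRUE)[id]
    df <- do.call(rbind, lapply(files_list, read.csv))
    mean(df[[pollutant]], na.rm = TRUE)
}

# Version from the question: grow the data frame in a loop, keep complete rows only.
pollutantmean_loop <- function(directory, pollutant, id = 1:332) {
    files_list <- list.files(directory, full.names = TRUE)
    dat <- data.frame()
    for (i in id) {
        dat <- rbind(dat, read.csv(files_list[i]))
    }
    good <- complete.cases(dat)
    mean(dat[good, pollutant])
}

na_rm_result    <- pollutantmean(demo_dir, "sulfate", id = 1:3)       # 4
complete_result <- pollutantmean_loop(demo_dir, "sulfate", id = 1:3)  # 6

print(na_rm_result)
print(complete_result)
```

The results differ because the loop version discards a sulfate reading whenever nitrate is missing on the same row. If you need the original behaviour, filter with complete.cases on the combined data frame before taking the mean.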

Optionally, you can improve readability with magrittr pipes:

library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    list.files(directory, full.names = TRUE)[id] %>%
        lapply(read.csv) %>%
        do.call(rbind,.) %>%
        extract2(pollutant) %>%
        mean(na.rm=TRUE)
}
Colonel Beauvel answered Jun 06 '26 04:06
