
How can I avoid complex for loops?

Tags: for-loop, r

I am currently working with a series of large datasets and I'm trying to improve how I write scripts in R. I tend to rely mostly on for loops, which I know can be cumbersome and slow, especially with very large datasets.

I have heard a lot of people recommending the apply() family to avoid complex for loops, but I am struggling to get my head around using them to apply multiple functions in one go.

Here is some simple example data:

A <- data.frame('Area' = c(4, 6, 5),
                'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
                'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')

What I want to do is sort the data by the 'flow' variable in descending order, then add columns for the percentage of total 'flow' and 'Area' that each data point represents, and finally add two more columns holding the cumulative percentage of each variable.

Currently I use this for loop:

sort_files <- list()
n <- 1
for(i in files){
  name <- frames[n]
  nom <- paste(name,'_sorted', sep = '')
  data <- i[order(-i$flow),]
  area <- sum(i$Area)
  total <- sum(i$flow)
  data$area_portion <- (data$Area/area)*100
  data$flow_portion <- (data$flow/total)*100
  data$cum_area <- cumsum(data$area_portion)
  data$cum_flow <- cumsum(data$flow_portion)
  assign(nom, data)
  df <- get(paste(name,'_sorted', sep = ''))
  sort_files[[nom]] <- df
  n <- n + 1
}

Which works, but seems overly complex and ugly, and I'm sure it will run far slower than better scripts.

How can I simplify and neaten up the above code?

This is the expected output:

sort_files

$`A_sorted`
  Area flow area_portion flow_portion  cum_area  cum_flow
1    4    1     26.66667     33.33333  26.66667  33.33333
2    6    1     40.00000     33.33333  66.66667  66.66667
3    5    1     33.33333     33.33333 100.00000 100.00000

$B_sorted
  Area flow area_portion flow_portion  cum_area cum_flow
2    8    2     44.44444           50  44.44444       50
1    6    1     33.33333           25  77.77778       75
3    4    1     22.22222           25 100.00000      100
tom91 asked Jan 31 '19 08:01


2 Answers

Using lapply to loop over files and dplyr's mutate to add the new columns:

library(dplyr)

setNames(lapply(files, function(x) 
          x %>%
            arrange(desc(flow)) %>%
            mutate(area_portion = Area/sum(Area)*100, 
                   flow_portion = flow/sum(flow) * 100, 
                   cum_area = cumsum(area_portion),
                   cum_flow = cumsum(flow_portion))
),paste0(frames, "_sorted"))


#$A_sorted
#  Area flow area_portion flow_portion  cum_area  cum_flow
#1    4    1     26.66667     33.33333  26.66667  33.33333
#2    6    1     40.00000     33.33333  66.66667  66.66667
#3    5    1     33.33333     33.33333 100.00000 100.00000

#$B_sorted
#  Area flow area_portion flow_portion  cum_area cum_flow
#1    8    2     44.44444           50  44.44444       50
#2    6    1     33.33333           25  77.77778       75
#3    4    1     22.22222           25 100.00000      100
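
One difference from the expected output in the question: in the result above the rows of B_sorted are renumbered 1, 2, 3, whereas the original loop keeps the row labels 2, 1, 3. If the original positions matter, a minimal sketch is to store them in a column before sorting. The column name orig_row is hypothetical and not part of the answer above.

```r
library(dplyr)

B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))

B_sorted <- B %>%
  mutate(orig_row = row_number()) %>%  # remember each row's position before sorting
  arrange(desc(flow))                  # largest flow first

B_sorted$orig_row
# 2 1 3
```

The same mutate() step could be placed in front of arrange() inside the lapply call.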

Or, going fully tidyverse, we can replace lapply with map and setNames with set_names:

library(tidyverse)

map(set_names(files, str_c(frames, "_sorted")), 
  . %>% arrange(desc(flow)) %>%
  mutate(area_portion = Area/sum(Area)*100, 
         flow_portion = flow/sum(flow) * 100, 
         cum_area = cumsum(area_portion),
         cum_flow = cumsum(flow_portion)))

Updated the tidyverse approach following some pointers from @Moody_Mudskipper.
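
The `. %>% ...` form passed to map may look unusual: a magrittr pipeline that starts with the dot placeholder is not evaluated immediately but builds a function of one argument. A small sketch to illustrate this, where pct is a hypothetical name used only for this example:

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

# A pipeline starting with `.` creates a function rather than a value
pct <- . %>% mutate(flow_portion = flow / sum(flow) * 100)

is.function(pct)
# TRUE

B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
pct(B)$flow_portion
# 25 50 25
```

This is why the pipeline can be handed to map directly, without wrapping it in function(x).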

Ronak Shah answered Oct 02 '22 10:10


You could also define a function first ..

f <- function(data) {

  # sort data by flow, largest first
  # ([[ ]] extracts the column as a plain vector rather than a one-column data frame)
  data <- data[order(data[["flow"]], decreasing = TRUE), ]

  # add the percentage columns
  data[["area_portion"]] <- data[["Area"]] / sum(data[["Area"]]) * 100
  data[["flow_portion"]] <- data[["flow"]] / sum(data[["flow"]]) * 100

  # add the cumulative percentage columns
  data[["cum_area"]] <- cumsum(data[["area_portion"]])
  data[["cum_flow"]] <- cumsum(data[["flow_portion"]])
  data
}

.. and use lapply to, ahhm, apply f to your list

out <- lapply(files, f)
out
#[[1]]
#  Area flow area_portion flow_portion  cum_area  cum_flow
#1    4    1     26.66667     33.33333  26.66667  33.33333
#2    6    1     40.00000     33.33333  66.66667  66.66667
#3    5    1     33.33333     33.33333 100.00000 100.00000

#[[2]]
#  Area flow area_portion flow_portion  cum_area cum_flow
#2    8    2     44.44444           50  44.44444       50
#1    6    1     33.33333           25  77.77778       75
#3    4    1     22.22222           25 100.00000      100

If you want to change the names of out you can use setNames

out <- setNames(lapply(files, f), paste0(c("A", "B"), "_sorted"))
# or
# out <- setNames(lapply(files, f), paste0(unlist(frames), "_sorted"))
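
If the names are attached when the list is built, lapply carries them through to the result and the separate frames list is no longer needed. A minimal sketch, assuming the input list can be named A and B up front; add_portions is a hypothetical helper that repeats the steps of f so that the block runs on its own:

```r
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))

# Name the elements when the list is created
files <- list(A = A, B = B)

# Hypothetical helper: same steps as f
add_portions <- function(data) {
  data <- data[order(data$flow, decreasing = TRUE), ]    # largest flow first
  data$area_portion <- data$Area / sum(data$Area) * 100  # percentage of total Area
  data$flow_portion <- data$flow / sum(data$flow) * 100  # percentage of total flow
  data$cum_area <- cumsum(data$area_portion)             # cumulative percentages
  data$cum_flow <- cumsum(data$flow_portion)
  data
}

out <- lapply(files, add_portions)           # the names A and B are kept
names(out) <- paste0(names(out), "_sorted")  # append the suffix afterwards

names(out)
# "A_sorted" "B_sorted"
```

Individual results can then be accessed by name, for example out$B_sorted.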
markus answered Oct 02 '22 09:10