I've heard that you're not meant to force a procedural programming style onto R. I'm finding this pretty hard. I've just solved a problem with a for loop. Is this wrong? Is there a better, more "R-style" solution?
The problem: I have two columns, Col1 and Col2. Col1 contains job titles that have been entered in a free-form way. I want to use Col2 to collect these job titles into categories (so "Junior Technician", "Engineering technician" and "Mech. tech." are all listed as "Technician").
I've done it like this:
jobcategories <- list(
  "Junior Technician|Engineering technician|Mech. tech." = "Technician",
  "Manager|Senior Manager|Group manager|Pain in the ****" = "Manager",
  "Admin|Administrator|Group secretary" = "Administrator")
# Each list name is a regular expression; rows of Col1 that match it get the category.
for (currentjob in names(jobcategories)) {
  df$Col2[grep(currentjob, df$Col1)] <- jobcategories[[currentjob]]
}
This produces the right results, but I can't shake the feeling that (because of my procedural experience) I'm not using R properly. Could an R expert put me out of my misery?
EDIT
I was asked for the original data. Unfortunately, I can't supply it, because it's got confidential info in it. It's basically two columns. The first column holds just over 400 rows of different job titles (and the odd personal name). There are about 20 different categories that these 400 titles can be split into. The second column starts off as NA, then gets populated after running the for loop.
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
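As a rough illustration of wrapping a loop in a function, here is a minimal sketch based on the loop from the question. The helper name categorise, the sample data frame, and the shortened, dot-escaped patterns are all assumptions made for this example; they are not the asker's real data.

```r
# Hypothetical helper: the question's loop, wrapped in a reusable function.
# `categories` is a named list whose names are regular expressions.
categorise <- function(titles, categories) {
  out <- rep(NA_character_, length(titles))  # unmatched titles stay NA
  for (pattern in names(categories)) {
    out[grepl(pattern, titles)] <- categories[[pattern]]
  }
  out
}

# Patterns adapted from the question (dots escaped so they match literally).
jobcategories <- list(
  "Junior Technician|Engineering technician|Mech\\. tech\\." = "Technician",
  "Manager|Senior Manager|Group manager" = "Manager",
  "Admin|Administrator|Group secretary" = "Administrator"
)

# Made-up sample data standing in for the confidential original.
df <- data.frame(
  Col1 = c("Junior Technician", "Group manager", "Admin", "Jane Doe"),
  stringsAsFactors = FALSE
)

# The caller no longer sees the loop, only the function call.
df$Col2 <- categorise(df$Col1, jobcategories)
```

The loop still exists inside the function; what changes is that the calling code reads as a single operation on a whole column.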
Using a 'for' loop is not wrong, but there are often alternatives that make code easier to read and less complex. A loop can also slow a program down when a vectorized alternative exists, so it is worth checking for one before writing a loop.
A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.
There is a lot of overhead in the processing because R needs to check the type of a variable nearly every time it looks at it. This makes it easy to change types and reuse variable names, but slows down computation for very repetitive tasks, like performing an action in a loop.
You are right that for loops are often discouraged in R, and in my experience this is for two main reasons:
As eloquently described in Circle 2 of The R Inferno, it can be extremely inefficient to grow an object one element at a time, as is often the temptation in for loops. For instance, this is a pretty common yet inefficient workflow, because it reallocates output on each iteration of the loop:
output <- c()
for (idx in indices) {
  scalar <- compute.new.scalar(idx)
  output <- c(output, scalar)
}
This inefficiency can be removed by pre-allocating output to the proper size and filling it inside the for loop, or by using a function like sapply.
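Both fixes can be sketched as follows. The body of compute.new.scalar and the contents of indices are placeholders invented for this example, since the text does not define them.

```r
# Hypothetical stand-ins for the undefined names used in the text.
compute.new.scalar <- function(idx) idx^2
indices <- 1:10

# Option 1: pre-allocate the full result, then fill it by position.
output <- numeric(length(indices))
for (i in seq_along(indices)) {
  output[i] <- compute.new.scalar(indices[i])
}

# Option 2: let sapply() collect the results.
output2 <- sapply(indices, compute.new.scalar)
```

Either way the result vector is allocated once rather than being copied on every iteration.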
The second source of inefficiency comes from performing a for loop over a fast operation when a vectorized alternative exists. For instance, consider the following code:
s <- 0
for (elt in x) {
  s <- s + elt
}
This is a for loop over a very fast operation (adding two numbers), and the overhead of the loop will be significant compared to the vectorized sum function, which adds up all the elements in the vector. The sum function is quick because it's implemented in C, so it will be more efficient to do s <- sum(x) than to use the for loop (not to mention less typing). Sometimes it takes more creativity to figure out how to replace a for loop that has a fast interior with a vectorized alternative (cumsum and diff come up a lot), but it can lead to significant efficiency improvements. In cases where you have a fast loop interior but can't figure out how to use vectorized functions to achieve the same thing, I've found that reimplementing the loop with the Rcpp package can yield a faster alternative.
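To give a feel for the cumsum and diff replacements mentioned above, here is a small sketch. The vector x and the variable names are made up for illustration.

```r
# Made-up input vector for illustration.
x <- c(3, 1, 4, 1, 5)

# Loop version: running total, one element at a time.
running <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  running[i] <- total
}

# Vectorized version of the same running total.
running_vec <- cumsum(x)

# diff() gives the change between neighbouring elements,
# which would otherwise need a loop over x[i + 1] - x[i].
steps <- diff(x)
```

Here cumsum replaces the whole loop in one call, and diff covers the common case of comparing each element with its predecessor.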
For loops can be slow if you are growing objects incorrectly, or if the loop has a very fast interior and the entire thing can be replaced with a vectorized operation. Otherwise you're probably not losing too much efficiency, as the apply family of functions is performing for loops on the inside, too.
for loops are not 'evil' in R, but they are typically slow compared to vector-based methods and frequently not the best available solution. However, they are easy to implement and easy to understand, and you should not underestimate the value of either of these.
In my view, therefore, you should use a for loop if you need to get something done quickly, can't see a better way to do it, and don't need to worry too much about speed.