Looping over rows in a dataframe

Tags: performance, r

Suppose I need to loop over the rows in a data frame for some reason.

I create a simple data.frame

df <- data.frame(id = sample(1e6, 1e7, replace = TRUE))

It seems that f2 is much slower than f1, while I expected them to be equivalent.

f1 <- function(v) {
    for (obs in 1:(1e6)) {
        a <- v[obs]
    }
    a
}
system.time(f1(df$id))

f2 <- function() {
    for (obs in 1:(1e6)) {
        a <- df$id[obs]
    }
    a
}
system.time(f2())

Would you know why? Do they use exactly the same amount of memory?

asked May 28 '15 17:05

Matthew

2 Answers

If you write your timings like this instead, and recognize that df$x is really a function call (to `$`(df, x)) that is evaluated again on every iteration, the mystery disappears:

system.time(for(i in 1:1e6) df$x)
#    user  system elapsed 
#    8.52    0.00    8.53 
system.time(for(i in 1) df$x)
#    user  system elapsed 
#       0       0       0
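
To make the "function call" point concrete, here is a minimal sketch. It assumes a much smaller stand-in for the question's df (1,000 rows, so it runs instantly), and the names via_operator, via_call and via_brackets are only illustrative. It checks that the operator form, the explicit call to `$`, and `[[` all return the same vector, so each df$id inside a loop is a full extraction of the column:

```r
# Small stand-in for the question's data frame (assumption: 1,000 rows
# instead of 1e7, purely to keep the example fast).
set.seed(1)
df <- data.frame(id = sample(100, 1000, replace = TRUE))

# df$id is syntactic sugar for a call to the function `$`.
via_operator <- df$id
via_call     <- `$`(df, id)

# `[[` with the column name extracts the same vector.
via_brackets <- df[["id"]]

identical(via_operator, via_call)
identical(via_operator, via_brackets)
```

How expensive each call is depends on the R version, so the timings above should be read as one machine's measurement rather than a fixed cost.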

answered Oct 26 '22 06:10

Josh O'Brien

In f1, you bypass the data frame entirely by just passing a vector to your function. So your code is essentially "I have a vector! This is the first element. This is the second element. This is the third..."

By contrast, f2 works with the whole data frame and extracts the ID column again on every iteration, just to read a single element from it. So your code is "I have a data frame. This is the first element of the ID column. This is the second element of the ID column. This is the third..."

It's much faster to extract the simple data structure (the vector) once and then work only with that, rather than repeatedly extracting the simple structure from the larger object.
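
As a rough sketch of that advice, assuming a smaller data frame than the one in the question and two hypothetical helper functions (f_repeat and f_hoist are names made up for this example), you can hoist the column extraction out of the loop and confirm that the result is unchanged:

```r
# Smaller stand-in for the question's data (assumption: 1e4 rows, 1,000 iterations).
set.seed(1)
df <- data.frame(id = sample(1e3, 1e4, replace = TRUE))
n <- 1000L

# Extracts the column from the data frame on every iteration (like f2).
f_repeat <- function(df, n) {
  a <- NULL
  for (obs in seq_len(n)) {
    a <- df$id[obs]
  }
  a
}

# Extracts the column once, then indexes the plain vector (like f1).
f_hoist <- function(df, n) {
  ids <- df$id
  a <- NULL
  for (obs in seq_len(n)) {
    a <- ids[obs]
  }
  a
}

# Both versions return the same value; only the work done per iteration differs.
identical(f_repeat(df, n), f_hoist(df, n))

# Compare the timings on your own machine; the gap depends on the R version.
system.time(f_repeat(df, n))
system.time(f_hoist(df, n))
```

Passing df as an argument, rather than reading a global as f2 does, is a choice made here only to keep the sketch self-contained.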

answered Oct 26 '22 07:10

Gregor Thomas
