Looping over rows in a dataframe

Tags: performance, r

Suppose I need to loop over the rows in a data frame for some reason.

I create a simple data.frame:

df <- data.frame(id = sample(1e6, 1e7, replace = TRUE))

It seems that f2 is much slower than f1, while I expected them to be equivalent.

# f1 receives a plain vector; the column is extracted once, by the caller
f1 <- function(v) {
  for (obs in 1:1e6) {
    a <- v[obs]
  }
  a
}
system.time(f1(df$id))

# f2 reads the global df and extracts the column on every iteration
f2 <- function() {
  for (obs in 1:1e6) {
    a <- df$id[obs]
  }
  a
}
system.time(f2())

Would you know why? Do they use exactly the same amount of memory?

Matthew asked May 28 '15 17:05




2 Answers

If you write your timings like this instead, and recognize that df$x is really a function call, `$`(df, x), that is evaluated on every iteration, the mystery disappears:

system.time(for(i in 1:1e6) df$x)
#    user  system elapsed 
#    8.52    0.00    8.53 
system.time(for(i in 1) df$x)
#    user  system elapsed 
#       0       0       0 
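
To make the "function call" point concrete, here is a minimal sketch. It assumes a small data frame with the question's `id` column (the size is chosen only so it runs quickly), and the names `via_operator`, `via_call`, and `via_brackets` are illustrative. It shows that the operator form and the explicit call return the same vector, so each `df$id` inside a loop is a fresh call rather than a free lookup.

```r
# Small illustrative data frame (size is an assumption, not the question's 1e7 rows)
set.seed(1)
df <- data.frame(id = sample(100, 1000, replace = TRUE))

# The operator form and the explicit function call are the same operation
via_operator <- df$id
via_call     <- `$`(df, id)

# `[[` is another way to perform the same extraction
via_brackets <- df[["id"]]

identical(via_operator, via_call)
identical(via_operator, via_brackets)
```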
Josh O'Brien answered Oct 26 '22 06:10


In f1, you bypass the data frame entirely by just passing a vector to your function. So your code is essentially "I have a vector! This is the first element. This is the second element. This is the third..."

By contrast, in f2, you give it a whole data frame and then extract a single column again every time you want one element. So your code is "I have a data frame. This is the first element of the ID column. This is the second element of the ID column. This is the third..."

It's much faster to extract the simple data structure (the vector) once and then work only with that, rather than repeatedly extracting it from the larger object.
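
As a rough sketch of that advice, the extraction can be hoisted out of the loop. The function names `loop_extract_once` and `loop_extract_each` and the small data size are assumptions made for illustration; they are not from the question. Both functions return the same value, and only the number of `$` calls differs.

```r
# Small illustrative data frame (size is an assumption so the example runs quickly)
set.seed(1)
df <- data.frame(id = sample(1e3, 1e4, replace = TRUE))

# Extract the column once, then index the plain vector inside the loop
loop_extract_once <- function(df, n) {
  ids <- df$id
  a <- NULL
  for (obs in seq_len(n)) {
    a <- ids[obs]
  }
  a
}

# Extract the column from the data frame on every iteration (the f2 pattern)
loop_extract_each <- function(df, n) {
  a <- NULL
  for (obs in seq_len(n)) {
    a <- df$id[obs]
  }
  a
}

result_once <- loop_extract_once(df, nrow(df))
result_each <- loop_extract_each(df, nrow(df))
identical(result_once, result_each)

# Timings depend on the machine and R version, so none are quoted here
system.time(loop_extract_once(df, nrow(df)))
system.time(loop_extract_each(df, nrow(df)))
```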

Gregor Thomas answered Oct 26 '22 07:10