Suppose I need to loop over the rows in a data frame for some reason.
I create a simple data.frame:
df <- data.frame(id = sample(1e6, 1e7, replace = TRUE))
It seems that f2 is much slower than f1, while I expected them to be equivalent.
f1 <- function(v) {
  for (obs in 1:(1e6)) {
    a <- v[obs]
  }
  a
}
system.time(f1(df$id))
f2 <- function() {
  for (obs in 1:(1e6)) {
    a <- df$id[obs]
  }
  a
}
system.time(f2())
Would you know why? Do they use exactly the same amount of memory?
To iterate over rows in pandas, you can use itertuples(), which returns a tuple for each row in the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row's values.
itertuples(): iterates through the DataFrame by yielding each row as a tuple. It reportedly takes about 16 seconds to iterate through a DataFrame with 10 million records, which is around 50 times faster than iterrows().
If a DataFrame is iterated directly with a for loop, the column names are returned. You can iterate over the columns and rows of a pandas DataFrame with the iteritems() (named items() in pandas 2.0 and later, where iteritems() was removed), iterrows(), and itertuples() methods.
Iterate over DataFrame columns: one simple way to iterate over the columns of a pandas DataFrame is a for loop. You can loop over the column labels and select each column with the get-item syntax ([]). The values attribute extracts a column's elements as a NumPy array.
If you instead write your timings like this and recognize that df$x is really a function call (to `$`(df,x)), the mystery disappears:
system.time(for(i in 1:1e6) df$x)
# user system elapsed
# 8.52 0.00 8.53
system.time(for(i in 1) df$x)
# user system elapsed
# 0 0 0
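To make the "function call" point concrete, here is a minimal sketch. It assumes a small stand-in data frame (much smaller than the one in the question) and uses the id column from the question rather than x. It only shows that the operator form and the explicit call to `$` return the same thing, so every df$id inside a loop is one more call.

```r
# Small stand-in data frame (hypothetical; the question uses 1e7 rows)
df <- data.frame(id = sample(1e3, 1e4, replace = TRUE))

# `$` is an ordinary R function: df$id is shorthand for calling it
via_operator <- df$id
via_call <- `$`(df, id)

# Both forms extract the same column vector
same_result <- identical(via_operator, via_call)
print(same_result)
```

Because the extraction is a call, putting it inside a loop body repeats that call on every iteration, which is what the first timing above measures.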
In f1, you bypass the data frame entirely by just passing a vector to your function. So your code is essentially "I have a vector! This is the first element. This is the second element. This is the third..."
By contrast, in f2, you give it a whole data frame and then extract a single column from it on every iteration before taking one element. So your code is "I have a data frame. This is the first element of the ID column. This is the second element of the ID column. This is the third..."
It's much faster to extract the simple data structure (the vector) once and then work only with that, rather than repeatedly extracting the simple structure from the larger object.
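As a rough sketch of that advice, the two functions below do the same work, but the second pulls the column out once before the loop. The function names sum_repeated and sum_hoisted and the smaller data frame are made up for illustration. Actual timings depend on your machine and R version, so none are shown.

```r
# Small stand-in data frame (hypothetical; the question uses 1e7 rows)
df <- data.frame(id = sample(1e3, 1e5, replace = TRUE))

# Extracts the column from the data frame on every iteration
sum_repeated <- function(d) {
  total <- 0
  for (i in seq_len(nrow(d))) {
    total <- total + d$id[i]
  }
  total
}

# Extracts the column once, then indexes the plain vector
sum_hoisted <- function(d) {
  ids <- d$id
  total <- 0
  for (i in seq_along(ids)) {
    total <- total + ids[i]
  }
  total
}

# Both versions compute the same value; only the amount of work differs
results_match <- identical(sum_repeated(df), sum_hoisted(df))
print(results_match)

# Compare the timings on your own machine
print(system.time(sum_repeated(df)))
print(system.time(sum_hoisted(df)))
```

In real code you would normally reach for a vectorized call such as sum(df$id) instead of either loop. The loops are kept here only to mirror the structure of f1 and f2.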