
Why does changing a column name take an extremely long time with a large data.frame?

I have a data.frame in R with 19 million rows and 90 columns. I have plenty of spare RAM and CPU cycles. It seems that changing a single column name in this data frame is a very intense operation for R.

    system.time(colnames(my.df)[1] <- "foo")
       user  system elapsed
     356.88   16.54  373.39

Why is this so? Does every row store the column name somehow? Is this creating an entirely new data frame? It seems this operation should complete in negligible time. I don't see anything obvious in the R manual entry.

I'm running build 7600 of R (64bit) on Windows 7, and in my current workspace, setting colnames on a small data.frame takes '0' time according to system.time().

Edit: I'm aware of the possibility of using data.table, and, honestly, I can wait 5 minutes for the rename to complete whilst I go get some tea. What I'm interested in is what is happening and why?

Ina asked Jun 14 '12 17:06



1 Answer

As several commenters have mentioned, renaming data frame columns is slow, because (depending on how you do it) it makes between 1 and 4 copies of the entire data.frame. Here, from data.table's ?setkey help page, is the nicest way of demonstrating this behavior that I've seen:

    DF = data.frame(a=1:2,b=3:4)       # base data.frame to demo copies
    try(tracemem(DF))                  # try() for non-Windows where R is
                                       # faster without memory profiling
    colnames(DF)[1] <- "A"             # 4 copies of entire object
    names(DF)[1] <- "A"                # 3 copies of entire object
    names(DF) <- c("A", "b")           # 1 copy of entire object
    `names<-`(DF,c("A","b"))           # 1 copy of entire object
    x=`names<-`(DF,c("A","b"))         # still 1 copy (so not print method)
    # What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?

To start understanding why things are done this way, you'll probably need to delve into some of the related discussions on R-devel. Here are a couple: "R-devel: speeding up perception" and "R-devel: Confused about NAMES".

My impressionistic reading of those threads is that:

  1. At least one copy is made so that modifications to it can be 'tried out' before overwriting the original. Thus, if something is wrong with the value-to-be-reassigned, [<-.data.frame or names<- can 'back out' and deliver an error message without having done any damage to the original object.

  2. Several members of R-core aren't completely satisfied with how things are working right now. A few of them explain that in some cases "R loses track"; Luke Tierney indicates that he's tried some modifications relating to this copying in the past "in a few cases and always had to back off"; and Simon Urbanek hints that "there may be some things coming up, too".

(As I said, though, that's just impressionistic: I'm simply not able to follow a full conversation about the details of R's internals!)
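A rough sketch of the 'back out' behaviour described in point 1, using a made-up two-column data frame. It only shows the visible outcome (the failed assignment raises an error and the original object keeps its names); it says nothing about how many copies R makes internally, which depends on the R version.

```r
# Hypothetical two-column data frame, used only for illustration
DF <- data.frame(a = 1:2, b = 3:4)

# Three names for two columns is invalid, so the replacement call errors.
# try() catches the error so the script keeps running.
res <- try(names(DF) <- c("A", "b", "extra"), silent = TRUE)

inherits(res, "try-error")  # did the assignment fail?
names(DF)                   # the original names are still in place
```

Because the error is raised inside the replacement function, the final step that would rebind DF never runs.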


Also relevant, in case you haven't seen it, here's how something like names(z)[3] <- "c2" "really" works:

    # From ?names<-
    z <- "names<-"(z, "[<-"(names(z), 3, "c2"))
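To check that expansion for yourself, here is a minimal sketch with a made-up data frame z (its column names are placeholders, not from the question). It compares the usual shorthand with the spelled-out nested call:

```r
# Hypothetical data frame; the third column is deliberately misnamed
z <- data.frame(c0 = 1:2, c1 = 3:4, c3 = 5:6)

# The usual shorthand
z1 <- z
names(z1)[3] <- "c2"

# The spelled-out form: first build the modified names vector with `[<-`,
# then hand the whole vector to `names<-`
z2 <- `names<-`(z, `[<-`(names(z), 3, "c2"))

identical(z1, z2)  # both routes give the same data frame
names(z2)
```

The nested form makes it visible that two replacement functions are involved, one on the names vector and one on the data frame itself, which is part of why the shorthand can trigger more copying than assigning the full names vector in one go.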

Note: Much of this answer comes from Matthew Dowle's answer to this other question. (I thought it was worth placing it here, and giving it some more exposure, since it's so relevant to your own question).

Josh O'Brien answered Feb 23 '23 06:02