
What is the best way to avoid passing a data frame around?

Tags: dataframe, r

I have 12 data.frames to work with. They are similar and need the same processing, so I wrote a function that takes a data.frame, processes it, and returns a data.frame. This works, but I am afraid that I am passing around a very big structure and may be making temporary copies (am I?), which can't be efficient. What is the best way to avoid passing a data.frame around?

doSomething <- function(df) {
  # do something with the data frame, df
  return(df)
}
Chang Chung asked Feb 27 '09 21:02



1 Answer

You are, indeed, passing the object around and using some memory. But I don't think you can do an operation on an object in R without passing the object around. Even if you didn't create a function and did your operations outside of the function, R would behave basically the same.

The best way to see this is to set up an example. If you are on Windows, open Windows Task Manager. If you are on Linux, open a terminal window and run the top command. I'm going to assume Windows in this example. In R, run the following:

col1<-rnorm(1000000,0,1)
col2<-rnorm(1000000,1,2)
myframe<-data.frame(col1,col2)

rm(col1)
rm(col2)
gc()

This creates two vectors called col1 and col2, then combines them into a data frame called myframe. It then drops the vectors and forces garbage collection to run. Watch the memory usage for the Rgui.exe task in Windows Task Manager. When I start R, it uses about 19 MB of memory. After I run the commands above, R is using just under 35 MB on my machine.
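To check the same thing from inside R rather than from Task Manager, object.size() reports how much memory an object occupies. The following is a minimal sketch that rebuilds the example frame (the seed and the variable names n, sizeBytes and expectedBytes are my own additions). Two numeric columns of one million 8-byte doubles should come to roughly 16 million bytes, which is in line with the jump from about 19 to about 35 reported above.

```r
set.seed(1)    # arbitrary seed, only so the sketch is repeatable
n <- 1000000

myframe <- data.frame(col1 = rnorm(n, 0, 1), col2 = rnorm(n, 1, 2))

# Size of the data frame as R sees it, in bytes
sizeBytes <- object.size(myframe)
cat("myframe occupies", format(sizeBytes, units = "Mb"), "\n")

# Rough expectation: 2 columns x n rows x 8 bytes per double,
# plus a small overhead for names and attributes
expectedBytes <- 2 * n * 8
cat("expected at least", expectedBytes, "bytes\n")
```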

Now try this:

myframe<-myframe+1

Your memory usage for R should jump to over 144 MB. If you force garbage collection using gc(), you will see it drop back to around 35 MB. To try this using a function, you can do the following:

doSomething <- function(df) {
  df <- df + 1 - 1
  return(df)
}
myframe<-doSomething(myframe)

When you run the code above, memory usage will jump to 160 MB or so. Running gc() will drop it back to around 35 MB.
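If watching Task Manager is awkward, a similar picture can be had from gc() itself, which returns a matrix of memory statistics including a "max used" figure that gc(reset = TRUE) resets. This is only a sketch: vcellsMb is a hypothetical helper of mine, and the numbers R reports will not match the operating system's view of the process exactly.

```r
n <- 1000000
myframe <- data.frame(col1 = rnorm(n, 0, 1), col2 = rnorm(n, 1, 2))

# Hypothetical helper: read one Mb figure for vector memory from gc().
# In the matrix gc() returns, the Mb value sits in the column
# immediately after the named one.
vcellsMb <- function(gcResult, column) {
  i <- which(colnames(gcResult) == column)[1]
  unname(gcResult["Vcells", i + 1])
}

invisible(gc(reset = TRUE))   # reset the "max used" statistics
myframe <- myframe + 1        # the operation being measured
after <- gc()                 # collect garbage, then read the statistics

usedMb <- vcellsMb(after, "used")
peakMb <- vcellsMb(after, "max used")
cat("in use now:", usedMb, "Mb; peak since reset:", peakMb, "Mb\n")
```

The gap between the two figures gives a rough idea of how much temporary memory the operation needed before garbage collection reclaimed it.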

So what to make of all this? Well, doing an operation outside of a function is not much more memory-efficient than doing it inside one. Garbage collection cleans things up nicely. Should you force gc() to run? Probably not, as it will run automatically as needed. I only ran it above to show how it affects memory usage.
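For the original situation of several similar data frames, one common pattern is to keep them in a list and apply the function with lapply(). The sketch below uses three tiny made-up frames instead of twelve, and the body of doSomething is a hypothetical stand-in for the real processing.

```r
# Stand-in for the real processing step (hypothetical)
doSomething <- function(df) {
  df$total <- df$a + df$b
  df
}

# Three small example frames instead of twelve, to keep the sketch short
frames <- list(
  first  = data.frame(a = 1:3,   b = 4:6),
  second = data.frame(a = 7:9,   b = 10:12),
  third  = data.frame(a = 13:15, b = 16:18)
)

# Overwrite the list with the processed frames, so the unprocessed
# versions become eligible for garbage collection
frames <- lapply(frames, doSomething)

frames$first$total  # 5 7 9
```

Because the result is assigned back to the same name, the old frames are no longer referenced and R can reclaim them on its own, without an explicit gc() call.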

I hope that helps!

JD Long answered Oct 03 '22 11:10