I have 12 data.frame
s to work with. They are similar and I have to do the same processing to each one, so I wrote a function that takes a data.frame
, processes it, and then returns a data.frame
. This works. But I am afraid that I am passing around a very big structure. I may be making temporary copies (am I?) This can't be efficient. What is the best way to avoid passing a data.frame
around?
doSomething <- function(df) {
// do something with the data frame, df
return(df)
}
Use the dplyr package to manipulate data frames. Use select() to choose variables from a data frame. Use filter() to choose data based on values. Use group_by() and summarize() to work with subsets of data.
summary(): provides summary statistics on the columns of the data frame. colnames(): shows the name of each column in the data frame. head(): shows the first 6 rows of the data frame. tail(): shows the last 6 rows of the data frame.
We can create a dataframe in R by passing the variable a,b,c,d into the data. frame() function. We can R create dataframe and name the columns with name() and simply specify the name of the variables.
Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set. Both head() and tail() print a top line called the 'header', which contains the names of the different variables in your data set.
You are, indeed, passing the object around and using some memory. But I don't think you can do an operation on an object in R without passing the object around. Even if you didn't create a function and did your operations outside of the function, R would behave basically the same.
The best way to see this is to set up an example. If you are in Windows open Windows Task Manager. If you are in Linux open a terminal window and run the top command. I'm going to assume Windows in this example. In R run the following:
col1<-rnorm(1000000,0,1)
col2<-rnorm(1000000,1,2)
myframe<-data.frame(col1,col2)
rm(col1)
rm(col2)
gc()
this creates a couple of vectors called col1 and col2 then combines them into a data frame called myframe. It then drops the vectors and forces garbage collection to run. Watch in your windows task manager at the mem usage for the Rgui.exe task. When I start R it uses about 19 meg of mem. After I run the above commands my machine is using just under 35 meg for R.
Now try this:
myframe<-myframe+1
your memory usage for R should jump to over 144 meg. If you force garbage collection using gc() you will see it drop back to around 35 meg. To try this using a function, you can do the following:
doSomething <- function(df) {
df<-df+1-1
return(df)
}
myframe<-doSomething(myframe)
when you run the code above, memory usage will jump up to 160 meg or so. Running gc() will drop it back to 35 meg.
So what to make of all this? Well, doing an operation outside of a function is not that much more efficient (in terms of memory) than doing it in a function. Garbage collection cleans things up real nice. Should you force gc() to run? Probably not as it will run automatically as needed, I just ran it above to show how it impacts memory usage.
I hope that helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With