I have panel data that looks like this (only the part relevant to my question):
Persno 122 122 122 333 333 333 333 333 444 444
Income 1500 1500 2000 2000 2100 2500 2500 1500 2000 2200
year 1990 1991 1992 1990 1991 1992 1993 1994 1992 1993
Now I would like to compute, for every row (Persno), the years of work experience at the beginning of the year. I use ddply:
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = 1:(max(year) - min(year) + 1))
To produce output looking like this:
Workexperience: 1 2 3 1 2 3 4 5 1 2
Now I want to merge the ddply results back into my original panel data:
data <- merge(data, hilf3, by = "Persno")
The panel data set is very large, and the code stops with a memory error. The error message is:
1: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
What should I do?
If you need to merge large data frames in R, one good option is to do it in pieces of, say, 10000 rows. If you're merging data frames x and y, loop over 10000-row pieces of x, merge each piece with y (or rather use plyr::join), and immediately append the results to a single csv file. After all the pieces have been merged and written out, read that csv file back in. With proper use of logical index vectors and well-placed rm and gc calls this is very memory-efficient, although it is not fast.
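A minimal sketch of this chunked approach (x, y, and the key column Persno are taken from the question and the description above; the chunk size and file name are arbitrary choices):

library(plyr)

chunk_size <- 10000
out_file   <- "merged_chunks.csv"   # temporary file; name is arbitrary
n          <- nrow(x)

for (start in seq(1, n, by = chunk_size)) {
  idx   <- start:min(start + chunk_size - 1, n)   # rows of x in this piece
  piece <- join(x[idx, ], y, by = "Persno")       # plyr::join instead of merge()
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = (start == 1),           # write the header only once
              append    = (start > 1))
  rm(piece); gc()                                 # free memory before the next piece
}

merged <- read.csv(out_file)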
Well, perhaps the surest way of fixing this is to get more memory. However, that isn't always an option. What you can do is somewhat platform-dependent. On Windows, check the result of memory.limit() and compare it to your available RAM; if the limit is lower than your RAM, you can raise it. This is not an option on Linux, where R can by default use all of your memory.
Another issue that can complicate matters is whether you are running a 32-bit or 64-bit system: 32-bit Windows can only address a limited amount of RAM (2-4 GB, depending on settings). This is not an issue on 64-bit Windows 7, which can address far more memory.
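On Windows, the checks look something like this (these functions are Windows-only, and the 8000 MB figure is just an example):

memory.size()              # MB currently in use by R
memory.size(max = TRUE)    # maximum MB obtained from the OS so far
memory.limit()             # current allocation limit in MB
memory.limit(size = 8000)  # raise the limit, e.g. to 8 GB on 64-bit Windows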
A more practical solution is to eliminate all unnecessary objects from your workspace before performing the merge. Run gc() to see how much memory you are using and to reclaim the memory held by objects that no longer have any references. Personally, I would run your ddply() step from a script, save the resulting data frame as a CSV file, close your workspace, reopen it, and then perform the merge.
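A sketch of that workflow, reusing the object names from the question (the csv file name is arbitrary):

library(plyr)

# Step 1: compute the work-experience table and persist it.
hilf3 <- ddply(data, .(Persno), summarize,
               Bgwork = 1:(max(year) - min(year) + 1))
write.csv(hilf3, "hilf3.csv", row.names = FALSE)

# Step 2: drop everything you no longer need and reclaim memory.
rm(list = setdiff(ls(), "data"))
gc()

# Step 3 (ideally after restarting R and reloading the panel data):
hilf3 <- read.csv("hilf3.csv")
data  <- merge(data, hilf3, by = "Persno")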
Finally, the option that needs the least memory (but is by far the most painful) is to create a new data frame and use R's subsetting commands to copy the columns you want over, one by one. I really don't recommend this, as it is tiresome and error prone, but I have had to do it once when there was no other way to complete my analysis (I ended up investing in a new computer with more RAM shortly afterwards).
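Purely as an illustration of the idea, assuming the table being merged in (here called lookup, a hypothetical name) has exactly one row per Persno; with repeated keys this simple lookup would not reproduce a full merge:

pos <- match(data$Persno, lookup$Persno)      # matching row of lookup for each panel row

new_data <- data.frame(Persno = data$Persno)  # start from the key column
new_data$Income <- data$Income                # copy the original columns one by one
new_data$year   <- data$year
new_data$Bgwork <- lookup$Bgwork[pos]         # then add the looked-up column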
Hope this helps.