I've built a small piece of software that can be customized via a config file. The config file is parsed and translated into a nested environment structure (e.g. .HIVE$db = an environment, .HIVE$db$user = "Horst", .HIVE$db$pw = "my password", .HIVE$regex$date = some regex for dates, etc.).
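In R terms, the structure looks roughly like this (just an illustrative sketch; the real thing is built by the config parser, and the date regex here is a placeholder):

    .HIVE <- new.env()
    .HIVE$db <- new.env()
    .HIVE$db$user <- "Horst"
    .HIVE$db$pw <- "my password"
    .HIVE$regex <- new.env()
    .HIVE$regex$date <- "\\d{4}-\\d{2}-\\d{2}"   # placeholder for the actual date regex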
I've built routines that can handle those nested environments (e.g. look up the value of "db/user" or "regex/date", change it, etc.). The thing is that the initial parsing of the config files takes a long time and results in quite large objects (three to four of them, between 4 and 16 MB). So I thought "No problem, let's just cache them by saving the object(s) to .Rdata files". This works, but "loading" the cached objects makes the RAM consumption of my Rterm process go through the roof (over 1 GB!), and I still don't really understand why (this doesn't happen when I "compute" the objects anew, but that's exactly what I'm trying to avoid since it takes too long).
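The caching itself is nothing fancy, roughly along these lines (the file name is just an example):

    ## in the session that parses the config
    save(.HIVE, file = "hive_cache.Rdata")

    ## in a fresh Rterm session, instead of re-parsing
    load("hive_cache.Rdata")   # this is where RAM consumption explodes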
I already thought about maybe serializing it, but I haven't tested it as I would need to refactor my code a bit. Plus I'm not sure if it would affect the "loading back into R" part in just the same way as loading .Rdata files.
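What I had in mind there (untested so far; saveRDS()/readRDS() is just one way to do it) is roughly:

    ## write the object with R's serialization interface instead of save()
    saveRDS(.HIVE, file = "hive_cache.rds")

    ## read it back in a new session
    .HIVE <- readRDS("hive_cache.rds")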
Can anyone tell me why loading a previously computed object has such an effect on the memory consumption of my Rterm process (compared to computing it anew in every new process I start), and how best to avoid this?
If desired, I'll also try to come up with an example, though it's a bit tricky to reproduce my exact scenario.
It's likely because the environments you are creating are carrying around their ancestors. If you don't need the ancestor information, then set the parents of such environments to emptyenv() (or just don't use environments if you don't need them). Also note that formulas (and, of course, functions) have environments, so watch out for those too.