Following the post about data.table and parallel computing, I'm trying to find a way to parallelize an operation on a data.table.

I have a data.table with 4 million rows and 14 columns and would like to place it in shared memory, so that operations on it can be parallelized with the "parallel" package and parLapply without having to copy the table to each node in the cluster (which is what parLapply does). At the moment the cost of moving the data.table around is bigger than the benefit of parallel computation.
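For context, a minimal sketch of what I am doing now (dt, the id grouping and the worker count are placeholders for my real data and setup); with a PSOCK cluster, clusterExport has to ship the whole table to every worker:

    library(parallel)
    library(data.table)

    ## Stand-in for the real table: 4 million rows, grouped by id.
    dt <- data.table(id = rep(1:4, each = 1e6), x = rnorm(4e6))

    cl <- makeCluster(4)                       # PSOCK cluster (the default on Windows)
    clusterEvalQ(cl, library(data.table))
    clusterExport(cl, "dt")                    # copies the full table to every worker
    res <- parLapply(cl, unique(dt$id), function(g) dt[id == g, mean(x)])
    stopCluster(cl)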
I found the "bigmemory"-package as an answer for sharing memory, but it doesn't maintain the "data.table"-structure of the data. So does anyone know a way to:
1) put the data.table
in shared memory
2) maintain the "data.table"-structure of the data by doing so
3) use parallel processing on this data.table
?
Thanks in advance!
Old question, but here is an answer, since nobody else has given one and it might be helpful. I assume the problem you are having is that you are on Windows and therefore have to use the PSOCK type of cluster. Unfortunately, on Windows this means you have to copy the data to each node. However, there is a workaround: get hold of Docker and spin up an Rserve instance on the Docker VM (e.g. stevenpollack/docker-rserve). Since this will be Linux-based, you can create a FORK cluster on the Docker VM. Then, using your native R instance, you can send over only one copy of the data to the Rserve instance (check out the RSclient library), do your parallelized job on the VM, and collect the results back into your native R session.
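For concreteness, here is a rough sketch of that workflow. It assumes Rserve is already listening inside the Docker VM (the host/port, core count, and the dt/id columns are made-up placeholders) and that data.table and parallel are installed in the container:

    library(RSclient)
    library(data.table)

    dt <- data.table(id = rep(1:4, each = 1e6), x = rnorm(4e6))  # your real table here

    ## Connect to the Rserve instance running in the Docker VM.
    con <- RS.connect(host = "192.168.99.100", port = 6311)

    ## Ship the table across once.
    RS.assign(con, "dt", dt)

    ## Everything inside RS.eval runs on the Linux VM, so forked workers
    ## share dt's memory instead of receiving their own copies.
    res <- RS.eval(con, {
      library(parallel)
      library(data.table)
      setDT(dt)   # restore data.table internals after the transfer
      mclapply(unique(dt$id), function(g) dt[id == g, mean(x)], mc.cores = 4)
    })

    RS.close(con)

The forked children see dt via copy-on-write, so only the (small) results travel back over the socket to your native R session.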