First my question: how can I apply several filter functions in parallel to one large DataFrame without copying the whole frame to every worker, and how do I vcat the results back together afterwards?
Now the details:
I have this program:
data = DataFrames.readtable("...") # a big baby (~100MB)
filter_functions = [ fct1, fct2, fct3 ... ] # each is (x::DataFrame) -> y::DataFrame
filtered_data = @parallel vcat for fct in filter_functions
    fct(data)::DataFrame
end
It works fine functionality-wise, but each parallel call to fct(data) copies the whole data frame to the worker, making everything painfully slow.
Ideally, I would like to load the data once and always reuse the pre-loaded data on each worker. I came up with this code to do so:
@everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
@everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
@everywhere for i in 1:length(filter_functions)
    if (myid() - 1) % nworkers() == i % nworkers() # crude round-robin assignment of functions to workers
        fct = filter_functions[i]
        filtered_data_temp = fct(data)
    end
    # How to vcat all the filtered_data_temp ?
end
But now I have another problem: I cannot figure out how to vcat() all the filtered_data_temp results into a single variable on the process with myid() == 1.
I would very much appreciate any insight.
Note: I am aware of Operating in parallel on a large constant data structure in Julia. Yet I don't believe it applies to my problem, because all my filter_functions operate on the data as a whole.
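For reference, one pattern that avoids re-copying the frame is to define a small helper on every worker and gather only the (much smaller) results. A minimal sketch, assuming data and filter_functions have already been defined on every worker as above; apply_filter is a hypothetical helper name:

# Julia >= 0.7 needs `using Distributed`; on older versions these names live in Base.
using Distributed

# Defined on every worker, so `data` and `filter_functions` resolve to each
# worker's *local* globals at call time instead of being captured and
# serialized from process 1.
@everywhere apply_filter(i) = filter_functions[i](data)

# pmap ships only the index to a worker and only the filtered DataFrame back;
# vcat then stitches the pieces together on process 1.
filtered_data = vcat(pmap(apply_filter, 1:length(filter_functions))...)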
Parallel computing is based on dividing a large problem into smaller ones, each carried out by a single processor; these sub-problems are executed concurrently, in a distributed manner.
The performance of any parallel application is ultimately bounded by the speed, capacity and interfaces of each processing element, and how you program a parallel computer depends on how the hardware platform's memory is organized and divided among the processors.
Starting Julia with multiple threads: the number of execution threads is controlled either with the -t / --threads command-line argument or with the JULIA_NUM_THREADS environment variable. When both are specified, -t / --threads takes precedence.
Julia's multi-threading provides the ability to schedule Tasks simultaneously on more than one thread or CPU core, sharing memory. This is usually the easiest way to get parallelism on one's PC or on a single large multi-core server.
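For the workload in the question, the threaded version is short. A minimal sketch on a recent Julia, assuming data and filter_functions are defined as in the question and Julia was started with several threads (e.g. julia -t 4):

using Base.Threads # provides @threads and nthreads()

results = Vector{DataFrame}(undef, length(filter_functions))
@threads for i in 1:length(filter_functions)
    # Threads share memory, so `data` is never copied.
    results[i] = filter_functions[i](data)
end
filtered_data = vcat(results...)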
You might want to look into DistributedArrays and load your data into a distributed array.
EDIT: Probably something like this:
@everywhere using DistributedArrays # must be loaded on all workers

data = DataFrames.readtable("...")
dfiltered_data = distribute(data) # distributes data among processes automagically
filter_functions = [ fct1, fct2, fct3 ... ]
for fct in filter_functions
    dfiltered_data = fct(dfiltered_data)::DataFrame
end
You can also check the DistributedArrays unit tests for more examples.
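One caveat: distribute operates on AbstractArrays, and a DataFrame is not one, so in practice you would distribute a plain array (a column, a matrix, or row indices, say). A minimal sketch of the basic DistributedArrays workflow, with made-up data:

using Distributed
addprocs(4) # spin up four worker processes
@everywhere using DistributedArrays

A = rand(1_000_000) # ordinary array on process 1
dA = distribute(A) # chunks of dA now live on the workers
dB = map(x -> 2x, dA) # each worker maps over its own local chunk
B = Array(dB) # gather the distributed result back to process 1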