First my question: how can I apply several filter functions in parallel to one large DataFrame without copying the whole frame to every worker, and how do I vcat the results back together afterwards?
Now the details:
I have this program:
data = DataFrames.readtable("...") # a big baby (~100MB)
filter_functions = [ fct1, fct2, fct3 ... ] # each is (x::DataFrame) -> y::DataFrame
filtered_data = @parallel vcat for fct in filter_functions
    fct(data)::DataFrame
end
It works fine functionality-wise, but each parallel call to fct(data) copies the whole data frame to the worker, making everything painfully slow.
Ideally, I would like to load the data once and always reuse the pre-loaded data on each worker. I came up with this code to do so:
@everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
@everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
@everywhere for i in 1:length(filter_functions)
    if (myid() - 1) % nworkers() == i % nworkers() # crude round-robin assignment of functions to workers
        fct = filter_functions[i]
        filtered_data_temp = fct(data)
    end
    # How to vcat all the filtered_data_temp ?
end
But now I have another problem: I cannot figure out how to vcat() all the filtered_data_temp results into a single variable on the process with myid() == 1.
I would very much appreciate any insight.
Note: I am aware of Operating in parallel on a large constant data structure in Julia. Yet I don't believe it applies to my problem, because all my filter_functions operate on the data as a whole.
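For reference, one pattern that avoids re-copying the frame is to define a small helper on every worker and gather only the (much smaller) results. A minimal sketch, assuming data and filter_functions have already been defined on every worker as above; apply_filter is a hypothetical helper name:

# Julia >= 0.7 needs `using Distributed`; on older versions these names live in Base.
using Distributed

# Defined on every worker, so `data` and `filter_functions` resolve to each
# worker's *local* globals at call time instead of being captured and
# serialized from process 1.
@everywhere apply_filter(i) = filter_functions[i](data)

# pmap ships only the index to a worker and only the filtered DataFrame back;
# vcat then stitches the pieces together on process 1.
filtered_data = vcat(pmap(apply_filter, 1:length(filter_functions))...)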
Parallel computing is based on dividing a large problem into smaller ones, each carried out by a single processor; these sub-problems are executed concurrently, in a distributed manner.
The performance of any parallel application is ultimately bounded by the speed, capacity and interfaces of each processing element, and how you program a parallel computer depends on how the hardware platform's memory is organized and divided among the processors.
Starting Julia with multiple threads: the number of execution threads is controlled either with the -t / --threads command-line argument or with the JULIA_NUM_THREADS environment variable. When both are specified, -t / --threads takes precedence.
Julia's multi-threading provides the ability to schedule Tasks simultaneously on more than one thread or CPU core, sharing memory. This is usually the easiest way to get parallelism on one's PC or on a single large multi-core server.
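For the workload in the question, the threaded version is short. A minimal sketch on a recent Julia, assuming data and filter_functions are defined as in the question and Julia was started with several threads (e.g. julia -t 4):

using Base.Threads # provides @threads and nthreads()

results = Vector{DataFrame}(undef, length(filter_functions))
@threads for i in 1:length(filter_functions)
    # Threads share memory, so `data` is never copied.
    results[i] = filter_functions[i](data)
end
filtered_data = vcat(results...)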
You might want to look into DistributedArrays and load your data into a distributed array.
EDIT: Probably something like this:
@everywhere using DistributedArrays # must be loaded on all workers

data = DataFrames.readtable("...")
dfiltered_data = distribute(data) # distributes data among processes automagically
filter_functions = [ fct1, fct2, fct3 ... ]
for fct in filter_functions
    dfiltered_data = fct(dfiltered_data)::DataFrame
end
You can also check the DistributedArrays unit tests for more examples.
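One caveat: distribute operates on AbstractArrays, and a DataFrame is not one, so in practice you would distribute a plain array (a column, a matrix, or row indices, say). A minimal sketch of the basic DistributedArrays workflow, with made-up data:

using Distributed
addprocs(4) # spin up four worker processes
@everywhere using DistributedArrays

A = rand(1_000_000) # ordinary array on process 1
dA = distribute(A) # chunks of dA now live on the workers
dB = map(x -> 2x, dA) # each worker maps over its own local chunk
B = Array(dB) # gather the distributed result back to process 1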