
Parallel computation in Julia with large data

First my question:

  • Is it possible to prevent Julia from copying variables each time in a parallel for loop?
  • If not, how can I implement a parallel reduce operation in Julia?
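(For reference on the second point: Julia's distributed for loop accepts a reduction operator directly. A minimal, self-contained sketch using the stdlib Distributed module; in current Julia the macro is spelled @distributed, where the 0.x-era code below says @parallel. The function f here is a made-up stand-in for a real per-iteration computation.)

```julia
using Distributed
addprocs(2)  # spawn two worker processes

# a toy per-iteration computation; each iteration returns a small vector
@everywhere f(i) = [i, i^2]

# the (vcat) reducer concatenates every iteration's result into one vector,
# combining the per-worker partial results on the calling process
result = @distributed (vcat) for i in 1:4
    f(i)
end
```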

Now the details:

I have this program:

data = DataFrames.readtable("...") # a big baby (~100MB)
filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
filtered_data = @parallel vcat for fct in filter_functions
  fct(data)::DataFrame
end

It works nicely functionality-wise, but each parallel call to fct(data) copies the whole data frame to the worker, making everything painfully slow.

Ideally, I would like to load the data once and always use the pre-loaded data on each worker. I came up with this code to do so:

@everywhere data = DataFrames.readtable("...") # a big baby (~100MB)
@everywhere filter_functions = [ fct1, fct2, fct3 ... ] # (x::DataFrame) -> y::DataFrame
@everywhere for i in 1:length(filter_functions)
  if (i - 1) % nworkers() == myid() - 2 # round-robin: assign function i to one of the workers (ids 2..nworkers()+1)
    fct = filter_functions[i]
    filtered_data_temp = fct(data)
  end
  # How to vcat all the filtered_data_temp ?
end

But now I have another problem: I cannot figure out how to vcat() all the filtered_data_temp results back into a single variable on the process with myid() == 1.
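One way to sketch the "load once, ship only an index" pattern is pmap over function indices: each worker looks up its own pre-loaded globals, so only an integer crosses the wire, and the master concatenates the returned results. This is a self-contained toy in which a plain vector and the filter_evens/filter_odds/filter_big functions are hypothetical stand-ins for the real data frame and fct1..fct3:

```julia
using Distributed
addprocs(2)

@everywhere begin
    # stand-in for the ~100MB data frame: defined once per worker,
    # never serialized again after this point
    const data = collect(1:1000)

    # hypothetical stand-ins for fct1, fct2, fct3
    filter_evens(v) = filter(iseven, v)
    filter_odds(v)  = filter(isodd, v)
    filter_big(v)   = filter(x -> x > 990, v)
    const filter_functions = [filter_evens, filter_odds, filter_big]

    # only the index i is sent to the worker; `data` is the worker-local copy
    apply_filter(i) = filter_functions[i](data)
end

# run the filters in parallel, then concatenate on the master process
results = pmap(apply_filter, 1:length(filter_functions))
filtered_data = vcat(results...)
```

The key point is that the closure passed to pmap must not capture `data` directly, or it would be serialized along with the call; apply_filter instead references each worker's own global.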

I would very much appreciate any insight.

Note: I am aware of Operating in parallel on a large constant datastructure in Julia. Yet, I don't believe it applies to my problem, because all my filter_functions operate on the array as a whole.

Asked Jul 28 '15 by Antoine Trouve



1 Answer

You might want to look into DistributedArrays.jl and load your data into a distributed array.

EDIT: Probably something like this:

data = DataFrames.readtable("...")
dfiltered_data = distribute(data) # distributes data among processes automagically
filter_functions = [ fct1, fct2, fct3 ... ]
for fct in filter_functions
  dfiltered_data = fct(dfiltered_data) # each fct now operates on a distributed array
end

You can also check the DistributedArrays.jl unit tests for more examples.
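As a concrete, minimal illustration of the distribute idea (assuming the DistributedArrays.jl package is installed; a plain vector stands in for the data frame):

```julia
using Distributed
addprocs(2)
@everywhere using DistributedArrays

A = collect(1:100)
dA = distribute(A)        # chunks of A now live on the worker processes
dB = map(x -> 2x, dA)     # map runs on each worker's local chunk
total = sum(dB)           # distributed reduction back to the master
```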

Answered Dec 11 '22 by Felipe Lema