Recommended language for multithreaded data work

Right now, I use a combination of Python and R for all of my data processing needs. However, some of my datasets are incredibly large and would benefit strongly from multithreaded processing.

For example, if there are two steps that each have to be performed on a set of several million data points, I would like to be able to start the second step while the first step is still running, using the part of the data that has already been processed by the first step.

From my understanding, neither Python nor R is the ideal language for this type of work (at least, I don't know how to implement it in either language). What would be the best language/implementation for this type of data processing?

asked Aug 17 '10 by chimeracoder


1 Answer

It is possible to do this in Python using the multiprocessing module -- it spawns multiple processes instead of threads, which bypasses the GIL and so allows true parallelism across cores.

That is not to say that Python is the 'best' language for this job; that's a subjective point which can be argued over. But it is certainly capable of it.
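For instance, here is a minimal sketch (not from the original answer) of parallelising a single per-item step with multiprocessing.Pool; transform is a hypothetical stand-in for whatever you do to each data point:

from multiprocessing import Pool

def transform(x):
    # hypothetical CPU-bound step applied to each data point
    return x * x

if __name__ == '__main__':
    data = range(1_000_000)
    with Pool() as pool:  # one worker process per CPU core by default
        results = pool.map(transform, data, chunksize=10_000)
    print(results[:5])  # [0, 1, 4, 9, 16]

The chunksize argument just batches the items sent to each worker, so the inter-process overhead stays small relative to the actual computation.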

EDIT: Yes, there are several ways to share data between processes. Pipes are the simplest: they are essentially file-like handles that one process can write to and another can read from. Straight from the docs:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # prints "[42, None, 'hello']"
    p.join()

You could for instance have one process performing the first step and sending the results down a pipe to another process for the second step.
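As a rough sketch of that pattern (step_one and step_two are hypothetical names, and the arithmetic stands in for real work), the second process starts consuming results as soon as the first process sends them, instead of waiting for the whole first step to finish:

from multiprocessing import Process, Pipe

def step_one(data, conn):
    # first step: process each item and send it down the pipe as soon as it is ready
    for x in data:
        conn.send(x * 2)
    conn.send(None)  # sentinel telling the consumer there is no more data
    conn.close()

def step_two(conn):
    # second step: starts working while step_one is still producing
    while True:
        item = conn.recv()
        if item is None:
            break
        print(item + 1)

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    producer = Process(target=step_one, args=(range(10), child_conn))
    consumer = Process(target=step_two, args=(parent_conn,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()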

answered Sep 23 '22 by Katriel