Right now, I use a combination of Python and R for all of my data processing needs. However, some of my datasets are incredibly large and would benefit strongly from multithreaded processing.
For example, if there are two steps that each have to be performed on a set of several million data points, I would like to be able to start the second step while the first is still running, using the part of the data that has already passed through the first step.
From my understanding, neither Python nor R is the ideal language for this type of work (at least, I don't know how to implement it in either language). What would be the best language/implementation for this type of data processing?
While other systems have provided facilities for multithreading (usually via "lightweight process" libraries), building multithreading support into the language as Java has done provides the programmer with a much more powerful tool for easily creating thread-safe multithreaded classes.
Python (on the standard CPython interpreter) does not support true multi-core execution via multithreading: the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. Python does have a threading library, and the GIL does not prevent threading as such; threads still help with I/O-bound work, but they will not speed up CPU-bound processing.
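To make that concrete, here is a minimal threading sketch (my own illustration, not part of the original answer); io_bound_task is a hypothetical stand-in for something like a network or disk read:

import threading
import time

def io_bound_task(name):
    # Hypothetical I/O-bound step; while a thread waits on I/O (simulated
    # here with sleep), the GIL is released and other threads can run.
    time.sleep(1)
    print(name, "finished")

if __name__ == '__main__':
    threads = [threading.Thread(target=io_bound_task, args=("task-%d" % i,))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

All four tasks finish in roughly one second of wall time, but the same pattern applied to CPU-bound work would run no faster than a single thread under CPython.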
I suggest using C (or C++) as the high-level language, with MPI and OpenMP as the parallel libraries. These languages are standard and portable, and these parallel libraries let you apply parallel and distributed computing across a wide range of parallel systems (from a single multi-core processor to a cluster of many nodes).
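I can't show OpenMP here, but as a rough MPI sketch I'll use the mpi4py Python bindings rather than C, to stay consistent with the other examples in this thread (the data, the slicing, and the squared-sum step are all placeholders):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank processes its own slice of the data (placeholder work).
data = range(1000)
partial = sum(x * x for x in data[rank::size])

# Combine the per-rank results on rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)

You would run it with something like mpirun -n 4 python script.py; the same pattern scales from one multi-core machine to a cluster.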
Java has great support for multithreaded applications. Java supports multithreading through the Thread class. A Java Thread lets us create a lightweight process that executes some task; we can create multiple threads in our program and start them.
It is possible to do this in Python using the multiprocessing module -- this spawns multiple processes instead of threads, which bypasses the GIL and hence allows true parallelism.
That is not to say that Python is the 'best' language for this job; that's a subjective point which can be argued over. But it is certainly capable of it.
EDIT: Yes, there are several ways to share data between processes. Pipes are the simplest; they are sort-of file-like handles which one process can write to and then another can read from. Straight from the docs:
from multiprocessing import Process, Pipe

def f(conn):
    # Child process: send a message through its end of the pipe.
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())    # prints "[42, None, 'hello']"
    p.join()
You could for instance have one process performing the first step and sending the results down a pipe to another process for the second step.
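A rough sketch of that pipeline (step_one and step_two are placeholders for your real processing; the doubling and incrementing are just filler work):

from multiprocessing import Process, Pipe

def step_one(data, conn):
    # First step: push each result down the pipe as soon as it is ready.
    for x in data:
        conn.send(x * 2)          # placeholder for the real first step
    conn.send(None)               # sentinel: no more data
    conn.close()

def step_two(conn):
    # Second step: starts consuming while step one is still producing.
    results = []
    while True:
        item = conn.recv()
        if item is None:
            break
        results.append(item + 1)  # placeholder for the real second step
    print(len(results), "items processed")

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p1 = Process(target=step_one, args=(range(100), child_conn))
    p2 = Process(target=step_two, args=(parent_conn,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

For millions of data points you would probably batch items into chunks (or use a multiprocessing.Queue) rather than sending them one at a time, since each send carries some pickling overhead.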