Threading vs Forking (with explanation of what I want to do)

Tags:

perl

So, I've reviewed a ton of articles and forums before posting this, but I keep reading conflicting answers. Firstly, the OS is not an issue: I can use either Windows or Unix, whichever would be best for my problem. I have a ton of data that I need to use for read-only purposes (not sure why this would matter, but in case it does, the data structure that I have to go through is an array of arrays of arrays of hashes whose values are also arrays).

I'm essentially comparing a "query" to a ton of different "sentences" and computing their relative similarities. From these quantities (several million), I want to take the top x% and do something with them. I need to parallelize this process. There's just no good way for me to decrease the search space: I need to compare over everything to get good results, and it will just take too long without some sort of threading/forking. Again, I've seen many conflicting answers and don't know which approach to use.

Any help would be appreciated. Thanks in advance.

EDIT: I don't think memory usage will be an issue, but I'm not sure (8 GB RAM).

Steve P. asked Apr 28 '13 00:04





2 Answers

Without more detail on your problem, there's not much help that can be given. You want to parallelize a process. Threads and forks in Perl have advantages and disadvantages.

One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about the thread safety of libraries or most of your code, just the threaded bit. However, it can be a performance drag and memory hungry, as Perl must put a copy of the interpreter and all loaded modules into each thread.
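To make the "not shared by default" point concrete, here is a minimal sketch. It assumes a perl built with ithreads support; the variable names are made up for illustration. Only the variable marked `:shared` (from threads::shared) is visible across threads, while every thread works on its own private copy of the other one.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $private = 0;           # every thread gets its own copy of this
my $shared :shared = 0;    # explicitly shared between all threads

my @workers;
for my $id (1 .. 4) {
    push @workers, threads->create(sub {
        $private++;                      # changes this thread's copy only
        { lock($shared); $shared++; }    # lock, then update the shared value
        return $private;                 # the worker's own view of $private
    });
}

# join returns whatever the thread's sub returned
my @seen = map { $_->join } @workers;

print "private in parent: $private\n";      # still 0
print "shared in parent:  $shared\n";       # 4
print "private in workers: @seen\n";        # 1 1 1 1
```

The parent never sees the workers' changes to `$private`, which is why unshared code needs no locking, and also why each thread costs memory for its own copy.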

When it comes to forking, I will only be talking about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.

Forking Advantages

  • Very fast to create a fork
  • Very robust

Forking Disadvantages

  • Communicating between the processes can be slow and awkward
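As a rough illustration of the forking trade-offs listed above, here is a sketch that uses only core Perl (`fork`, `pipe`, `waitpid`) on Unix. The sentences, the query and the `score` function are hypothetical stand-ins for the real data and similarity measure. Each child reads the parent's data as it was at fork time and has to send its results back through a pipe as text, which is the awkward part.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real read-only data and query.
my @sentences = (
    'the quick brown fox',
    'a lazy dog sleeps',
    'the dog chased the fox',
    'nothing in common here',
    'quick quick quick',
    'fox and dog',
);
my $query = 'quick fox dog';

# Hypothetical similarity: how many words of the sentence occur in the query.
sub score {
    my ($query, $sentence) = @_;
    my %in_query = map { $_ => 1 } split ' ', $query;
    return scalar grep { $in_query{$_} } split ' ', $sentence;
}

my $workers = 3;
my (%reader_for, @results);

for my $w (0 .. $workers - 1) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                      # child process
        close $reader;
        # The child sees @sentences without any explicit copying.
        for (my $i = $w; $i < @sentences; $i += $workers) {
            print {$writer} join("\t", $i, score($query, $sentences[$i])), "\n";
        }
        close $writer;
        exit 0;
    }

    close $writer;                        # parent keeps only the read end
    $reader_for{$pid} = $reader;
}

# Collect "index<TAB>score" lines from each child, then reap it.
for my $pid (keys %reader_for) {
    my $reader = $reader_for{$pid};
    while (my $line = <$reader>) {
        chomp $line;
        my ($index, $score) = split /\t/, $line;
        $results[$index] = $score;
    }
    close $reader;
    waitpid($pid, 0);
}

print "scores: @results\n";
```

For nested results you would need to serialize them (for example with Storable) before writing them to the pipe, which is the kind of plumbing a worker-management module can take off your hands.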

Thread Advantages

  • Thread coordination and data interchange are fairly easy
  • Threads are fairly easy to use

Thread Disadvantages

  • Each thread takes a lot of memory
  • Threads can be slow to start
  • Threads can be buggy (better the more recent your perl)
  • Database connections are not shared across threads

That last one is a bit of a doozy if the documentation is up to date. If you're going to be doing a lot of SQL, don't use threads.

In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.

Really what it comes down to is what fits your way of thinking and your particular problem.

In either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built-in inter-process communication.

For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
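A minimal sketch of the pool-and-reuse idea with Thread::Queue might look like the following. It assumes a perl built with ithreads; the squaring job is a placeholder for real work and the pool size of 4 is arbitrary. The workers are started once, pull jobs from one queue until they see an `undef` sentinel, and push results onto a second queue.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs    = Thread::Queue->new;
my $results = Thread::Queue->new;

# Each worker keeps pulling jobs until it dequeues an undef sentinel.
sub worker {
    while (defined(my $n = $jobs->dequeue)) {
        $results->enqueue($n * $n);    # placeholder for the real work
    }
    return;
}

# Start the pool once and reuse the same threads for every job.
my $pool_size = 4;
my @pool = map { threads->create(\&worker) } 1 .. $pool_size;

$jobs->enqueue(1 .. 10);                  # ten jobs
$jobs->enqueue((undef) x $pool_size);     # one "stop" sentinel per worker
$_->join for @pool;

# All workers are done, so drain the results without blocking.
my @squares;
while (defined(my $r = $results->dequeue_nb)) {
    push @squares, $r;
}
@squares = sort { $a <=> $b } @squares;   # arrival order is not deterministic

print "results: @squares\n";
```

Because the threads are created before any large data is loaded into them and are then reused, the startup cost is paid only once per worker.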

When reading articles about Perl threads, keep in mind they were a bit crap when they were introduced in 5.8.0 in 2002, and only serviceable by 5.10.1. Since then they've firmed up considerably. Information and opinions about their efficiency and robustness tend to fall rapidly out of date.

Schwern answered Oct 21 '22 07:10



Threading can be more difficult to get correct, but won't utilize as much memory.

Forking can be simpler to implement, but can use a significant amount of memory.

If you don't have experience with either, I would start by implementing a forking version and go from there.

They Call Me Bruce answered Oct 21 '22 08:10
