 

Are thread pools needed for pure Haskell code?

In Real World Haskell, Chapter 28, Software transactional memory, a concurrent web link checker is developed. It fetches all the links in a webpage and hits every one of them with a HEAD request to determine whether the link is active. A concurrent approach is taken to build this program, and the following statement is made:

We can't simply create one thread per URL, because that may overburden either our CPU or our network connection if (as we expect) most of the links are live and responsive. Instead, we use a fixed number of worker threads, which fetch URLs to download from a queue.

I do not fully understand why this pool of threads is needed instead of using forkIO for each link. AFAIK, the Haskell runtime maintains a pool of threads and schedules them appropriately so I do not see the CPU being overloaded. Furthermore, in a discussion about concurrency on the Haskell mailing list, I found the following statement going in the same direction:

The one paradigm that makes no sense in Haskell is worker threads (since the RTS does that for us); instead of fetching a worker, just forkIO instead.

Is the pool of threads only required for the network part, or is there a CPU reason for it too?

asked Mar 03 '13 by Nicolas Dudebout

1 Answer

The core issue, I imagine, is the network side. If you have 10,000 links and forkIO for each link, then you potentially have 10,000 sockets you're attempting to open at once, which, depending on how your OS is configured, probably won't even be possible, much less efficient.
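The fixed-pool idea the book describes can be sketched roughly as follows, using the `stm` package. This is a minimal sketch, not the book's actual code: `checkLink` is a hypothetical stand-in for the real HEAD request, and the URL list is made up for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Stand-in for issuing a real HEAD request.
checkLink :: String -> IO ()
checkLink _ = return ()

main :: IO ()
main = do
  let urls = ["http://example.com/" ++ show n | n <- [1 .. 100 :: Int]]
      nWorkers = 8                      -- fixed number of worker threads
  queue <- newTQueueIO
  done  <- newTVarIO (0 :: Int)
  mapM_ (atomically . writeTQueue queue) urls
  -- Each worker repeatedly pulls a URL off the shared queue,
  -- so at most nWorkers requests are ever in flight at once.
  replicateM_ nWorkers $ forkIO $ forever $ do
    url <- atomically (readTQueue queue)
    checkLink url
    atomically (modifyTVar' done (+ 1))
  -- Block until every URL has been processed.
  atomically $ do
    n <- readTVar done
    check (n == length urls)
  putStrLn ("checked " ++ show (length urls) ++ " links")
```

The point of the queue is that the degree of concurrency is fixed by `nWorkers`, independent of how many URLs there are, so you never have thousands of sockets open simultaneously.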

However, the fact that we have green threads that get "virtually" scheduled across multiple OS threads (which ideally are pinned to individual cores) doesn't mean that we can distribute work arbitrarily without regard to CPU usage either. The issue here isn't so much that the scheduling of the CPU itself won't be handled for us, but rather that context switches (even green ones) cost cycles. Each thread, if it's working on different data, will need to pull that data into the CPU. If there's enough data, that means pulling things in and out of the CPU cache. Even absent that, it means pulling things from the cache into registers, and so on.

Even if a problem is trivially parallel, it is virtually never the right idea to just break it up as small as possible and attempt to do it "all at once".
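In the spirit of the mailing-list quote, one alternative to an explicit worker pool is to `forkIO` per URL anyway but gate the actual work with a semaphore, so only a fixed number of requests run at once. Again a hedged sketch, not the book's code: `checkLink` is a hypothetical stand-in for the HEAD request.

```haskell
import Control.Concurrent
import Control.Concurrent.QSem
import Control.Exception (bracket_)
import Control.Monad (forM_, when)

-- Stand-in for issuing a real HEAD request.
checkLink :: String -> IO ()
checkLink _ = return ()

main :: IO ()
main = do
  let urls  = ["http://example.com/" ++ show n | n <- [1 .. 50 :: Int]]
      total = length urls
  sem      <- newQSem 8               -- at most 8 checks in flight
  count    <- newMVar (0 :: Int)
  finished <- newEmptyMVar
  -- One lightweight thread per URL, but the semaphore bounds
  -- how many are actually doing network work at any moment.
  forM_ urls $ \url -> forkIO $
    bracket_ (waitQSem sem) (signalQSem sem) $ do
      checkLink url
      n <- modifyMVar count (\n -> let n' = n + 1 in return (n', n'))
      when (n == total) (putMVar finished ())
  takeMVar finished
  putStrLn ("checked " ++ show total ++ " links without a worker pool")
```

Either way, the essential ingredient is the same: some mechanism that bounds simultaneous work, whether that is a fixed set of workers draining a queue or a semaphore throttling per-URL threads.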

answered Nov 07 '22 by sclv