
Task Parallel Library for directory traversal

I'd like to traverse a directory on my hard drive and search through all the files for a specific search string. This sounds like the perfect candidate for something that could (or should) be done in parallel since the IO is rather slow.

Traditionally, I would write a recursive function that finds and processes all files in the current directory and then recurses into each of its subdirectories. I'm wondering how I can modify this to be more parallel. At first I simply changed:

foreach (string directory in directories) { ... }

to

Parallel.ForEach(directories, (directory) => { ... }) 

but I feel that this might create too many tasks and tie itself in knots, especially when trying to dispatch back onto a UI thread. I also feel that the number of tasks is unpredictable and that this might not be an efficient way to parallelize (is that a word?) this task.
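For reference, the task count can be capped with `ParallelOptions.MaxDegreeOfParallelism`. The following is only a minimal sketch of that idea, not a recommendation: `BoundedSearch`, `Search`, and `maxParallel` are hypothetical names, and it assumes every file is small enough to read as text in one go.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class BoundedSearch
{
    // Hypothetical helper: returns the paths of files under 'root'
    // whose text contains 'term', using at most 'maxParallel' workers.
    public static ConcurrentBag<string> Search(string root, string term, int maxParallel)
    {
        var hits = new ConcurrentBag<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        // Files are enumerated lazily; only the per-file read is parallel.
        // Note: an inaccessible subdirectory makes this enumeration throw,
        // and that exception propagates out of Parallel.ForEach.
        var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories);

        Parallel.ForEach(files, options, file =>
        {
            try
            {
                if (File.ReadAllText(file).Contains(term))
                {
                    hits.Add(file);
                }
            }
            catch (IOException) { /* skip files that cannot be read */ }
            catch (UnauthorizedAccessException) { /* skip files we may not open */ }
        });

        return hits;
    }
}
```

Collecting results in a thread-safe bag and handing them to the UI thread once at the end avoids dispatching from every worker. Whether any degree of parallelism above 1 actually helps depends on the storage device.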

Has anyone successfully done something like this before? What advice do you have in doing so?

asked Nov 10 '10 22:11 by rein



1 Answer

No, this doesn't sound like a good candidate for parallelism, precisely because the IO is slow. You're going to be disk-bound. Assuming you've only got one disk, you don't really want to be making it seek to multiple different places at the same time.

It's a bit like trying to attach several hoses to the same tap in order to get water out faster - or trying to run 16 CPU-bound threads on a single core :)
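By that reasoning, a plain sequential walk is a sensible baseline. Here is a minimal sketch, assuming text files and a case-sensitive match; `SequentialSearch` and `Search` are hypothetical names used only for illustration. It uses an explicit stack instead of recursion and reads each file line by line so that it can stop at the first match.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SequentialSearch
{
    // Hypothetical helper: walks 'root' one directory at a time on the
    // calling thread and returns the files that contain 'term'.
    public static List<string> Search(string root, string term)
    {
        var hits = new List<string>();
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            string[] files;
            string[] subdirs;
            try
            {
                files = Directory.GetFiles(dir);
                subdirs = Directory.GetDirectories(dir);
            }
            catch (IOException) { continue; }                 // skip unreadable directories
            catch (UnauthorizedAccessException) { continue; }

            foreach (string file in files)
            {
                try
                {
                    // File.ReadLines streams the file, so we stop at the first hit.
                    foreach (string line in File.ReadLines(file))
                    {
                        if (line.Contains(term))
                        {
                            hits.Add(file);
                            break;
                        }
                    }
                }
                catch (IOException) { /* skip unreadable files */ }
                catch (UnauthorizedAccessException) { /* skip protected files */ }
            }

            foreach (string sub in subdirs)
            {
                pending.Push(sub);
            }
        }

        return hits;
    }
}
```

To keep a UI responsive, the whole call can be run on a single background task, for example `Task.Factory.StartNew(() => SequentialSearch.Search(path, term))`, so the disk still sees only one reader at a time.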

answered Sep 22 '22 01:09 by Jon Skeet