 

Is it possible to use threads to speed up file reading?


I want to read a file as fast as possible (40k lines). [Edit: the rest is obsolete.]

Edit: Andres Jaan Tack suggested a solution based on one thread per file, and I want to make sure I understood it correctly (and that it is indeed the fastest way):

  • One thread per input file reads the whole file and stores its content in an associated container (so, as many containers as there are input files).
  • One thread computes the linear combination of each cell read by the input threads, and stores the results in the output container (associated with the output file).
  • One thread writes the content of the output container to the output file in blocks (every 4 kB of data, so about 10 lines).

Should I deduce that I must not use memory-mapped files (because the program would be on standby waiting for the data)?

Thanks in advance.

Sincerely,

Mister mystère.

Mister Mystère asked Jun 16 '10 14:06



1 Answer

Your question got a little deeper after your edit. I'll try to cover all your options...

Reading One File: How many threads?

Use one thread.

If you read straight through a file front-to-back from a single thread, the operating system will not fetch the file in small chunks like you're thinking. Rather, it will prefetch the file ahead of you in huge (exponentially growing) chunks, so you almost never pay a penalty for going to disk. You might wait for the disk a handful of times, but in general it will be like the file was already in memory, and this is even irrespective of mmap.

The OS is very good at this kind of sequential file reading, because it's predictable. When you read a file from multiple threads, you're essentially reading randomly, which is (obviously) less predictable. Prefetchers tend to be much less effective with random reads, in this case probably making the whole application slower instead of faster.

Notice: This is even before you add the cost of setting up the threads and all the rest of it. That costs something, too, but it's basically nothing compared with the cost of more blocking disk accesses.
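As a concrete sketch of the single-thread approach (the function name and file path are illustrative, not from the answer): one thread, one handle, a straight front-to-back read, and the kernel's readahead does the rest.

```cpp
// Minimal sketch: read one file front-to-back from a single thread.
// Because the access pattern is sequential, the OS prefetches ahead of
// the reader, so getline() almost always hits data already in memory.
#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> read_lines(const std::string& path) {
    std::ifstream in(path);            // one handle, one thread
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))     // the kernel's readahead keeps this fed
        lines.push_back(line);
    return lines;
}
```

At 40k lines, this loop is typically dominated by the first few disk accesses; everything after that comes out of the page cache.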

Reading Multiple Files: How many threads?

Use as many threads as you have files (or some reasonable number).

File prefetching is done separately for each open file. Once you start reading multiple files, you should read from several of them in parallel. This works because the disk I/O scheduler will try to figure out the fastest order in which to read all of them. Often, there's a disk scheduler both in the OS and on the hard drive itself. Meanwhile, the prefetcher can still do its job.

Reading several files in parallel is always better than reading the files one-by-one. If you did read them one at a time, your disk would idle between prefetches; that's valuable time to read more data into memory! The only way you can go wrong is if you have too little RAM to support many open files; that's not common anymore.

A word of caution: If you're too overzealous with your multiple file reads, reading one file will start kicking bits of other files out of memory, and you're back to a random-read situation.
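A sketch of the one-thread-per-file layout (function and path names are mine, and error handling is omitted for brevity): each thread owns its own output container, so the readers need no locking, and the disk scheduler is free to order the competing reads.

```cpp
// Sketch: one reader thread per input file, each filling its own buffer.
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Read one whole file into `out`. One thread owns one container: no locks.
void read_file(const std::string& path, std::vector<std::string>& out) {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        out.push_back(line);
}

std::vector<std::vector<std::string>>
read_all(const std::vector<std::string>& paths) {
    std::vector<std::vector<std::string>> contents(paths.size());
    std::vector<std::thread> readers;
    for (size_t i = 0; i < paths.size(); ++i)
        readers.emplace_back(read_file, paths[i], std::ref(contents[i]));
    for (auto& t : readers)
        t.join();                      // the I/O scheduler orders the reads
    return contents;
}
```

Note `std::ref`: `std::thread` copies its arguments by default, so the reference to each container must be wrapped explicitly.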

Combining n Files into One.

Processing and producing output from multiple threads might work, but it depends on how you need to combine them. You'll have to be careful about how you synchronize the threads, in any case, though there are surely some relatively easy lock-less ways to do that.
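One simple (locked, not lock-less) way to hand data from the reader threads to the combining thread is a small mutex-guarded queue; this is my own illustrative sketch, not something from the answer:

```cpp
// Sketch: a bounded-free synchronized queue between producers and a consumer.
// push() hands off an item; close() signals end-of-input; pop() blocks until
// an item arrives or the queue is closed and drained.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class SyncQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lk(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    std::optional<T> pop() {           // blocks; nullopt means "no more data"
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};
```

A lock-free ring buffer would also work and avoids the mutex, but for 40k lines the contention on a queue like this is negligible.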

One thing to look for, though: Don't bother writing the file in small (< 4K) blocks. Collect at least 4K of data at a time before you call write(). Also, since the kernel will lock the file when you write it, don't call write() from all of your threads together; they'll all wait for each other instead of processing more data.
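The buffered-write advice can be sketched like this (class and path names are illustrative): a single writer accumulates output and only touches the file once it has at least 4 KiB, so the underlying write happens rarely and always from one thread.

```cpp
// Sketch: collect at least 4 KiB of output before writing it to disk,
// from a single writer thread.
#include <fstream>
#include <string>

class BlockWriter {
public:
    explicit BlockWriter(const std::string& path)
        : out_(path, std::ios::binary) {}
    ~BlockWriter() { flush(); }        // write out whatever remains

    void append(const std::string& line) {
        buf_ += line;
        buf_ += '\n';
        if (buf_.size() >= 4096)       // roughly one write per 4 KiB block
            flush();
    }

    void flush() {
        out_.write(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        buf_.clear();
    }

private:
    std::ofstream out_;
    std::string buf_;
};
```

If multiple worker threads produce output, have them feed this one writer (e.g. through a queue) rather than each calling write() themselves.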

Andres Jaan Tack answered Sep 24 '22 00:09