
How to parallelize file reading and writing

I have a program which reads data from 2 text files and then saves the result to another file. Since there is a lot of data to read and write, which causes a performance hit, I want to parallelize the reading and writing operations.

My initial thought is to use 2 threads, for example: one thread reads/writes from the beginning of the file, and another reads/writes from the middle. Since my files are organized as lines rather than fixed-size records (each line may contain a different number of bytes), seeking by byte offset does not work for me. The only solution I can think of is to use getline() to skip over the preceding lines first, which might not be efficient.

Is there a good way to seek to a specified line in a file? Or do you have any other ideas for parallelizing file reading and writing?

Environment: Win32, C++, NTFS, Single Hard Disk

Thanks.

-Dbger

Baiyan Huang asked Jan 03 '10 02:01




2 Answers

Generally speaking, you do NOT want to parallelize disk I/O. Hard disks do not like random I/O because they have to keep seeking to get to the data. Assuming you're not using RAID, and you're using hard drives as opposed to some kind of solid-state storage, you will see severe performance degradation if you parallelize I/O (even with technologies like those, you can still see some degradation when doing lots of random I/O).

To answer your second question, there really isn't a good way to seek to a certain line in a file; you can only seek explicitly to a byte offset (for example with seekg() on a C++ stream) and then read from there using the read function (see this page for more details on how to use it).
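As an illustration of working with byte offsets (a sketch, not code from the answer above): one common workaround is to scan the file once sequentially, record the byte offset at which each line starts, and then seek directly to any line later. The function names build_line_index and read_line_at are hypothetical, and the sketch assumes the file does not change between indexing and reading.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Scan the file once and record the byte offset at which each line starts.
// The file is opened in binary mode so that offsets count real bytes and are
// not affected by the "\r\n" translation done in text mode on Win32.
std::vector<std::streamoff> build_line_index(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::streamoff> offsets;
    std::string line;
    std::streamoff pos = 0;
    while (std::getline(in, line)) {
        offsets.push_back(pos);
        // +1 accounts for the '\n' that getline consumed but did not store.
        pos += static_cast<std::streamoff>(line.size()) + 1;
    }
    return offsets;
}

// Jump straight to line n (0-based) using the index and return its text.
std::string read_line_at(const std::string& path,
                         const std::vector<std::streamoff>& offsets,
                         std::size_t n) {
    assert(n < offsets.size());
    std::ifstream in(path, std::ios::binary);
    in.seekg(offsets[n]);
    std::string line;
    std::getline(in, line);
    // In binary mode a Windows line ending leaves a trailing '\r'; drop it.
    if (!line.empty() && line[line.size() - 1] == '\r') {
        line.erase(line.size() - 1);
    }
    return line;
}
```

Building the index is itself a full sequential pass over the file, so it only pays off if the same file is accessed by line number many times. It does not remove the seek cost on a single hard disk that the answer warns about.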

Mike answered Dec 04 '22 19:12


Queuing multiple reads and writes won't help when you're running against one disk. If your app also did a lot of CPU work, then you could do your reads and writes asynchronously and let the CPU work while the disk I/O occurs in the background. Alternatively, get a second physical hard drive: read from one, write to the other. For modestly sized data sets that's often effective and quite a bit cheaper than writing code.
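The idea of letting the CPU work while disk I/O happens in the background can be sketched roughly as follows, using standard C++11 std::async instead of Win32 overlapped I/O. The file is still read strictly sequentially by one task at a time; only the processing of the previous chunk overlaps with the next read. The names read_chunk, process_line and process_file_overlapped are hypothetical, and process_line is a placeholder for whatever CPU-bound work the real program does.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Read up to max_lines lines; an empty result means end of file.
std::vector<std::string> read_chunk(std::ifstream& in, std::size_t max_lines) {
    std::vector<std::string> lines;
    std::string line;
    while (lines.size() < max_lines && std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical stand-in for the CPU-bound work done on each line.
std::size_t process_line(const std::string& line) {
    return line.size();
}

// Process the file chunk by chunk, reading the next chunk in the background
// while the current one is being processed on the calling thread.
std::size_t process_file_overlapped(const std::string& path,
                                    std::size_t chunk_lines) {
    assert(chunk_lines > 0);
    std::ifstream in(path);
    std::size_t total = 0;
    std::vector<std::string> current = read_chunk(in, chunk_lines);
    while (!current.empty()) {
        // Only the background task touches the stream until get() returns.
        std::future<std::vector<std::string>> next = std::async(
            std::launch::async,
            [&in, chunk_lines] { return read_chunk(in, chunk_lines); });
        for (const std::string& l : current) {
            total += process_line(l);
        }
        current = next.get();
    }
    return total;
}
```

This only helps if the per-line processing is expensive enough to be worth overlapping. If the program is purely I/O bound, the single disk remains the bottleneck, as both answers point out.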

Curt Nichols answered Dec 04 '22 17:12