
Parallel I/O - why does it work?

I have a Python function which reads a line from a text file and writes it to another text file. It repeats this for every line in the file. Essentially:

Read line 1 -> Write line 1 -> Read line 2 -> Write line 2...

And so on.

I can parallelise this process, using a queue to pass data, so it is more like:

Read line 1 -> Read line 2 -> Read line 3...

              Write line 1 -> Write line 2....
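The original code isn't shown, so the following is only a minimal sketch of what such a pipeline might look like, assuming two threads and a `queue.Queue`. The names `copy_lines`, `_DONE` and `maxsize` are hypothetical, and a multiprocessing version would be structured similarly.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input


def copy_lines(src_path, dst_path, maxsize=1024):
    """Copy src_path to dst_path line by line with a reader and a writer thread."""
    q = queue.Queue(maxsize=maxsize)  # bounded, so the reader cannot run far ahead

    def reader():
        with open(src_path, "r") as src:
            for line in src:
                q.put(line)
        q.put(_DONE)  # tell the writer there is nothing more to come

    def writer():
        with open(dst_path, "w") as dst:
            while True:
                line = q.get()
                if line is _DONE:
                    break
                dst.write(line)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `copy_lines("in.txt", "out.txt")` would read and write concurrently, with the queue decoupling the two sides.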

My question is: why does this work (that is, why do I get a speed-up)? It sounds like a daft question, but surely my hard disk can only do one thing at a time? So why isn't one process put on hold until the other has finished?

Details like this are hidden from the user when writing in a high-level language. I'd like to know what's going on at the low level.

jramm asked Mar 31 '14 06:03



1 Answer

In short: IO buffering. Two levels of it, even.

First, Python itself has IO buffers. When you write all those lines to the file, Python doesn't necessarily invoke the write syscall immediately - it does that when it flushes its buffers, which could be any time between your call to write and the moment you close the file. This layer obviously doesn't apply if you work at a low enough level that you make the syscalls yourself (for example via os.write).
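One way to observe Python's own write buffer is to compare the file size before and after an explicit flush. This is a small sketch, not a benchmark; the helper name `size_before_and_after_flush` is hypothetical, and it assumes the default buffered mode of `open` with a payload much smaller than the buffer.

```python
import os
import tempfile


def size_before_and_after_flush(data=b"hello\n"):
    """Return the on-disk file size before and after flushing Python's buffer."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:         # default: buffered binary file
            f.write(data)                   # data sits in Python's buffer
            before = os.path.getsize(path)  # the OS has not seen the data yet
            f.flush()                       # now the write syscall happens
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)
```

With a payload this small, the first size should be 0 and the second the length of the data, because nothing reaches the OS until the flush.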

Separately from this, the operating system also implements buffers. These work the same way: you make the 'write to disk' syscall, and the OS puts the data in its own write buffer, which it will also use to serve other processes that read the file back. But it doesn't necessarily write the data to disk yet - it can wait, theoretically, until you unmount that filesystem (possibly at shutdown). This is (part of) why it can be a bad idea to unplug a USB storage device without unmounting or 'safely removing' it, for example - things you've written to it aren't necessarily physically on the device yet. Anything the OS does is unaffected by what language you're writing in, or by how thick a wrapper you have around the syscalls.
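If you do need the data pushed past both buffers, the usual pattern is to flush Python's buffer and then ask the OS to sync the file. A minimal sketch follows; the function name `write_durably` is hypothetical, and whether the data is truly on the physical medium afterwards still depends on the OS and the device.

```python
import os


def write_durably(path, text):
    """Write text, then push it through Python's buffer and the OS cache."""
    with open(path, "w") as f:
        f.write(text)
        f.flush()             # Python's buffer -> OS (write syscall)
        os.fsync(f.fileno())  # ask the OS to write its cached data to the device
```

This is slower than a plain write, which is exactly why neither Python nor the OS does it by default.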

As well as this, both Python and the OS can do read buffering - essentially, when you read one line from the file, Python/the OS anticipates that you might be interested in the next several lines as well, and so reads them into main memory to avoid having to go all the way down to the disk itself later.
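Python's read-ahead can be made visible by comparing the position your code sees with the position of the underlying unbuffered file. This is a rough sketch with a hypothetical helper name, `positions_after_first_line`; it relies on the `raw` attribute of a buffered binary file and says nothing about the OS-level read-ahead, which happens below it.

```python
def positions_after_first_line(path):
    """Read one line and report how far Python actually read from the OS."""
    with open(path, "rb") as f:      # buffered binary reader
        first = f.readline()         # the caller asked for a single line
        logical_pos = f.tell()       # where the caller appears to be
        raw_pos = f.raw.tell()       # how far the underlying file has been read
    return len(first), logical_pos, raw_pos
```

On a file with many short lines, `raw_pos` should be well past `logical_pos`, since a whole buffer's worth was fetched to serve one line.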

lvc answered Oct 12 '22 12:10