Parallel I/O - why does it work?

Tags: python, io, parallel-processing

I have a Python function that reads a line from a text file and writes it to another text file, repeating this for every line in the file. Essentially:

Read line 1 -> Write line 1 -> Read line 2 -> Write line 2...

And so on.
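
For concreteness, the sequential version looks roughly like the sketch below. It is a simplified illustration rather than the real function: `copy_lines_sequential` and its path arguments are placeholder names.

```python
def copy_lines_sequential(src_path, dst_path):
    """Copy a text file one line at a time and return the number of lines copied."""
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:          # read line N
            dst.write(line)       # write line N before reading line N + 1
            count += 1
    return count
```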

I can parallelise this process, using a queue to pass data, so it is more like:

Read line 1 -> Read line 2 -> Read line 3...

              Write line 1 -> Write line 2....
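
A minimal sketch of that queue-based layout is below. It assumes threads and `queue.Queue` to keep the example short; with separate processes, `multiprocessing.Queue` would play the same role. The names `copy_lines_pipelined`, `max_pending` and `_DONE` are placeholders for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel telling the writer there are no more lines


def copy_lines_pipelined(src_path, dst_path, max_pending=1000):
    """Read lines in one thread and write them in another, linked by a queue."""
    lines = queue.Queue(maxsize=max_pending)  # bounded, so the reader can't run far ahead

    def reader():
        try:
            with open(src_path, "r") as src:
                for line in src:
                    lines.put(line)
        finally:
            lines.put(_DONE)  # always signal the writer, even if reading fails

    def writer():
        with open(dst_path, "w") as dst:
            while True:
                line = lines.get()
                if line is _DONE:
                    break
                dst.write(line)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```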

My question is: why does this work? That is, why do I get a speed-up? It sounds like a daft question, but surely my hard disk can only do one thing at once, so why isn't one process put on hold until the other has finished?

Things like this are hidden from the user when writing in a high-level language. I'd like to know what's going on at a low level.

asked Mar 31 '14 06:03

jramm

1 Answer

In short: IO buffering. Two levels of it, even.

First, Python itself has IO buffers. When you write all those lines to the file, Python doesn't necessarily invoke the write syscall immediately - it does so when it flushes its buffers, which could be any time between your call to write and the moment you close the file. This layer obviously doesn't apply if you're working at a low enough level that you make the syscalls yourself.
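
If you want to see Python's own buffer at work, here's a minimal sketch. The function name is made up for illustration, and it assumes a short write that fits comfortably inside Python's default buffer. It compares the file size the OS reports before and after an explicit flush.

```python
import os
import tempfile


def size_before_and_after_flush(text="one short line\n"):
    """Return the file size the OS reports before and after flushing Python's buffer."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write(text)
            before = os.path.getsize(path)  # data is still sitting in Python's buffer
            f.flush()                       # now Python hands the data to the OS
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)
```

Note that a non-zero size after the flush only means the OS has the data, not that it has reached the physical disk.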

Separately from this, the operating system also implements buffers. These work the same way - you make the 'write to disk' syscall, and the OS puts the data in its write buffer, which it will also use to serve other processes that read that file back. But it doesn't necessarily write the data to disk yet - it can wait, theoretically, until you unmount that filesystem (possibly at shutdown). This is (part of) why it can be a bad idea to unplug a USB storage device without unmounting or 'safely removing' it - things you've written to it aren't necessarily physically on the device yet. Anything the OS does is unaffected by what language you're writing in, or how thick a wrapper around the syscalls you have.
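
For the OS layer, the rough sketch below bypasses Python's buffering with `os.write` and then uses `os.fsync` to ask the OS to push its own buffer out to the device. `write_durably` is a hypothetical helper name, and the sketch assumes the data is small enough to go out in a single write call.

```python
import os


def write_durably(path, data):
    """Write bytes with raw syscalls, then ask the OS to flush them to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, data)  # returns once the OS has accepted the data
        os.fsync(fd)                  # blocks until the OS reports the data is written out
        return written
    finally:
        os.close(fd)
```

Without the `os.fsync` call the write still succeeds and other processes can read the data back; it just may not be on the disk yet.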

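As well as this, both Python and the OS can do read buffering - essentially, when you read one line from the file, Python/the OS anticipates that you might be interested in the next several lines as well, and so reads them into main memory to avoid having to go all the way down to the disk itself later. The upshot is that most of your individual reads and writes are served from memory rather than each one waiting on the disk, so the reader and the writer rarely contend for it.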
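
Python's side of the read buffering can be observed directly. The sketch below uses an illustrative function name and assumes a scratch file larger than Python's read buffer. It reads a single line and then compares how far your code has got with how far Python has actually read from the OS. Any read-ahead the OS does underneath is not visible from here.

```python
import os
import tempfile


def positions_after_one_readline(num_lines=5000):
    """Return (bytes consumed by the caller, bytes Python fetched from the OS)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(num_lines):
                f.write(b"line %d\n" % i)
        with open(path, "rb") as f:
            f.readline()              # ask for just the first line
            logical = f.tell()        # position as seen by your code
            fetched = f.raw.tell()    # position of the underlying unbuffered file
        return logical, fetched
    finally:
        os.remove(path)
```

The second number comes out much larger than the first: Python pulled in a whole buffer's worth of data to satisfy a one-line read.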

answered Oct 12 '22 12:10

lvc
