
Why does copying a file line by line greatly affect copy speed in Python?

Tags: python, file

A while ago, I made a Python script which looked similar to this:

with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        w.write(line)

As expected, this ran pretty slowly on a 100 MB file.

However, I then changed the program to do this:

ls = []
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        ls.append(line)
        if len(ls) == 100000:
            w.writelines(ls)
            del ls[:]
    # write whatever is left over after the last full batch
    w.writelines(ls)

And the file copied much faster. My question is, why is the second method faster even though the program copies the same number of lines (it just collects them first and writes them out in batches instead of one by one)?
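To reproduce the comparison, here is a minimal sketch that times both approaches on a generated file. The function names, the temporary file names and the `num_lines` default are made up for illustration, and the timings will vary with the machine, the Python version and the file size, so treat the numbers as a rough indication only.

```python
import os
import tempfile
import time


def copy_line_by_line(src, dst):
    """Call write() once per line from Python code."""
    with open(src, "r") as f, open(dst, "a") as w:
        for line in f:
            w.write(line)


def copy_in_batches(src, dst, batch_size=100000):
    """Collect lines in a list and hand each batch to writelines()."""
    batch = []
    with open(src, "r") as f, open(dst, "a") as w:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                w.writelines(batch)
                del batch[:]
        # write the lines left over after the last full batch
        w.writelines(batch)


def timed(func, src, dst):
    """Return the wall-clock seconds taken by func(src, dst)."""
    start = time.perf_counter()
    func(src, dst)
    return time.perf_counter() - start


def compare(num_lines=250000):
    """Time both copies and check that they produce identical output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "somefile.txt")
        out_line = os.path.join(tmp, "by_line.txt")
        out_batch = os.path.join(tmp, "batched.txt")
        with open(src, "w") as f:
            f.writelines("line %d\n" % i for i in range(num_lines))
        t_line = timed(copy_line_by_line, src, out_line)
        t_batch = timed(copy_in_batches, src, out_batch)
        with open(src) as a, open(out_line) as b, open(out_batch) as c:
            original = a.read()
            same = (b.read() == original) and (c.read() == original)
    return t_line, t_batch, same


if __name__ == "__main__":
    t_line, t_batch, same = compare()
    print("write per line: %.3fs" % t_line)
    print("writelines in batches: %.3fs" % t_batch)
    print("outputs identical:", same)
```

Checking that both outputs match the source matters here because a batched copy that forgets the final partial batch would look faster simply by writing less.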

927

asked Jul 27 '15 15:07

ytpillai

1 Answer

I may have found a reason why write is slower than writelines. Looking through the CPython source (3.4.3), I found the code for the write function (irrelevant parts removed).

Modules/_io/fileio.c

static PyObject *
fileio_write(fileio *self, PyObject *args)
{
    Py_buffer pbuf;
    Py_ssize_t n, len;
    int err;
    ...
    n = write(self->fd, pbuf.buf, len);
    ...

    PyBuffer_Release(&pbuf);

    if (n < 0) {
        if (err == EAGAIN)
            Py_RETURN_NONE;
        errno = err;
        PyErr_SetFromErrno(PyExc_IOError);
        return NULL;
    }

    return PyLong_FromSsize_t(n);
}

Notice that this function actually returns a value, the number of bytes written, and that building this value takes another function call (PyLong_FromSsize_t).

I tested this out to see if it actually had a return value, and it did.

with open('test.txt', 'w+') as f:
    x = f.write("hello")
    print(x)

>>> 5
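One caveat worth keeping in mind: fileio_write is the raw, byte-level layer, whereas open() in text mode hands back a text wrapper on top of it. As a small sketch (the sample string and file name are arbitrary), write on a text file reports the number of characters, while write on a binary file reports the number of bytes:

```python
import os
import tempfile

text = "héllo"  # 5 characters, but 6 bytes once encoded as UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # Text mode: the return value counts characters.
    with open(path, "w", encoding="utf-8") as f:
        chars_written = f.write(text)

    # Binary mode: the return value counts bytes.
    with open(path, "wb") as f:
        bytes_written = f.write(text.encode("utf-8"))

print(chars_written, bytes_written)  # prints: 5 6
```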

The following is the implementation of the writelines function in CPython (irrelevant parts removed).

Modules/_io/iobase.c

static PyObject *
iobase_writelines(PyObject *self, PyObject *args)
{
    PyObject *lines, *iter, *res;

    ...

    while (1) {
        PyObject *line = PyIter_Next(iter);
        ...
        res = NULL;
        do {
            res = PyObject_CallMethodObjArgs(self, _PyIO_str_write, line, NULL);
        } while (res == NULL && _PyIO_trap_eintr());
        Py_DECREF(line);
        if (res == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
        Py_DECREF(res);
    }
    Py_DECREF(iter);
    Py_RETURN_NONE;
}

Notice that there is no return value here! The function simply ends with Py_RETURN_NONE instead of making another function call to build the size of the written value.

So, I went ahead and tested that there really wasn't a return value.

with open('test.txt', 'w+') as f:
    x = f.writelines(["hello", "hello"])
    print(x)

>>> None
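The loop in iobase_writelines shown above calls the object's own write method once per line and discards each result. A minimal sketch to observe this from Python, assuming a hypothetical CountingWriter class that inherits writelines unchanged from the io base class:

```python
import io


class CountingWriter(io.RawIOBase):
    """Hypothetical in-memory writer that counts calls to write()."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        # Record the call, keep the data, and report the bytes "written".
        self.calls += 1
        self.chunks.append(bytes(data))
        return len(data)


w = CountingWriter()
result = w.writelines([b"a\n", b"b\n", b"c\n"])

print(result)   # prints: None
print(w.calls)  # prints: 3
```

So writelines does not avoid the write calls themselves; it makes them from C rather than from a Python-level loop.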

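The extra time that write takes seems to be due to per-call overhead rather than the file I/O itself: every write call made from a Python loop has to be dispatched by the interpreter and produces a return value that is immediately thrown away. writelines still calls write once per line internally (the PyObject_CallMethodObjArgs call above), but it does so from a loop in C, so less of that per-line work happens in Python and the file I/O is left as the main bottleneck.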

Edit: write documentation

182

answered Oct 21 '22 22:10

Brobin
