Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does copying a file line by line greatly affect copy speed in Python?

Tags:

python

file

A while ago, I made a Python script which looked similar to this:

with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        w.write(line)

Which, of course, worked pretty slowly on a 100mb file.

However, I changed the program to do this

ls = []
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        ls.append(line)
        if len(ls) == 100000:
            w.writelines(ls)
            del ls[:]

And the file copied much faster. My question is, why does the second method work faster even though the program copies the same number of lines (albeit collects them and prints them one by one)?

like image 927
ytpillai Avatar asked Jul 27 '15 15:07

ytpillai


People also ask

Why is cutting faster than copying?

If we are cutting(moving) within a same disk, then it will be faster than copying because only the file path is modified, actual data is on the disk. If the data is copied from one disk to another, it will be relatively faster than cutting because it is doing only COPY operation. Save this answer.

Is copying files slower than moving?

The quality of the target drive can play a role in the speed. Newer drives—depending on the type of port they use—allow for faster data transfer. But all in all, you shouldn't see a difference in speed when copying files on the same drive or outside of it.

What is the difference between copy and Deepcopy in Python?

A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original. A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

What is a shallow copy in Python?

A shallow copy means constructing a new collection object and then populating it with references to the child objects found in the original. In essence, a shallow copy is only one level deep. The copying process does not recurse and therefore won't create copies of the child objects themselves.


1 Answers

I may have found a reason why write is slower than writelines. In looking through the CPython source (3.4.3) I found the code for the write function (took out irrelevent parts).

Modules/_io/fileio.c

static PyObject *
fileio_write(fileio *self, PyObject *args)
{
    Py_buffer pbuf;
    Py_ssize_t n, len;
    int err;
    ...
    n = write(self->fd, pbuf.buf, len);
    ...

    PyBuffer_Release(&pbuf);

    if (n < 0) {
        if (err == EAGAIN)
            Py_RETURN_NONE;
        errno = err;
        PyErr_SetFromErrno(PyExc_IOError);
        return NULL;
    }

    return PyLong_FromSsize_t(n);
}

If you notice, this function actually returns a value, the size of the string that has been written, which is another function call.

I tested this out to see if it actually had a return value, and it did.

with open('test.txt', 'w+') as f:
    x = f.write("hello")
    print(x)

>>> 5

The following is the code for the writelines function implementation in CPython (took out irrelevent parts).

Modules/_io/iobase.c

static PyObject *
iobase_writelines(PyObject *self, PyObject *args)
{
    PyObject *lines, *iter, *res;

    ...

    while (1) {
        PyObject *line = PyIter_Next(iter);
        ...
        res = NULL;
        do {
            res = PyObject_CallMethodObjArgs(self, _PyIO_str_write, line, NULL);
        } while (res == NULL && _PyIO_trap_eintr());
        Py_DECREF(line);
        if (res == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
        Py_DECREF(res);
    }
    Py_DECREF(iter);
    Py_RETURN_NONE;
}

If you notice, there is no return value! It simply has Py_RETURN_NONE instead of another function call to calculate the size of the written value.

So, I went ahead and tested that there really wasn't a return value.

with open('test.txt', 'w+') as f:
    x = f.writelines(["hello", "hello"])
    print(x)

>>> None

The extra time that write takes seems to be due to the extra function call taken in the implementation to produce the return value. By using writelines, you skip that step and the fileio is the only bottleneck.

Edit: write documentation

like image 182
Brobin Avatar answered Oct 21 '22 22:10

Brobin