A while ago, I made a Python script which looked similar to this:
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
for line in f:
w.write(line)
This, of course, ran pretty slowly on a 100 MB file.
However, I changed the program to do this:
ls = []
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        ls.append(line)
        if len(ls) == 100000:
            w.writelines(ls)
            del ls[:]
And the file copied much faster. My question is: why is the second method faster, even though the program copies the same number of lines (it just collects them first and writes them out in batches)?
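For anyone who wants to measure the difference themselves, here is a rough timing sketch of the two approaches, using the same placeholder file names as above (somefile.txt must already exist); note that it also flushes any leftover batch at the end, which the second snippet above omits:

import time

def copy_line_by_line():
    # version 1: one write() call per line
    with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
        for line in f:
            w.write(line)

def copy_in_batches(batch_size=100000):
    # version 2: collect lines and hand them to writelines() in batches
    ls = []
    with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
        for line in f:
            ls.append(line)
            if len(ls) == batch_size:
                w.writelines(ls)
                del ls[:]
        if ls:
            w.writelines(ls)  # flush whatever is left over

for fn in (copy_line_by_line, copy_in_batches):
    start = time.perf_counter()
    fn()
    print(fn.__name__, "took", time.perf_counter() - start, "seconds")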
I may have found a reason why write is slower than writelines. Looking through the CPython source (3.4.3), I found the code for the write function (irrelevant parts removed).
Modules/_io/fileio.c
static PyObject *
fileio_write(fileio *self, PyObject *args)
{
    Py_buffer pbuf;
    Py_ssize_t n, len;
    int err;
    ...
    n = write(self->fd, pbuf.buf, len);
    ...
    PyBuffer_Release(&pbuf);

    if (n < 0) {
        if (err == EAGAIN)
            Py_RETURN_NONE;
        errno = err;
        PyErr_SetFromErrno(PyExc_IOError);
        return NULL;
    }

    return PyLong_FromSsize_t(n);
}
Notice that this function actually returns a value, the number of bytes written, and building that value takes another function call (PyLong_FromSsize_t).
I tested this out to see if it actually had a return value, and it did.
with open('test.txt', 'w+') as f:
    x = f.write("hello")
    print(x)
>>> 5
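As a side note, that return value is usable; here is a small sketch (the file name is just an example) that adds up the character counts reported by each write() call:

total = 0
with open("test.txt", "w") as f:
    for line in ["hello\n", "world\n"]:
        total += f.write(line)  # write() returns the number of characters written
print(total)  # 12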
The following is the code for the writelines function implementation in CPython (irrelevant parts removed).
Modules/_io/iobase.c
static PyObject *
iobase_writelines(PyObject *self, PyObject *args)
{
    PyObject *lines, *iter, *res;
    ...
    while (1) {
        PyObject *line = PyIter_Next(iter);
        ...
        res = NULL;
        do {
            res = PyObject_CallMethodObjArgs(self, _PyIO_str_write, line, NULL);
        } while (res == NULL && _PyIO_trap_eintr());
        Py_DECREF(line);
        if (res == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
        Py_DECREF(res);
    }
    Py_DECREF(iter);
    Py_RETURN_NONE;
}
Notice that there is no return value! It simply does Py_RETURN_NONE instead of making another function call to build the size of the written data.
So, I went ahead and tested that there really wasn't a return value.
with open('test.txt', 'w+') as f:
    x = f.writelines(["hello", "hello"])
    print(x)
>>> None
The extra time that write takes seems to come from the extra function call the implementation makes to produce the return value. By using writelines, you skip that step, and the file I/O is the only bottleneck.
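If you want to see how much that per-call overhead matters on your own machine, a rough microbenchmark along these lines (file names and line count are arbitrary, and the numbers will vary with OS, disk, and Python build) compares one write() call per line against a single writelines() call over the same batch:

import timeit

lines = ["some line of text\n"] * 100000

def write_per_line():
    # one write() call (and one return-value allocation) per line
    with open("bench_write.txt", "w") as f:
        for line in lines:
            f.write(line)

def writelines_batch():
    # a single writelines() call over the whole batch
    with open("bench_writelines.txt", "w") as f:
        f.writelines(lines)

print("write per line:  ", timeit.timeit(write_per_line, number=5))
print("writelines batch:", timeit.timeit(writelines_batch, number=5))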
Edit: write documentation