As seen in How do I copy a file in Python?, there are many file copy functions:
shutil.copy
shutil.copy2
shutil.copyfile
(and also shutil.copyfileobj)
or even a naive method:
with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
    while True:
        block = f.read(16 * 1024 * 1024)  # work in blocks of 16 MB
        if not block:  # EOF
            break
        g.write(block)
Among all these methods, which ones are safe if the copy is interrupted (for example, by killing the Python process)? The last one in the list looks OK.
By safe I mean: if a 1 GB file copy is not 100% finished (say it is interrupted mid-copy, after 400 MB), the file size should not be reported as 1 GB by the filesystem; it should report only the bytes actually written (here, 400 MB).
The worst case would be if the final file size were written first (internally with a fallocate or ftruncate?). This would be a problem if the copy is interrupted: by looking at the file size, we would think the file was written correctly.
Many incremental backup programs (I'm writing one) use "filename+mtime+fsize" to decide whether a file has to be copied or is already there (of course a better solution would be to SHA256 the source and destination files, but that is too time-consuming to do on every sync; off-topic here).
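For illustration, here is a minimal sketch of such a size+mtime check (the function name needs_copy and the exact comparison are my own, not from any particular backup tool):

import os

def needs_copy(src: str, dst: str) -> bool:
    # Heuristic used by many incremental backup tools: re-copy only if
    # the destination is missing or its (size, mtime) differ from the source.
    try:
        d = os.stat(dst)
    except FileNotFoundError:
        return True
    s = os.stat(src)
    return (s.st_size, int(s.st_mtime)) != (d.st_size, int(d.st_mtime))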
So I want to make sure that the "copy file" function does not store the final file size immediately (then it could fool the fsize comparison) before copying the actual file content.
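(For completeness, one way to sidestep the issue entirely is to copy into a temporary file and rename it into place only once the data is fully written; a sketch, assuming a same-directory temp file with a hypothetical '.part' suffix is acceptable:)

import os
import shutil

def safe_copy(src, dst):
    # Write to a temp file, then atomically rename into place. If the
    # process is killed mid-copy, dst is never left half-written under
    # its final name.
    tmp = dst + '.part'  # hypothetical temp-file naming scheme
    with open(src, 'rb') as f, open(tmp, 'wb') as g:
        shutil.copyfileobj(f, g, length=16 * 1024 * 1024)
        g.flush()
        os.fsync(g.fileno())  # make sure the data is on disk first
    os.replace(tmp, dst)  # atomic on both POSIX and Windows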
Note: I'm asking the question because, while shutil.copyfile was rather straightforward on Python 3.7 and below, see source (it is more or less the naive method above), it is much more complicated on Python 3.9, see source, with many different cases for Windows, Linux, macOS, "fastcopy" tricks, etc.
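(If you want the plain read/write loop on every platform, one option is to open the file objects yourself and call shutil.copyfileobj directly, since as of the 3.9 sources the platform-specific fast paths live inside shutil.copyfile; note this relies on an implementation detail:)

import shutil

# Drive the plain pure-Python loop directly, bypassing copyfile()'s
# platform-specific fast paths (sendfile on Linux, fcopyfile on macOS,
# a readinto-based loop on Windows).
with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
    shutil.copyfileobj(f, g)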
Assuming that destfile does not exist prior to the copy, the naive method is safe, per your definition of safe. shutil.copyfileobj() and shutil.copyfile() are a close second in the ranking. shutil.copy() is next, and shutil.copy2() would be last.
Explanation:
It is the filesystem's job to guarantee consistency based on application requests. If you have only written X bytes to a file, the file size will only account for those X bytes.
Therefore, doing direct FS operations like the naive method will work.
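A quick demonstration of that point (a throwaway sketch; 'demo.bin' is just a scratch file):

import os

with open('demo.bin', 'wb') as g:
    g.write(b'x' * 400)
    g.flush()
    # The filesystem reports exactly what was written, nothing more:
    print(os.path.getsize('demo.bin'))  # -> 400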
It is now a matter of what these higher-level functions do with the filesystem.
The API doesn't state what happens if Python crashes mid-copy, but it is a de facto expectation that these functions behave like Unix cp, i.e. that they don't mess with the file size.
Assuming the maintainers of CPython don't want to break people's expectations, all these functions should be safe per your definition.
That said, it isn't guaranteed anywhere, AFAICT.
However, shutil.copyfileobj() and shutil.copyfile() expressly promise in their API not to copy metadata, so they are not likely to try to set the size.
shutil.copy() wouldn't try to set the file size, only the mode, and on most filesystems setting the size and setting the mode require two different filesystem operations, so it should still be safe.
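Indeed, shutil.copy() is documented as copyfile() followed by copymode(); roughly (a simplified sketch of my own, omitting symlink handling):

import os
import shutil

def copy_sketch(src, dst):
    # Roughly what shutil.copy() does:
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    shutil.copyfile(src, dst)  # 1) write the data
    shutil.copymode(src, dst)  # 2) then set the mode -- a separate chmod()
    return dst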
shutil.copy2() says it will copy metadata, and if you look at its source code, you'll see that it only copies the metadata after copying the data, so even that should be safe. Moreover, copying the metadata doesn't copy the size.
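Roughly (again a simplified sketch), copy2() is copyfile() followed by copystat(), and copystat() transfers times, flags and permission bits, never the size:

import shutil

def copy2_sketch(src, dst):
    # Roughly what shutil.copy2() does: data first, then metadata.
    shutil.copyfile(src, dst)
    shutil.copystat(src, dst)  # mode bits, atime/mtime, flags -- not the size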
So this would only be a problem if one of the internal functions Python uses tried to optimize with ftruncate(), fallocate(), or some such, which is unlikely given that people who write system APIs (like the Python maintainers) are very aware of people's expectations.