Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Safely extract zip or tar using Python

I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:

some_path = '/destination/path'
some_zip = '/some/file.zip'
zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
    zipf.extract(subfile, some_path)

Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, what way can I ensure that files will never wind up outside the destination directory?

like image 449
jterrace Avatar asked Apr 08 '12 03:04

jterrace


People also ask

How do I extract a tar file in Python?

How to extract a single file from tar file in Python? To extract only a single file from a zipped folder, we can use the extractfile() method of the tarfile module. This method takes a filename as an argument and extracts the file in our working directory.

Is ZIP file built in Python?

Python also provides a high-level module called zipfile specifically designed to create, read, write, extract, and list the content of ZIP files. In this tutorial, you'll learn about Python's zipfile and how to use it effectively.


4 Answers

Note: Starting with python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.

To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath and it does not contain the current directory as a prefix, it's pointing outside it.

But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.

That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.

Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.

import tarfile
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr

resolved = lambda x: realpath(abspath(x))

def badpath(path, base):
    # joinpath will ignore base if path is absolute
    return not resolved(joinpath(base,path)).startswith(base)

def badlink(info, base):
    # Links are interpreted relative to the directory containing the link
    tip = resolved(joinpath(base, dirname(info.name)))
    return badpath(info.linkname, base=tip)

def safemembers(members):
    base = resolved(".")

    for finfo in members:
        if badpath(finfo.name, base):
            print >>stderr, finfo.name, "is blocked (illegal path)"
        elif finfo.issym() and badlink(finfo,base):
            print >>stderr, finfo.name, "is blocked: Hard link to", finfo.linkname
        elif finfo.islnk() and badlink(finfo,base):
            print >>stderr, finfo.name, "is blocked: Symlink to", finfo.linkname
        else:
            yield finfo

ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()

Edit: Starting with python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract() prohibits the creation of files outside the sandbox:

Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).

The tarfile class has not been similarly sanitized, so the above answer still apllies.

like image 164
alexis Avatar answered Oct 06 '22 00:10

alexis


Use ZipFile.infolist()/TarFile.next()/TarFile.getmembers() to get the information about each entry in the archive, normalize the path, open the file yourself, use ZipFile.open()/TarFile.extractfile() to get a file-like for the entry, and copy the entry data yourself.

like image 42
Ignacio Vazquez-Abrams Avatar answered Oct 05 '22 23:10

Ignacio Vazquez-Abrams


Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here was my final solution which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:

import zipfile, os

def safe_unzip(zip_file, extract_path='.'):
    with zipfile.ZipFile(zip_file, 'r') as zf:
        for member in zf.infolist():
            file_path = os.path.realpath(os.path.join(extract_path, member.filename))
            if file_path.startswith(os.path.realpath(extract_path)):
                zf.extract(member, extract_path)

Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.

Edit 2: s/abspath/realpath/g. Thanks TheLizzard

like image 26
shellster Avatar answered Oct 06 '22 01:10

shellster


Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.

Alternatively, you can call unzip itself with the -j flag, which ignores the directories:

import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])
like image 28
Roland Smith Avatar answered Oct 05 '22 23:10

Roland Smith