I have a weird error. There's a file on Dropbox which I'm downloading with the following Python code:
import requests
import shutil
url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)
This downloads to /tmp/data.dload-1.
I downloaded the same file with wget:
wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2
These two files have the same type:
(dl)x:x$ file /tmp/data.dload-1
/tmp/data.dload-1: gzip compressed data, from Unix
(dl)x:x$ file /tmp/data.dload-2
/tmp/data.dload-2: gzip compressed data, last modified: Thu Apr 26 23:05:15 2018, from Unix
But untarring them produces different results:
(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$
Does anybody have any idea why this might happen and, more importantly, how I can download that tar file with Python (preferably with requests)?
This is the result of printing r.headers:
(dl) x:x$ python dload-test.py
{'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 'Referrer-Policy': 'origin-when-cross-origin', 'Set-Cookie': 'locale=en; 
Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}
The problem is that the response is being gzip-compressed in transit, even though the file is already a gzipped file (as can be seen from the 'Content-Encoding': 'gzip' field in r.headers).
You're using the default request headers, for both requests and wget. Both of them will, by default, send something like 'Accept-Encoding: gzip, deflate'. (You can see this if you print out r.request.headers.) So the server can easily gzip the file and send it back with a 'Content-Encoding: gzip' header.
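As a quick offline check of what requests advertises, here is a minimal sketch that prints the default Accept-Encoding value without contacting any server. It uses requests.utils.default_headers(); the exact string depends on your requests version and on which optional decompression libraries are installed.

```python
import requests

# default_headers() returns the headers requests attaches to a request
# when you do not override them yourself.
defaults = requests.utils.default_headers()

# Typically lists gzip and deflate; other codings may appear as well.
print(defaults["Accept-Encoding"])
```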
Both wget and requests will, by default, detect that header and transparently decode the data for you, but by reading r.raw you've explicitly told requests not to do that, and to read the raw data as-is.
So you end up saving a file which is a gzip-compressed gzip-compressed tarball. Obviously, file will report that as gzip compressed data, and tar -z will report that what's inside the gzip does not look like a tar archive, because it isn't: it's a gzipped tar archive.
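The double-compression effect can be reproduced without any network access. The sketch below uses only the standard library; the helper names make_tar_gz and list_members and the in-memory archive are made up for illustration, standing in for the real Dropbox file.

```python
import gzip
import io
import tarfile

def make_tar_gz():
    """Build a tiny .tar.gz in memory containing testfiles/a (hypothetical data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo(name="testfiles/a")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_members(blob):
    """Return member names, treating blob as a gzipped tar (like tar -z)."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()

original = make_tar_gz()                # the file as stored on the server
on_the_wire = gzip.compress(original)   # the body after Content-Encoding: gzip

# Both blobs start with the gzip magic bytes, so both look like gzip data.
print(original[:2] == b"\x1f\x8b", on_the_wire[:2] == b"\x1f\x8b")

print(list_members(original))
try:
    list_members(on_the_wire)
except tarfile.ReadError as exc:
    # One gunzip pass yields another gzip stream, not a tar header.
    print("not a tar archive:", exc)

# Removing the extra gzip layer recovers the real archive.
print(list_members(gzip.decompress(on_the_wire)))
```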
The smallest fix here is to manually add headers={'Accept-Encoding': 'identity'} to your request.
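Applied to the code from the question, that fix might look like the sketch below. The function names download_identity and download_decoded are made up for illustration. The second function shows an alternative that is not part of the fix above: it keeps the default headers and sets decode_content on the underlying urllib3 response so that the Content-Encoding is undone while r.raw is read.

```python
import shutil

import requests

def download_identity(url, path):
    """Save url to path, asking the server not to apply a content-coding."""
    headers = {"Accept-Encoding": "identity"}
    with requests.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            # r.raw is safe to copy as-is if the server honours the header.
            shutil.copyfileobj(r.raw, f)
    return path

def download_decoded(url, path):
    """Alternative: keep the default headers, but let urllib3 undo the
    Content-Encoding while r.raw is being read."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        r.raw.decode_content = True
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f)
    return path
```

Either function can be called with the url and path_to_save values from the question, for example download_identity(url, path_to_save).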
You may wonder why the server is bothering to gzip-compress a gzipped file. Just because you told it you can accept gzip doesn't mean you're demanding gzip, right?
If you look at RFC 2616 and RFC 7231, the server is supposed to pick the encoding with the highest qvalue (weight) as specified by the client that it can support (breaking ties according to some heuristic that isn't specified). If your user agent explicitly asks for 'gzip, deflate', giving you identity would be incorrect unless it's actually impossible to do otherwise, not slightly silly.
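To make the qvalue selection described above concrete, here is a simplified sketch of how a server might choose a coding. It is not what Dropbox or any particular server actually runs: the names parse_accept_encoding and pick_encoding are hypothetical, and it ignores the "*" wildcard and the special rules for identity.

```python
def parse_accept_encoding(header):
    """Return {coding: qvalue} for an Accept-Encoding header value."""
    weights = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # a coding listed without a q parameter has weight 1
        params = params.strip()
        if params.startswith("q="):
            q = float(params[2:])
        weights[coding.strip().lower()] = q
    return weights

def pick_encoding(header, supported):
    """Pick the supported coding with the highest nonzero qvalue.

    Ties go to whichever coding comes first in `supported`; that is this
    sketch's arbitrary stand-in for the unspecified tie-breaking heuristic.
    Falls back to identity when nothing acceptable is supported.
    """
    weights = parse_accept_encoding(header)
    candidates = [c for c in supported if weights.get(c, 0.0) > 0.0]
    if not candidates:
        return "identity"
    return max(candidates, key=lambda c: weights[c])

print(pick_encoding("gzip, deflate", ["gzip"]))
print(pick_encoding("identity", ["gzip"]))
```

Under this model, a client that sends 'gzip, deflate' gets gzip from a gzip-capable server, while a client that sends only 'identity' gets the body untouched.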