Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

difference between wget and requests.get with respect to file download

I have an weird error. There's a file on dropbox which i'm downloading with the following python code:

import requests
import shutil

url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)  

this downloads to /tmp/data.dload-1.

same file downloaded with wget wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2

these two files have the same type:

(dl)x:x$ file /tmp/data.dload-1 
/tmp/data.dload-1: gzip compressed data, from Unix
(dl)x:x$ file /tmp/data.dload-2 
/tmp/data.dload-2: gzip compressed data, last modified: Thu Apr 26 23:05:15 2018, from Unix

but un-taring them produces different results:

(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$ 

anybody has any idea why this might happen and more importantly how can i download that tar file with Python (preferably requests)

This is the result from r.headers: (dl) x:x$ python dload-test.py {'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 'Referrer-Policy': 'origin-when-cross-origin', 'Set-Cookie': 'locale=en; Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}

like image 394
src Avatar asked Dec 18 '22 23:12

src


1 Answers

The problem that the file is being gzip-compressed, even though it's already a gzipped file (as can be seen from the 'Content-Encoding': 'gzip' field in r.headers).

You're using the default request headers, for both requests and wget. Both of them will, by default, send something like 'Accept-Encoding: gzip, deflate'. (You can see this if you print out r.request.headers.) So the server can easily gzip the file and send it back with a 'Content-Encoding: gzip' header.

Both wget and requests will, by default, detect that header and transparently decode the data for you—but you've explicitly told requests not to do that, and read the raw data as-is.

So you end up saving a file which is a gzip-compressed-gzip-compressed-tarball. Obviously, file will report that as gzip compressed data, and tar -z will report that what's inside the gzip does not look like a tar archive, because it isn't, it's a gzipped tar archive.

The smallest fix here is to manually add headers={'Accept-Encoding': 'identity'} to your request.


You may wonder why the server is bothering to gzip-compress a gzipped file—just because you told it you can accept gzip doesn't mean you're demanding gzip, right?

If you look at RFC 2616 and RFC 7231, the server is supposed to pick the encoding with the highest qvalue (weight) as specified by the client that it can support (breaking ties according to some heuristic that isn't specified). If your user agent explicitly asks for 'gzip, deflate', giving you identity would be incorrect unless it's actually impossible to do otherwise, not slightly silly.

like image 68
abarnert Avatar answered Dec 28 '22 08:12

abarnert