Get size of a file before downloading in Python

Tags: python, urllib

I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before downloading, so that I can tell whether the file was updated on the server. Can this be done the same way as when downloading a file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall('^.*<a href="(\w+(?:\.txt|.zip)?)".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not ####

    file = f.read()
    f.close()

    f = open (fname, "w")
    f.write (file)
    f.close()

@Jon: thanks for your quick answer. It works, but the file size reported by the web server is slightly smaller than the size of the downloaded file.

Examples:

Local Size  Server Size
 2.223.533    2.115.516
   664.603      662.121

Does it have anything to do with the CR/LF conversion?

asked Aug 08 '08 13:08

PabloG

1 Answer

I have reproduced what you are seeing:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "r")
print "File on disk:",len(f.read())
f.close()

f = open("out.txt", "w")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "r")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

Outputs this:

opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16861

What am I doing wrong here? Is os.stat().st_size not returning the correct size?

Edit: OK, I figured out what the problem was:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "rb")
print "File on disk:",len(f.read())
f.close()

f = open("out.txt", "wb")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "rb")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

This outputs:

$ python test.py
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16535
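The gap between the two runs (16861 versus 16535 bytes on disk) is consistent with newline translation in text mode. Here is a minimal sketch of that effect. It assumes Python 3 rather than the Python 2 used above, `written_size` is a hypothetical helper introduced only for this illustration, and `newline="\r\n"` is passed explicitly so the translation that Windows text mode applies by default shows up on any platform.

```python
import os
import tempfile


def written_size(data, mode, **open_kwargs):
    """Write data to a temporary file and return its size on disk in bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, mode, **open_kwargs) as f:
            f.write(data)
        return os.path.getsize(path)
    finally:
        os.remove(path)


text = "first line\nsecond line\nthird line\n"

# Binary mode writes the bytes exactly as given.
binary_size = written_size(text.encode("ascii"), "wb")

# Text mode with CR/LF translation: every "\n" is written as "\r\n".
text_size = written_size(text, "w", newline="\r\n", encoding="ascii")

print("binary mode:", binary_size)
print("text mode with CR/LF translation:", text_size)
print("extra bytes:", text_size - binary_size)
```

The file written in text mode grows by one byte per line ending, which is why its size on disk no longer matches the Content-Length sent by the server.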

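Make sure you open the file in binary mode for both reading and writing. In text mode on Windows, each "\n" is written to disk as "\r\n", so the saved file ends up larger than the Content-Length reported by the server.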

# open for binary write
open(filename, "wb")

# open for binary read
open(filename, "rb")
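To apply this to the download loop in the question, the size check could look roughly like the sketch below. It is not the code from the question: it assumes Python 3 (`urllib.request` instead of Python 2's `urllib`), and the helper names `remote_size`, `needs_download` and `fetch_if_changed` are made up for this example. It sends a HEAD request so that only the headers are transferred, and it assumes the server reports a Content-Length, which not every server does.

```python
import os
import urllib.request


def remote_size(url):
    """Return the Content-Length the server reports for url, or None if absent.

    A HEAD request is used, so the file body itself is not downloaded.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def needs_download(size_on_server, local_path):
    """Return True if the local copy is missing or its size differs."""
    if size_on_server is None:          # size unknown: download to be safe
        return True
    if not os.path.exists(local_path):  # never downloaded before
        return True
    return os.path.getsize(local_path) != size_on_server


def fetch_if_changed(url, local_path):
    """Download url to local_path (binary mode) only when the sizes differ."""
    if not needs_download(remote_size(url), local_path):
        return False
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        out.write(response.read())
    return True
```

Inside the loop from the question this would be called as `fetch_if_changed(url + "/" + fname, fname)`. Keep in mind that equal sizes do not prove the content is unchanged; comparing the Last-Modified header as well would be a stricter check.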

answered Oct 15 '22 06:10

Jonathan Works
