Emulating a browser to download a file?

Tags: python, shell, wget, mechanize

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster financed by the Chinese taxpayer, which allows us to download their content without infringing copyright.

Using wget, I can download the file from a different address, but not from the address that works in Chrome.

This is what I've tried to do:

url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&amp;playtype=1&amp;tk=163659644989925531390490125&amp;brt=2&amp;bc=0&amp;nt=0&amp;du=1496650&amp;ispid=23&amp;rc=200&amp;inf=1&amp;si=11000&amp;npc=1606&amp;pp=0&amp;ul=2&amp;mt=-1&amp;sid=10000&amp;au=0&amp;pc=0&amp;cip=222.73.44.31&amp;hf=0&amp;id=tudou&amp;itemid=135558267&amp;fi=163005294&amp;sz=59138302'  

wget -c "$url" --user-agent="" -O xfgs.f4v

This doesn't work either:

wget -c "$url" -O xfgs.f4v

The output is:

Connecting to 118.26.57.12:80... connected.  
HTTP request sent, awaiting response... 403 Forbidden  
2013-02-13 09:50:42 ERROR 403: Forbidden.

What am I doing wrong?

I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:

import mechanize  
br = mechanize.Browser()  
br.set_handle_robots(False)  
br.set_handle_equiv(False)   
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]  
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&amp;playtype=1&amp;tk=163659644989925531390490125&amp;brt=2&amp;bc=0&amp;nt=0&amp;du=1496650&amp;ispid=23&amp;rc=200&amp;inf=1&amp;si=11000&amp;npc=1606&amp;pp=0&amp;ul=2&amp;mt=-1&amp;sid=10000&amp;au=0&amp;pc=0&amp;cip=222.73.44.31&amp;hf=0&amp;id=tudou&amp;itemid=135558267&amp;fi=163005294&amp;sz=59138302' 
r = br.open(url).read()
# Open in binary mode, since the response body is video data.
tofile = open("/tmp/xfgs.f4v", "wb")
tofile.write(r)
tofile.close()

This is the result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden

Can anyone explain how to get the mechanize code to work please?


asked Feb 13 '13 02:02

showkey

2 Answers

First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.

If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.

Start up Firefox/Firebug and Chrome, and clear out all of your cookies, etc. Then open up Firebug, and in Chrome open View->Developer->Developer Tools.

Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.

Then try to download the video. Once again, take note of any cookies/headers/POST variables/query string variables that are set when the video is loaded. It is very likely that a cookie or POST variable set when you initially loaded the page is required to actually pull the video file.

When you write your Python, you are going to need to emulate this interaction as closely as possible. Use python-requests. It is probably the simplest URL library available, and unless you run into a wall with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL-fetching code shrank by a factor of 5.

Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using Python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video and, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent back to you as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
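To make the "one of these things is not like the other" comparison less error-prone, a small helper can do the diffing. This is only a sketch in Python 3: `diff_headers` is a made-up name, and the two header dictionaries are hypothetical stand-ins for what the browser's developer tools and your script each reported.

```python
def diff_headers(browser, script):
    """Compare two header dicts, ignoring the case of header names.

    Returns (missing, extra, changed):
      missing - names the browser sent but the script did not
      extra   - names the script sent but the browser did not
      changed - names both sent, mapped to (browser_value, script_value)
    """
    b = {k.lower(): v for k, v in browser.items()}
    s = {k.lower(): v for k, v in script.items()}
    missing = sorted(set(b) - set(s))
    extra = sorted(set(s) - set(b))
    changed = {k: (b[k], s[k]) for k in sorted(set(b) & set(s)) if b[k] != s[k]}
    return missing, extra, changed


# Hypothetical values, standing in for what the developer tools and the
# script each reported for the video request.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "http://example.com/watch/123",
    "Cookie": "session=abc",
}
script_headers = {
    "user-agent": "Python-urllib/3.x",
    "Accept-Encoding": "identity",
}

missing, extra, changed = diff_headers(browser_headers, script_headers)
print("missing from script:", missing)
print("only in script:", extra)
print("different values:", changed)
```

The same function works for cookies or query-string variables once they are in dictionaries. Anything that shows up under "missing" is a good first candidate for what the server is rejecting.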

If at the end of all of this you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any JavaScript in the page. Then use Firebug/Chrome Developer Tools to inspect the JavaScript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in Python.
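If the token turns out to be sitting in inline JavaScript, pulling it out can be as simple as a regular expression. The following is a rough sketch only: `extract_token`, the variable names and the page fragment are all hypothetical, and a real site may compute its token in a way that no regex will recover.

```python
import re


def extract_token(html, name):
    """Return the value assigned to `name` in inline JavaScript, or None.

    Matches forms such as  name = "value"  or  name: 'value'.
    """
    pattern = re.compile(
        r"""\b%s\b\s*[:=]\s*(["'])(.*?)\1""" % re.escape(name)
    )
    match = pattern.search(html)
    return match.group(2) if match else None


# Hypothetical page fragment; a real page will look different.
sample = '<script>var player = {itemid: "12345", tk: \'abc123\'};</script>'
print(extract_token(sample, "tk"))
```

If the value is derived (hashed, time-stamped, and so on) rather than printed into the page, you would need to read the JavaScript and reproduce that computation instead.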

Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.

Edit: Clarify steps

1. Investigate how state is being maintained.
2. Pull the initial page with Python and grab any state info you need from it.
3. Perform any tokenizing that may be required with that state info.
4. Pull the video using the tokens from steps 2 and 3.
5. If stuff blows up, output your request/response headers, cookies, query vars, and POST vars, and compare them to Chrome/Firebug.
6. Return to step 1 until you find a solution.
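The page-then-video part of these steps can be sketched roughly as follows. To keep the sketch free of dependencies it uses the Python 3 standard library (`urllib.request` with a cookie jar) rather than python-requests; the same shape carries over to a `requests.Session`. The function names, the User-Agent string, and the `page_url`/`video_url` arguments are all assumptions for illustration, and whether cookies plus a Referer are enough for this particular site is untested.

```python
import http.cookiejar
import urllib.request

# Hypothetical browser-like User-Agent; copy the real one from your browser.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_session():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, referer=None):
    """Build a GET request carrying browser-like headers."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    if referer:
        req.add_header("Referer", referer)
    return req


def download_video(page_url, video_url, dest):
    """Load the page first, then fetch the video with the same cookies."""
    opener, jar = make_session()
    # Pull the initial page so the server can set its cookies.
    with opener.open(build_request(page_url)) as page:
        page.read()
    # Any site-specific token handling would go here.
    # Same opener, so the cookies from the page load are sent automatically.
    with opener.open(build_request(video_url, referer=page_url)) as resp, \
            open(dest, "wb") as out:
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)
    return jar
```

Returning the cookie jar makes the "if stuff blows up" step easier: iterating over it shows exactly which cookies the server set, ready to compare against the browser.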

Edit: You may also be getting redirected at either one of these requests (the HTML page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders or, as other responders have suggested, Wireshark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OS X box. It is Windows-only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
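Redirects can also be made visible from the script itself. In the question, the URL names 114.80.235.200 while the wget output shows a connection to 118.26.57.12, which may well be such a redirect. Below is a minimal sketch using the Python 3 standard library; `RedirectLogger` and `open_logging_redirects` are made-up names, not part of any library.

```python
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember each hop."""

    def __init__(self):
        super().__init__()
        self.hops = []  # list of (status_code, from_url, to_url)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the hop, then let the standard handler build the next request.
        self.hops.append((code, req.full_url, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def open_logging_redirects(url):
    """Open url and return (response, hops)."""
    logger = RedirectLogger()
    # Passing an instance replaces the default redirect handler.
    opener = urllib.request.build_opener(logger)
    return opener.open(url), logger.hops
```

If the request fails at the final hop (for example with a 403), `opener.open` raises before the function returns, so in that case keep a reference to the `RedirectLogger` instance yourself and inspect its `hops` in an `except` block.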

I love this kind of problem

answered Sep 21 '22 20:09

G. Shearer

It seems that mechanize can do stateful browsing, meaning that it keeps context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think that a full (legitimate) browsing session is in progress.
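In mechanize terms, that suggestion might look roughly like the sketch below. `download_with_session`, `page_url` and the User-Agent string are assumptions; the question does not say which page embeds the video, and it is untested whether this is enough to get past the 403.

```python
def download_with_session(page_url, video_url, dest):
    """Load the page, then the video, in one mechanize session."""
    # Third-party package; imported here so the sketch loads without it.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)
    # Hypothetical User-Agent; copy the real one from your browser.
    br.addheaders = [("User-agent", "Mozilla/5.0 (X11; Linux x86_64)")]

    # First request: the page that embeds the video. Any cookies the
    # server sets are stored in the Browser's cookie jar.
    br.open(page_url).read()

    # Second request: same Browser object, so those cookies are sent again.
    response = br.open(video_url)
    with open(dest, "wb") as out:
        out.write(response.read())
```

The important detail is that both requests go through the same `Browser` object; creating a fresh one for the video request would throw the session state away.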

answered Sep 21 '22 20:09

Eric
