Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese tax payer, which allows us to download their content without infringing copyrights.

Using wget, I can download the file from a different address, but not from the address that works in Chrome.

This is what I've tried to do:

url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'  

wget -c  $url --user-agent="" -O  xfgs.f4v

This doesn't work either:

wget -c  $url   -O  xfgs.f4v

The output is:

Connecting to 118.26.57.12:80... connected.  
HTTP request sent, awaiting response... 403 Forbidden  
2013-02-13 09:50:42 ERROR 403: Forbidden.  

What am I doing wrong?

I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:

import mechanize  
br = mechanize.Browser()  
br = mechanize.Browser()  
br.set_handle_robots(False)  
br.set_handle_equiv(False)   
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]  
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302' 
r = br.open(url).read()  
tofile=open("/tmp/xfgs.f4v","w")  
tofile.write(r)  
tofile.close()

This is the result:

Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>  
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open  
   return self._mech_open(url, data, timeout=timeout)  
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open  
raise response  
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden

Can anyone explain how to get the mechanize code to work please?

like image 514
showkey Avatar asked Feb 13 '13 02:02

showkey


People also ask

How do I download a file instead of open in browser?

Click on "Settings" and you'll see a new page pop up in your Chrome browser window. Scroll down to Advanced Settings, click Downloads, and clear your Auto Open options. Next time you download an item, it will be saved instead of opened automatically.

How can I download directly from a website?

Download directly to Google Drive from URL To download files directly to Google Drive, click on Google Drive and select “Remote Upload” in the dropdown list. Next, in the pop-up window, enter the link address. Click “Save to Cloud” to continue. Wait for it to complete.


2 Answers

First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.

If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.

Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.

Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.

Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.

When you write your python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall somehow with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrunk by a factor of 5x.

Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.

If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.

Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.

Edit: Clarify steps

  1. Investigate how state is being maintained
  2. Pull initial page with python, grab any state info you need from it
  3. Perform any tokenizing that may be required with that state info
  4. Pull the video using the tokens from steps 2 and 3
  5. If stuff blows up, output your request/response headers,cookies,query vars, post vars, and compare them to Chrome/Firebug
  6. Return to step 1. until you find a solution

Edit: You may also be getting redirected at either one of these requests (the html page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or like has been suggested by other responders, WireShark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box. It is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.

I love this kind of problem

like image 82
G. Shearer Avatar answered Sep 21 '22 20:09

G. Shearer


It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest to first load the complete page where the video is located, then do a second try to download the video explicitly. That way, the web server will think that it is a full (legit) browsing session ongoing

like image 37
Eric Avatar answered Sep 21 '22 20:09

Eric