
Issue scraping with Beautiful Soup

I've scraped websites before using this same technique, but it doesn't seem to work with this one.

import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
page=urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
print soup

The output should be the content of the webpage, but instead I'm just getting this:

GIF89a (followed by some symbols I can't copy here)

Any ideas what the problem is and how I should proceed?

Julio asked Oct 06 '22 18:10



1 Answer

but I want to know why I am getting a GIF when accessing the URL like that, while when I access it via my browser I get the website perfectly.

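Because the site's operators don't want it accessed from outside a web browser, the server apparently answers requests that don't identify themselves as a browser with a GIF instead of the page. What you need to do is pose as a known browser by adding a User-agent header to the request. Here is a modified example that will work: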

>>> import urllib2
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
>>> response = opener.open(url)
>>> page = response.read()
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(page)
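
For anyone on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is only a rough equivalent under assumptions, not a tested fix for this particular site: the helper names build_request, looks_like_gif and fetch are made up for illustration, and the GIF check is just a convenience for spotting the symptom described in the question.

```python
import urllib.request

# Signatures that every GIF file starts with (GIF87a or GIF89a).
GIF_MAGIC = (b"GIF87a", b"GIF89a")


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_gif(body):
    """Return True if the raw response body starts with a GIF signature."""
    return body.startswith(GIF_MAGIC)


def fetch(url):
    """Download the page and fail loudly if the server still sends a GIF."""
    with urllib.request.urlopen(build_request(url)) as response:
        body = response.read()
    if looks_like_gif(body):
        raise ValueError("server returned a GIF instead of HTML")
    return body
```

The bytes returned by fetch(url) can then be handed to BeautifulSoup as in the example above. If the site checks more than the User-agent header, this will not be enough on its own.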
Abhijit answered Oct 10 '22 23:10
