python urllib2 - wait for page to finish loading/redirecting before scraping?

Question

I'm learning to write web scrapers and want to scrape TripAdvisor for a personal project, fetching the HTML with urllib2. With the code below, however, the HTML I get back is not what I see in the browser: the page seems to take a second to redirect (you can verify this by visiting the URL), and I get the markup of the page that appears briefly at first.

Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?

import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()

Edit: The answer is thorough, however, in the end what solved my problem was this: https://stackoverflow.com/a/3210737/1157283

Samy Vilar · Accepted Answer

Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, but urllib2 doesn't have a JS engine; it just GETs the data. If you disable JavaScript in your browser, you will see that it loads basically the same content as what urllib2 returns.

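import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# urllib2 performs a plain HTTP GET and never runs the page's JavaScript.
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
# Save the parsed markup so it can be compared with the browser's no-JavaScript view.
open('test.html', 'w').write(soup.prettify())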

Opening test.html and loading the page with JS disabled in your browser (easiest in Firefox: Content -> uncheck "Enable JavaScript") give identical result sets.
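One more detail that points the same way: everything after the # in that URL is a fragment. Under the URL and HTTP specs the fragment is not part of the request that goes to the server; it stays on the client, where the page's scripts can read it. Here is a small sketch that splits the URL to show which part the server is asked for and which part is left for JavaScript. The helper name split_request is made up for illustration, and the import fallback is only there so it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def split_request(url):
    """Return (request_target, fragment) for url.

    request_target is the path plus query string that an HTTP request asks for;
    fragment is the client-side part after '#'. (Hypothetical helper, for
    illustration only.)
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target, parts.fragment


url = ("http://www.tripadvisor.com/HACSearch?geo=34438"
       "#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
request_target, fragment = split_request(url)
print(request_target)  # /HACSearch?geo=34438
print(fragment)        # 02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6
```

So the filters in the fragment (rad, sponsors, style) can only take effect if something runs the page's JavaScript.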

So what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/hotel APIs? It looks like they might offer them, though with some restrictions.

But if we still need to scrape it with JS enabled, then we can use Selenium (http://seleniumhq.org/). It's mainly used for testing, but it's easy to use and has fairly good docs.
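As a rough sketch of the Selenium route (not tested against TripAdvisor itself): let a real browser load the page, give the scripts a chance to run, then read the resulting HTML from driver.page_source. The function names render_page and render_with_firefox, the is_ready hook, and the timeout values are my own assumptions; get(), page_source, quit() and webdriver.Firefox() are the usual Selenium Python calls. Running it for real needs the selenium package plus Firefox and its driver installed.

```python
import time


def render_page(driver, url, is_ready=None, timeout=10.0, poll=0.5):
    """Load url in a WebDriver-like object and return the HTML after scripts ran.

    driver only needs get(), page_source and quit(), which Selenium drivers
    provide. is_ready is an optional callable(driver) -> bool (a hypothetical
    hook) for pages that keep changing after the initial load.
    """
    try:
        driver.get(url)  # with Selenium's defaults this waits for the page load
        deadline = time.time() + timeout
        while is_ready is not None and not is_ready(driver):
            if time.time() > deadline:
                raise RuntimeError("page not ready after %s seconds" % timeout)
            time.sleep(poll)
        return driver.page_source  # the DOM as the browser currently has it
    finally:
        driver.quit()  # always close the browser, even on errors


def render_with_firefox(url):
    """Convenience wrapper that starts a real Firefox via Selenium."""
    # Imported here so render_page can be used without Selenium installed.
    from selenium import webdriver
    return render_page(webdriver.Firefox(), url)
```

The returned string can then go to BeautifulSoup just like the urllib2 response did, for example soup = BeautifulSoup(render_with_firefox(url)).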

I also found the question "Scraping websites with Javascript enabled?" and this post about Crowbar: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Hope that helps.

As a side note:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)
