I'm pulling HTML from web sites by sending headers that make the site think I'm just a user browsing with a normal browser, like so:
def page(goo):
    # Python 2: FancyURLopener lets us override the User-Agent header
    from urllib import FancyURLopener

    class MyOpener(FancyURLopener):
        # Masquerade as Firefox so the site serves the normal page
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    myopener = MyOpener()
    filehandle = myopener.open(goo)
    return filehandle.read()
html = page(WebSite)  # WebSite holds the URL to fetch
This works perfectly with most websites, even Google and Wikipedia, but not with Tmart.com. Somehow Tmart can tell it's not a real web browser and returns an error. How can I fix this?
They might be detecting that you don't have a JavaScript interpreter; it's hard to tell without seeing the error message you are receiving. There is one method that is almost guaranteed to work, though, and that is directly driving a real browser with Selenium WebDriver.
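To narrow it down, it can help to look at exactly what Tmart sends back. This is a minimal sketch, assuming Python 2 and urllib2 to match the question's code; the exact URL is only a placeholder, not taken from the original post:

import urllib2

url = 'http://www.tmart.com/'  # placeholder URL for illustration
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
                  'Gecko/20071127 Firefox/2.0.0.11',
}

try:
    response = urllib2.urlopen(urllib2.Request(url, headers=headers))
    print(response.getcode())   # 200 means the fake User-Agent was enough
except urllib2.HTTPError as e:
    print(e.code)               # e.g. 403 if the request is being blocked
    print(e.read()[:500])       # start of the error page body

If this returns 200 but the page is missing the content you want, the site is probably building that content with JavaScript, which is exactly the case Selenium handles.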
Selenium is normally used for functional testing of web sites, but it also works very well for scraping sites that rely on JavaScript.
from selenium import webdriver

browser = webdriver.Chrome()              # starts a real Chrome instance
browser.get('http://www.someurl.com')     # load the page, running its JavaScript
html = browser.page_source                # the HTML after scripts have run
See all the methods available on the browser object here: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webdriver.py
For this to work you will also need the chromedriver executable available (on your PATH or passed to Selenium explicitly): http://code.google.com/p/chromedriver/downloads/list
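Putting it together, here is a rough sketch of a complete fetch, assuming a Selenium version that accepts executable_path and that BeautifulSoup 3 is installed (only included because the question already imports it); the chromedriver path is a placeholder:

from selenium import webdriver
from BeautifulSoup import BeautifulSoup

# '/path/to/chromedriver' is a placeholder; point it at the downloaded
# chromedriver executable if it is not already on your PATH.
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
try:
    browser.get('http://www.tmart.com/')       # example URL from the question
    soup = BeautifulSoup(browser.page_source)  # parse the rendered HTML
    print(soup.title)                          # confirms we got real, rendered markup
finally:
    browser.quit()                             # always shut the browser down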