Unable to pull HTML from website

I'm pulling HTML from websites by sending headers that make the site think I'm just a user browsing with a normal browser, like so:

from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # Spoof a browser user-agent string so the site thinks a browser is asking
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

def page(url):
    # Fetch the URL with the spoofed user agent and return the raw HTML
    opener = MyOpener()
    filehandle = opener.open(url)
    return filehandle.read()

html = page('http://www.tmart.com')

This works perfectly with most websites, including Google and Wikipedia, but not with Tmart.com. Somehow Tmart can tell it isn't a real browser and returns an error. How can I fix this?

asked by user1849106

1 Answer

They might be detecting that you don't have a JavaScript interpreter; it's hard to tell without seeing the error message you're receiving. There is one method that is practically guaranteed to work, though: directly driving a real browser with Selenium WebDriver.

Selenium is normally used for functional testing of websites, but it also works very well for scraping sites that rely on JavaScript.

from selenium import webdriver

# Launch a real Chrome browser, load the page, and grab the fully rendered HTML
browser = webdriver.Chrome()
browser.get('http://www.someurl.com')

html = browser.page_source
browser.quit()
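
Since you're already using BeautifulSoup in your script, you can feed the rendered source straight into it. A minimal sketch, using the BeautifulSoup 3 import style from your question:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

soup = BeautifulSoup(html)
print soup.title.string  # for example, print the page <title>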

See all the methods available on the browser object here: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webdriver.py. For this to work you will also need to have the chromedriver executable available: http://code.google.com/p/chromedriver/downloads/list
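
If chromedriver isn't on your PATH, the Selenium bindings of this era let you point the driver at the executable explicitly. A minimal sketch, assuming the binary lives at /usr/local/bin/chromedriver (adjust the path to wherever you unpacked it):

from selenium import webdriver

# The path below is an assumption; use your actual chromedriver location
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
browser.get('http://www.tmart.com')
html = browser.page_source
browser.quit()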

answered by aychedee