
Selenium WebDriver / BeautifulSoup + Web Scraping + Error 416

I'm doing web scraping using Selenium WebDriver in Python with a proxy.

I want to browse more than 10k pages of a single site with this scraper.

The issue is that through this proxy I can send a request only once. When I send another request to the same link, or to another link on the same site, I get a 416 error (the IP appears to be blocked by a firewall) for 1-2 hours.

Note: I'm able to scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping it.

Here is the code:

import time
from selenium import webdriver

# Route Firefox traffic through the proxy (type 1 = manual proxy configuration)
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()

browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)  # give the page time to load

# Collect the matching links and print their targets
elements = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
for ele in elements:
    print ele.get_attribute('href')
browser.quit()

Any solutions?

Mitul Shah asked Sep 23 '15

2 Answers

Selenium wasn't helpful for me, so I solved the problem using BeautifulSoup. The website blocks a proxy as soon as it receives a request through it, so I keep changing the proxy URL and User-Agent whenever the server blocks the one currently in use.

I'm pasting my code here:

from bs4 import BeautifulSoup
import urllib2

url = 'http://terriblewebsite.com/'

# Route the request through a proxy
proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})

# Create a URL opener that uses the proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# Spoof a real browser's User-Agent so the request looks less automated
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)
data = result.read()

# Parse the page and extract the target element
soup = BeautifulSoup(data, 'html.parser')
ptag = soup.find('p', {'class': 'text-primary'}).text
print ptag

Note:

  1. Change the proxy and User-Agent frequently, and use only recently updated proxies.

  2. Some servers accept proxies only from a specific country; in my case I used proxies from the United States.

This process might be slow, but you can still scrape the data; see the rotation sketch below.
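Here is a minimal sketch of the rotation idea using the requests library; the proxy addresses, the User-Agent strings, and the fetch_with_rotation helper are placeholders of my own, not part of the original code:

import random
import requests

# Placeholder pools -- substitute fresh, working proxies and realistic UA strings
PROXIES = ['130.0.89.75:8080', '74.73.148.42:3128']
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0',
]

def fetch_with_rotation(url, retries=5):
    # Try a fresh proxy / User-Agent pair on each attempt until one gets through
    for _ in range(retries):
        proxy = random.choice(PROXIES)
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        try:
            resp = requests.get(url, headers=headers,
                                proxies={'http': 'http://' + proxy},
                                timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass  # proxy is dead or blocked; fall through and rotate
    raise RuntimeError('all attempts were blocked for ' + url)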

Mitul Shah answered Nov 18 '22


Going through the 416 error discussions in the links below, it seems that some cached information (cookies, maybe) is causing the issue: the first request succeeds, and subsequent requests fail.

https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error

416 Requested Range Not Satisfiable
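To confirm the pattern, a quick check (a sketch; the URL here is a placeholder for the affected site) is to issue the same request twice and compare status codes:

import requests

url = 'http://www.example.com/'  # placeholder for the affected site
for attempt in range(2):
    resp = requests.get(url)
    print(resp.status_code)  # expect 200 first, then 416 once the server starts blocking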

Try not saving cookies at all by setting a preference, or delete the cookies after every request:

profile.set_preference("network.cookie.cookieBehavior", 2);
Sighil answered Nov 18 '22