
Selenium WebDriver / BeautifulSoup + Web Scraping + Error 416

I'm doing web scraping using Selenium WebDriver in Python with a proxy.

I want to browse more than 10k pages of a single site with this scraper.

The issue is that through this proxy I can send a request only once. When I send another request to the same link, or to another link on the same site, I get a 416 error (the IP appears to be blocked by a firewall) for 1-2 hours.

Note: I'm able to scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping it.

Here is the code:

import time
from selenium import webdriver

# Route Firefox traffic through the proxy (type 1 = manual proxy configuration)
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()

browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)  # give the page time to load

# Collect the matching links and print their targets
elements = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
for ele in elements:
    print ele.get_attribute('href')
browser.quit()

Any solutions?

Mitul Shah asked Sep 23 '15

2 Answers

Selenium wasn't helpful for me, so I solved the problem using BeautifulSoup. The website blocks a proxy as soon as it receives a request through it, so I keep changing the proxy URL and User-Agent whenever the server blocks the one currently in use.

I'm pasting my code here:

from bs4 import BeautifulSoup
import urllib2

url = 'http://terriblewebsite.com/'

# Route the request through a proxy
proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})

# Create a URL opener that uses the proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# Spoof a real browser's User-Agent so the request looks less automated
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)
data = result.read()

# Parse the page and extract the target element
soup = BeautifulSoup(data, 'html.parser')
ptag = soup.find('p', {'class': 'text-primary'}).text
print ptag

Note:

  1. Change the proxy and User-Agent frequently, and use only recently updated proxies.

  2. Some servers accept proxies only from a specific country; in my case I used proxies from the United States.

This process might be slow, but you can still scrape the data; see the rotation sketch below.
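Here is a minimal sketch of the rotation idea using the requests library; the proxy addresses, the User-Agent strings, and the fetch_with_rotation helper are placeholders of my own, not part of the original code:

import random
import requests

# Placeholder pools -- substitute fresh, working proxies and realistic UA strings
PROXIES = ['130.0.89.75:8080', '74.73.148.42:3128']
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0',
]

def fetch_with_rotation(url, retries=5):
    # Try a fresh proxy / User-Agent pair on each attempt until one gets through
    for _ in range(retries):
        proxy = random.choice(PROXIES)
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        try:
            resp = requests.get(url, headers=headers,
                                proxies={'http': 'http://' + proxy},
                                timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass  # proxy is dead or blocked; fall through and rotate
    raise RuntimeError('all attempts were blocked for ' + url)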

Mitul Shah answered Nov 18 '22


Going through the 416 error discussions in the links below, it seems that some cached information (cookies, maybe) is causing the issue: the first request succeeds, and subsequent requests fail.

https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error

416 Requested Range Not Satisfiable
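To confirm the pattern, a quick check (a sketch; the URL here is a placeholder for the affected site) is to issue the same request twice and compare status codes:

import requests

url = 'http://www.example.com/'  # placeholder for the affected site
for attempt in range(2):
    resp = requests.get(url)
    print(resp.status_code)  # expect 200 first, then 416 once the server starts blocking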

Try not saving cookies at all by setting a preference, or delete the cookies after every request:

profile.set_preference("network.cookie.cookieBehavior", 2);
Sighil answered Nov 18 '22