I'm scraping a website using Selenium WebDriver in Python through a proxy.
I want to browse more than 10k pages of a single site this way.
The issue is that with this proxy I can send only a single request. When I send another request to the same link or to another link on this site, I get a 416 error (apparently a firewall blocking the IP) for 1-2 hours.
Note: I can scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping it.
Here is the code.
import time
from selenium import webdriver

# Route HTTP traffic through a manually configured proxy (type 1 = manual)
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()
browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)  # give the page time to load
elements = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
# Print the target of every matching link
for ele in elements:
    print(ele.get_attribute('href'))
browser.quit()
Is there any solution?
Selenium wasn't helpful for me, so I solved the problem with BeautifulSoup. The website uses security that blocks a proxy once it receives requests from it, so I keep changing the proxy URL and the User-Agent whenever the server blocks the proxy I was using.
I'm pasting my code here.
from bs4 import BeautifulSoup
import urllib2

url = 'http://terriblewebsite.com/'
proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})
# Create a URL opener that uses the proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)
data = result.read()
soup = BeautifulSoup(data, 'html.parser')
# attrs must be a dict; this matches <p class="text-primary">
ptag = soup.find('p', {'class': 'text-primary'}).text
print(ptag)
Notes:
Change the proxy and the User-Agent when you get blocked, and use only recently updated proxies.
Some servers accept proxies only from specific countries. In my case I used proxies from the United States.
This process might be slow, but you can still scrape the data.
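The rotation described above could be sketched roughly as follows. This is not the code I actually ran, only a minimal illustration: it assumes Python 3 (where `urllib.request` replaces `urllib2`), and `fetch_via_proxy`, `fetch_with_rotation`, `max_attempts` and the second User-Agent string are hypothetical names and placeholder values made up for the example.

```python
import urllib.error
import urllib.request

# Placeholder values: substitute proxies and User-Agent strings that work for you.
PROXIES = ['130.0.89.75:8080', '74.73.148.42:3128']
USER_AGENTS = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',
]


def fetch_via_proxy(url, proxy, user_agent, timeout=30):
    """Fetch url through a single HTTP proxy with the given User-Agent."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    return opener.open(request, timeout=timeout).read()


def fetch_with_rotation(url, proxies, user_agents,
                        fetch=fetch_via_proxy, max_attempts=5):
    """Try proxy/User-Agent pairs in turn until one request succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        # Move on to the next proxy and User-Agent on every attempt.
        proxy = proxies[attempt % len(proxies)]
        user_agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch(url, proxy, user_agent)
        except OSError as error:
            # URLError and HTTPError (e.g. a 416 response) are OSError subclasses.
            last_error = error
    raise RuntimeError('all %d attempts failed: %s' % (max_attempts, last_error))
```

It would be called as `data = fetch_with_rotation(url, PROXIES, USER_AGENTS)`, and the result passed to BeautifulSoup as before. Passing the fetch function as a parameter is only a convenience so the rotation logic can be tried without touching the network.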
Going through discussions of the 416 error such as the one linked below, it seems that some cached information (cookies, maybe) is causing the issue: your first request succeeds and the subsequent requests fail.
https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error (416 Requested Range Not Satisfiable)
Try either not saving cookies at all by setting a preference, or deleting the cookies after every request.
profile.set_preference("network.cookie.cookieBehavior", 2)  # 2 = block all cookies
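For the second option, deleting cookies after every request, a minimal sketch might look like this. It relies on `delete_all_cookies()`, which Selenium WebDriver objects provide; the helper name `fetch_page_then_clear_cookies` is made up for illustration, and whether clearing cookies actually avoids the 416 on this site is untested.

```python
def fetch_page_then_clear_cookies(browser, url):
    """Load url in the given WebDriver, then drop the cookies the site set."""
    browser.get(url)
    html = browser.page_source
    # Clear cookies for the current domain so the next request starts clean.
    browser.delete_all_cookies()
    return html
```

With the `browser` object from the question, this would be used as `html = fetch_page_then_clear_cookies(browser, 'http://www.example.com/')` for each page.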