I am using Python to scrape pages. Until now I haven't had any complicated issues.
The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (new IP, not used before, different C block). I have tried spoofing headers and randomizing the time between requests, with the same result.
I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I have also tried to randomize requests (waiting 3-5 seconds between them, and making a time.sleep(300) call on every 300th request). Despite that, I'm still getting banned.
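For reference, the pacing described above can be sketched roughly as follows. This is only an illustration of the schedule in the question (3-5 seconds between requests, a 300-second pause on every 300th request); the function name `next_delay` and its parameters are hypothetical, not taken from the original script.

```python
import random


def next_delay(request_number, rng=random, short_range=(3.0, 5.0),
               long_pause=300.0, long_every=300):
    """Return how many seconds to wait after the given 1-based request.

    Hypothetical helper: every `long_every`-th request gets the long pause,
    all other requests get a random wait drawn from `short_range`.
    """
    if request_number % long_every == 0:
        return long_pause
    return rng.uniform(*short_range)


# Show the planned waits for a few request numbers without actually sleeping.
for n in (1, 2, 299, 300, 301):
    print(n, round(next_delay(n), 2))
```

In the scraping loop this would be used as `time.sleep(next_delay(n))` after the n-th request.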
From this I conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.
Based on your experience, what else should I try? Will closing and reopening the browser in Selenium help (for example, closing and reopening it after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
If you switch to the Scrapy web-scraping framework, you will be able to reuse a number of things that were made to prevent and tackle banning:
AutoThrottle: an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.
scrapy-fake-useragent middleware: uses a random User-Agent provided by fake-useragent for every request.
Rotating IP addresses: scrapy-proxies. You can also run it via a local proxy and Tor.
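As a rough sketch of the throttling part, a Scrapy project's settings.py might contain something like the following. The setting names are Scrapy's built-in ones; the values are illustrative assumptions only and are not tuned for any particular site. The third-party middlewares listed above have their own configuration, which is not shown here.

```python
# settings.py (sketch): the values below are assumptions, adjust them per site.

# Let Scrapy adapt the delay to how quickly the site responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for one request in flight

# Baseline delay; with RANDOMIZE_DOWNLOAD_DELAY the actual wait varies
# between 0.5x and 1.5x of DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 3.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep the crawl to one request at a time per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```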
I had this problem too. I used urllib with Tor in Python 3.
Open a terminal and type:
curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com
If you see a result, it worked. Now do the same from Python:
import socket
from urllib.request import Request, urlopen

import socks  # provided by the PySocks package
from bs4 import BeautifulSoup

# Route every new socket through Tor's local SOCKS5 proxy
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

# Fetch the Tor check page and print its <title>
req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())
If you see "Congratulations. This browser is configured to use Tor." then it worked in Python too, which means you are using Tor for web scraping.
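Since the question uses Requests rather than urllib, here is a minimal sketch of the same idea with Requests. It assumes Tor is listening on localhost:9050 as above and that Requests was installed with SOCKS support (`pip install requests[socks]`). The helper names `tor_proxies` and `fetch_via_tor` are made up for this example.

```python
def tor_proxies(host="localhost", port=9050):
    """Build a Requests-style proxies mapping that points at a SOCKS5 proxy."""
    # The socks5h scheme (rather than socks5) asks the proxy to resolve
    # hostnames too, so DNS lookups also go through Tor.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=30):
    """Fetch a page through the local Tor proxy and return its text."""
    # Imported here so tor_proxies() can be used without Requests installed.
    import requests

    response = requests.get(
        url,
        proxies=tor_proxies(),
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text
```

Calling `fetch_via_tor("http://check.torproject.org")` should return the same check page as the urllib version, provided Tor is running.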