Scraping in Python - Preventing IP ban

I am using Python to scrape pages. Until now I haven't had any complicated issues.

The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.

Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even get banned on the first request (new IP, never used before, different C block). I have tried spoofing headers and randomizing the time between requests, but the result is the same.
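
Roughly what that attempt looks like (the URLs and User-Agent strings below are just placeholders):

import random
import time

import requests
from lxml import html

# placeholder User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)',
]

def fetch(url):
    # spoof a random User-Agent on every request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return html.fromstring(response.content)

for url in ['http://example.com/page/1', 'http://example.com/page/2']:
    tree = fetch(url)
    print(tree.findtext('.//title'))
    time.sleep(random.uniform(3, 5))  # randomized delay between requests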

I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I also randomized the delay between requests (3-5 seconds) and called time.sleep(300) on every 300th request. Despite that, I'm still getting banned.

From this I conclude that the site has some mechanism that bans an IP once it has requested more than X pages in one open browser session, or something like that.

Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, closing and reopening it after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
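
The restart idea would look roughly like this (a sketch only; Chrome and placeholder URLs assumed):

import random
import time

from selenium import webdriver

urls = ['http://example.com/page/%d' % n for n in range(1, 501)]  # placeholder URLs
RESTART_EVERY = 100  # close and reopen the browser after this many requests

def new_browser():
    # a fresh browser starts with clean cookies and session storage
    return webdriver.Chrome()

driver = new_browser()
for i, url in enumerate(urls, start=1):
    driver.get(url)
    # ... extract data from driver.page_source here ...
    time.sleep(random.uniform(3, 5))  # randomized delay between requests
    if i % RESTART_EVERY == 0:
        driver.quit()
        driver = new_browser()
driver.quit()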

asked Feb 01 '16 by RhymeGuy

2 Answers

If you switch to the Scrapy web-scraping framework, you will be able to reuse a number of things that were made to prevent and tackle banning (a combined settings sketch follows this list):

  • the built-in AutoThrottle extension:

This is an extension for automatically throttling crawling speed based on load of both the Scrapy server and the website you are crawling.

  • rotating user agents with scrapy-fake-useragent middleware:

Use a random User-Agent provided by fake-useragent for every request

  • rotating IP addresses:

    • Setting Scrapy proxy middleware to rotate on each request
    • scrapy-proxies
  • you can also run it via a local proxy & Tor:

    • Scrapy: Run Using TOR and Multiple Agents
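
For reference, a rough settings.py sketch that wires these pieces together (the middleware paths and priorities follow the linked projects' documentation and may differ between versions; the proxy list path is a placeholder):

# settings.py (sketch)

# AutoThrottle: adapt crawl speed to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# base delay between requests, randomized by Scrapy between 0.5x and 1.5x
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True

# retry responses that typically indicate throttling or a ban
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 408]

DOWNLOADER_MIDDLEWARES = {
    # scrapy-fake-useragent: random User-Agent on every request
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # scrapy-proxies: pick a proxy from PROXY_LIST for every request
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/path/to/proxy/list.txt'  # placeholder path
PROXY_MODE = 0  # 0 = pick a random proxy for every request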
answered by alecxe

I had this problem too. I used urllib with Tor in Python 3.

  1. Download and install the Tor Browser (or the standalone tor service)
  2. Test Tor

Open a terminal and run:

curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

(Use port 9150 instead of 9050 if you are running the Tor Browser rather than the tor service.)

If you see the page content, it worked.

  3. Test it from Python by running this code:

# requires PySocks (pip install pysocks) and beautifulsoup4
import socks
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# route all sockets through Tor's SOCKS5 proxy
# (9050 is the default port for the tor service; the Tor Browser listens on 9150)
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

# fetch the Tor check page and print its <title>
req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())

If you see

Congratulations. This browser is configured to use Tor.

then it works from Python as well, and your scraping traffic is going through Tor.
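
If a single Tor exit node still gets blocked, you can ask Tor for a new circuit (and usually a new exit IP) through its control port. Below is a rough sketch using the stem library; it assumes ControlPort 9051 is enabled in your torrc with a control password set (the password here is a placeholder):

import time
import socket

import socks
from stem import Signal
from stem.control import Controller
from urllib.request import Request, urlopen

PLAIN_SOCKET = socket.socket  # keep the unproxied socket class for the control port

def renew_tor_ip(password='my_password'):  # placeholder; must match your torrc control password
    # the control connection must NOT go through the SOCKS proxy
    socket.socket = PLAIN_SOCKET
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)  # request a new circuit
        time.sleep(10)  # give tor a moment to build the new circuit
    finally:
        socket.socket = socks.socksocket  # route scraping traffic through Tor again

socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

for _ in range(3):
    req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
    print(urlopen(req).read()[:80])
    renew_tor_ip()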

answered by Mohammad Reza