 

Proxy IP for Scrapy framework

I am developing a web crawling project using Python and the Scrapy framework. It crawls approximately 10k web pages from e-commerce shopping websites. The whole project is working fine, but before moving the code from the testing server to the production server I want to choose a better proxy IP provider service, so that I don't have to worry about my IP being blocked or my spiders being denied access to websites.

Until now I have been using a middleware in Scrapy to manually rotate IPs from the free proxy lists available on various websites.
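A middleware along these lines might look like the sketch below. The proxy addresses are placeholders, and `RandomProxyMiddleware` is a hypothetical name; Scrapy's built-in `HttpProxyMiddleware` routes each request through whatever URL is in `request.meta["proxy"]`:

```python
import random

# Hypothetical free-proxy list; real entries would come from proxy-list sites.
PROXY_LIST = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

class RandomProxyMiddleware:
    """Downloader middleware that picks a random proxy for every request.

    Scrapy's built-in HttpProxyMiddleware then routes the request through
    whatever URL is stored in request.meta["proxy"].
    """

    def __init__(self, proxies):
        self.proxies = proxies

    def process_request(self, request, spider):
        # Assign a fresh random proxy to this request.
        request.meta["proxy"] = random.choice(self.proxies)
```

The middleware would be enabled via `DOWNLOADER_MIDDLEWARES` in the project settings, ordered before `HttpProxyMiddleware`.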

Now I am confused about which option I should choose:

  1. Buy premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/

  2. Use TOR

  3. Use VPN Service like http://www.hotspotshield.com/

  4. Any Option better than above three

asked Oct 18 '13 by Binit Singh

People also ask

Does Scrapy use proxy?

As a web scraping tool, Scrapy has support for proxies, and you will most likely make use of proxies in your scraping project.
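For example, the quickest way to route a whole crawl through a single proxy is via the standard proxy environment variables, which Scrapy's built-in `HttpProxyMiddleware` reads at startup (the address below is a placeholder):

```python
import os

# Placeholder proxy address; substitute a real endpoint from your provider.
os.environ["http_proxy"] = "http://127.0.0.1:8080"

# Scrapy's HttpProxyMiddleware reads http_proxy/https_proxy from the
# environment at startup; alternatively, set request.meta["proxy"] per
# request from a custom downloader middleware.
print(os.environ["http_proxy"])
```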

How do you connect to Scrapy?

While working with Scrapy, you first need to create a Scrapy project. Within the project, create a spider to fetch the data: move to the spiders folder and create a Python file there, for example gfgfetch.py.

What is a rotating proxy?

A rotating proxy is a proxy server that assigns a new IP address from its pool for every connection. That means you can launch a script that sends 10,000 requests to any number of sites and get 10,000 different IP addresses.


2 Answers

Here are the options I'm currently using (depending on my needs):

  • proxymesh.com - reasonable prices for smaller projects. I have never had any issues with the service, as it works out of the box with Scrapy (I'm not affiliated with them)
  • a self-built script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create a SOCKS proxy connection; those connections are piped through DeleGate to create normal HTTP proxies, which are usable with Scrapy. The HTTP proxies can either be load-balanced with something like HAProxy, or you can build yourself a custom middleware that rotates proxies

The latter solution is what currently works best for me and pushes around 20-30 GB of traffic per day without any problems.
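The load-balancing variant of that setup could be fronted by HAProxy with a minimal config along these lines (the bind port and backend addresses are placeholders; each backend is assumed to be a local HTTP proxy created by piping an SSH SOCKS tunnel, e.g. `ssh -N -D`, through DeleGate):

```
# haproxy.cfg sketch: round-robin across two local HTTP proxies
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend proxy_pool
    bind *:8888
    default_backend proxies

backend proxies
    balance roundrobin
    server ec2_a 127.0.0.1:8081
    server ec2_b 127.0.0.1:8082
```

Scrapy would then point all requests at port 8888 and HAProxy spreads them across the tunnels.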

answered Oct 19 '22 by herrherr


Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned, and it is used to crawl very large and high-profile websites.

Disclaimer: I work for the parent company, Scrapinghub, which also employs the core developers of Scrapy.

answered Oct 19 '22 by R. Max