 

Proxy IP for Scrapy framework

I am developing a web crawling project using Python and the Scrapy framework. It crawls approximately 10k web pages from e-commerce shopping websites. The whole project is working fine, but before moving the code from the testing server to the production server I want to choose a better proxy IP provider service, so that I don't have to worry about my IP being blocked or my spiders being denied access to websites.

Until now I have been using a middleware in Scrapy to manually rotate IPs from the free proxy lists available on various websites.
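A middleware along these lines might look like the sketch below. The proxy addresses are placeholders, and `RandomProxyMiddleware` is a hypothetical name; Scrapy's built-in `HttpProxyMiddleware` routes each request through whatever URL is in `request.meta["proxy"]`:

```python
import random

# Hypothetical free-proxy list; real entries would come from proxy-list sites.
PROXY_LIST = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

class RandomProxyMiddleware:
    """Downloader middleware that picks a random proxy for every request.

    Scrapy's built-in HttpProxyMiddleware then routes the request through
    whatever URL is stored in request.meta["proxy"].
    """

    def __init__(self, proxies):
        self.proxies = proxies

    def process_request(self, request, spider):
        # Assign a fresh random proxy to this request.
        request.meta["proxy"] = random.choice(self.proxies)
```

The middleware would be enabled via `DOWNLOADER_MIDDLEWARES` in the project settings, ordered before `HttpProxyMiddleware`.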

Now I am confused about which option I should choose:

  1. Buy premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/

  2. Use TOR

  3. Use VPN Service like http://www.hotspotshield.com/

  4. Any Option better than above three

asked Oct 18 '13 by Binit Singh

People also ask

Does Scrapy use proxy?

As a web scraping tool, Scrapy has support for proxies, and you will most likely make use of proxies in your scraping project.
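For example, the quickest way to route a whole crawl through a single proxy is via the standard proxy environment variables, which Scrapy's built-in `HttpProxyMiddleware` reads at startup (the address below is a placeholder):

```python
import os

# Placeholder proxy address; substitute a real endpoint from your provider.
os.environ["http_proxy"] = "http://127.0.0.1:8080"

# Scrapy's HttpProxyMiddleware reads http_proxy/https_proxy from the
# environment at startup; alternatively, set request.meta["proxy"] per
# request from a custom downloader middleware.
print(os.environ["http_proxy"])
```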

How do you connect to Scrapy?

While working with Scrapy, you first need to create a Scrapy project. Within the project, create a spider to fetch the data: move to the spiders folder and create a Python file there, for example gfgfetch.py.

What is a rotating proxy?

A rotating proxy is a proxy server that assigns a new IP address from its pool for every connection. That means you can launch a script that sends 10,000 requests to any number of sites and get 10,000 different IP addresses.


2 Answers

Here are the options I'm currently using (depending on my needs):

  • proxymesh.com - reasonable prices for smaller projects. I have never had any issues with the service, as it works out of the box with Scrapy (I'm not affiliated with them)
  • a self-built script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create a SOCKS proxy connection; those connections are piped through DeleGate to create normal HTTP proxies, which are usable with Scrapy. The HTTP proxies can either be load-balanced with something like HAProxy, or you can build yourself a custom middleware that rotates proxies

The latter solution is what currently works best for me and pushes around 20-30 GB of traffic per day without any problems.
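The load-balancing variant of that setup could be fronted by HAProxy with a minimal config along these lines (the bind port and backend addresses are placeholders; each backend is assumed to be a local HTTP proxy created by piping an SSH SOCKS tunnel, e.g. `ssh -N -D`, through DeleGate):

```
# haproxy.cfg sketch: round-robin across two local HTTP proxies
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend proxy_pool
    bind *:8888
    default_backend proxies

backend proxies
    balance roundrobin
    server ec2_a 127.0.0.1:8081
    server ec2_b 127.0.0.1:8082
```

Scrapy would then point all requests at port 8888 and HAProxy spreads them across the tunnels.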

answered Oct 19 '22 by herrherr


Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned, and it is used to crawl very large and high-profile websites.

Disclaimer: I work for the parent company, Scrapinghub, which also employs the core developers of Scrapy.

answered Oct 19 '22 by R. Max