
Set a proxy to hide my IP address when scraping web pages with Scrapy

Tags:

web-scraping

I am using Scrapy to crawl a website, and I now need to set a proxy to handle the requests that are sent. Can anyone help me set a proxy in a Scrapy app? Please share a sample link too if you have one. I also need a way to find out which IP address the requests are going out from.

Kavi Rajan asked Mar 22 '12 10:03


People also ask

What is proxy in web scraping?

A proxy is essentially a middleman server that sits between the client and the server. Proxies have many uses, such as optimizing connection routes, but in web scraping they are most commonly used to disguise the client's IP address (identity).

How do I know if my proxy is working in Scrapy?

Scrapy has a built-in downloader middleware, HttpProxyMiddleware, which applies the proxy to each request object that passes through it. Test the proxy before you rely on it: send a request to a test site that reports the caller's IP address. If the site shows the IP address of your proxy and not your actual IP, the proxy is working.
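The check described above can be sketched roughly as follows. This is only a minimal sketch: it assumes the test site is an IP echo service that answers with JSON shaped like {"origin": "<ip>"} (the format used by httpbin.org/ip), and the helper names reported_ips and is_ip_hidden are made up for illustration.

```python
import json


def reported_ips(body):
    """Return the IP addresses an echo service says the request came from.

    Assumes a JSON body shaped like {"origin": "203.0.113.7"}; some services
    list several comma-separated addresses when forwarding headers are set.
    """
    origin = json.loads(body)["origin"]
    return [ip.strip() for ip in origin.split(",")]


def is_ip_hidden(body, real_ip):
    """Return True if the real client IP does not appear in the echo response."""
    return real_ip not in reported_ips(body)
```

In a spider, you could request the echo URL through the proxy and pass response.text to reported_ips inside the callback, then log the result to see which IP the request went out from.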


1 Answer

You can do it through the code below found here:

1 – Create a new file called middlewares.py in your Scrapy project and add the following code to it.

# base64 is needed ONLY if the proxy requires authentication
import base64


class ProxyMiddleware:
    # Scrapy calls this hook for every outgoing request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode works on bytes
        # and, unlike the removed base64.encodestring, adds no trailing newline.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2 – Open your project’s configuration file (./project_name/settings.py) and add the following code:

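DOWNLOADER_MIDDLEWARES = {
    # Built-in proxy middleware; the scrapy.contrib.* path was removed from newer Scrapy releases
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Lower number runs first, so the custom middleware sets the proxy beforehand
    'project_name.middlewares.ProxyMiddleware': 100,
}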
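As an alternative to building the Proxy-Authorization header by hand, the credentials can be embedded in the proxy URL itself, which Scrapy's built-in HttpProxyMiddleware reads from request.meta['proxy']. The helper below is just a sketch of how such a URL could be assembled; the function name proxy_url_with_credentials is hypothetical, and percent-encoding the credentials is an assumption about how special characters should be carried.

```python
from urllib.parse import quote


def proxy_url_with_credentials(host, port, username, password, scheme="http"):
    """Build a proxy URL of the form scheme://user:password@host:port."""
    # Percent-encode the credentials so characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"
```

The returned string would then be assigned to request.meta['proxy'] inside process_request, in place of the plain "http://YOUR_PROXY_IP:PORT" value.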

Also, you can use multiple proxies with scrapy. More information can be found here.
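One simple way to use several proxies is to pick one at random for each request. This is a minimal sketch and not necessarily what the linked resource describes: the class name RandomProxyMiddleware and the placeholder addresses in PROXIES are assumptions for illustration only.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware sketch that assigns a random proxy per request."""

    # Hypothetical placeholder addresses; replace them with real proxies.
    PROXIES = [
        "http://PROXY_ONE_IP:PORT",
        "http://PROXY_TWO_IP:PORT",
    ]

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen for this request, if any.
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.PROXIES)
        # Returning None lets Scrapy continue processing the request.
        return None
```

It would be registered in DOWNLOADER_MIDDLEWARES the same way as ProxyMiddleware above, for example under 'project_name.middlewares.RandomProxyMiddleware'.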

Thanasis Petsas answered Sep 22 '22 20:09
