Get proxy response in middleware

I have the following problem with Scrapy in my middleware:

I make a request to an HTTPS site through a proxy. In a downloader middleware's process_response, response.headers only contains the headers from the website itself. Is there any way to get the headers from the CONNECT response that establishes the proxy tunnel? The proxy we are using adds some information as headers to that response, and we want to use it in the middleware. I found out that in TunnelingTCP4ClientEndpoint.processProxyResponse the parameter rcvd_bytes has all the information I need, but I didn't find a way to get rcvd_bytes into my middleware.

I also found a similar (possibly the same) issue from a year ago which is not solved: Not receiving headers Scrapy ProxyMesh

Here is the example from the proxy website:

For HTTPS, the IP is in the CONNECT response header x-hola-ip. Example for a proxy peer IP of 5.6.7.8:

Request:
CONNECT example.com:80 HTTP/1.1
Host: example.com:80
Accept: */*

Response:
HTTP/1.1 200 OK
Content-Type: text/html
x-hola-ip: 5.6.7.8

I want to get x-hola-ip in this example.

When using curl, e.g. curl --proxy mysuperproxy https://stackoverflow.com, I also get the right data in the CONNECT response.

If this is not possible, my fallback is to monkey patch the class somehow, but maybe you know a better solution for this in Python.

Thanks in advance for your help.

Note: I also posted this question on Scrapy's GitHub issues; I will update both places if I find a solution :)

Working solution with the help of Matthew:

from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingTCP4ClientEndpoint, TunnelingAgent
)
from scrapy import twisted_version

class MyHTTPDownloader(HTTP11DownloadHandler):
    # holds the raw bytes of the last CONNECT response from the proxy
    i = b''

    def download_request(self, request, spider):
        # we build the agent ourselves so we can swap in our tunneling agent
        agent = ScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
            maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
            warnsize=getattr(spider, 'download_warnsize', self._default_warnsize),
            fail_on_dataloss=self._fail_on_dataloss)
        agent._TunnelingAgent = MyTunnelingAgent
        return agent.download_request(request)

class MyTunnelingAgent(TunnelingAgent):
    # the signature of _getEndpoint changed in Twisted 15.0
    if twisted_version >= (15, 0, 0):
        def _getEndpoint(self, uri):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, uri.host, uri.port, self._proxyConf,
                self._contextFactory, self._endpointFactory._connectTimeout,
                self._endpointFactory._bindAddress)
    else:
        def _getEndpoint(self, scheme, host, port):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, host, port, self._proxyConf,
                self._contextFactory, self._connectTimeout,
                self._bindAddress)

class MyTunnelingTCP4ClientEndpoint(TunnelingTCP4ClientEndpoint):
    def processProxyResponse(self, rcvd_bytes):
        # stash the raw CONNECT response so middlewares can read it
        MyHTTPDownloader.i = rcvd_bytes
        return super(MyTunnelingTCP4ClientEndpoint, self).processProxyResponse(rcvd_bytes)

And in your settings:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
    'https': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
}
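The captured bytes still need to be parsed and handed to the spider. Here is a minimal sketch of that last step; parse_connect_response, ProxyPeerIpMiddleware, and the meta key proxy_peer_ip are illustrative names, not part of the original solution, and the lazy import assumes the crawler.MyHTTPDownloader module path from the settings above:

```python
def parse_connect_response(rcvd_bytes):
    """Parse raw CONNECT response bytes into a dict of lower-cased headers."""
    headers = {}
    # Headers end at the first blank line; the first line is the status line.
    head = rcvd_bytes.split(b'\r\n\r\n', 1)[0]
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if value:
            headers[name.strip().lower()] = value.strip()
    return headers


class ProxyPeerIpMiddleware:
    """Downloader middleware sketch: copy the CONNECT response header
    x-hola-ip into request.meta so the spider can read it."""

    def process_response(self, request, response, spider):
        # Lazy import so this module stays importable on its own;
        # the path mirrors the DOWNLOAD_HANDLERS entry above.
        from crawler.MyHTTPDownloader import MyHTTPDownloader
        if MyHTTPDownloader.i:
            headers = parse_connect_response(MyHTTPDownloader.i)
            request.meta['proxy_peer_ip'] = headers.get(b'x-hola-ip')
        return response
```

Note that a single class attribute only holds the most recent CONNECT response, so with CONCURRENT_REQUESTS above 1 the bytes may belong to a different request than the one being processed.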
asked May 13 '26 19:05 by Bernd
1 Answer

I saw in #3329 that someone from Scrapinghub said it is unlikely they will add that feature, and recommended creating a custom subclass to get the behavior that you wanted. So, with that in mind:

I believe that after you create the subclass, you can tell Scrapy to use it by setting the http and https keys in DOWNLOAD_HANDLERS to point to your subclass.

Bear in mind that I don't have a local http proxy that sends extra headers to test, so this is just a "napkin sketch" of what I think needs to happen:

from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingAgent,
)

class MyHTTPDownloader(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        # we're just overriding here to monkey patch the attribute
        ScrapyAgent._TunnelingAgent = MyTunnelingAgent
        return super(MyHTTPDownloader, self).download_request(request, spider)

class MyTunnelingAgent(TunnelingAgent):
    # ... and here is where it would get weird

That last bit waves hands because I believe I have a clear understanding of the methods one needs to override to capture the bytes you want, but I don't have enough of the Twisted framework in my head to know where to put them in order to expose them to the Response that goes back to the spider.
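To make the hand-waved part concrete without depending on Scrapy internals, the pattern is: subclass the endpoint, override processProxyResponse to stash the raw bytes somewhere reachable, then delegate to the parent. A Scrapy-free illustration, where DummyEndpoint stands in for TunnelingTCP4ClientEndpoint and ConnectResponseHolder is an invented holder class:

```python
class DummyEndpoint:
    """Stand-in for Scrapy's TunnelingTCP4ClientEndpoint."""
    def processProxyResponse(self, rcvd_bytes):
        # stand-in for the real tunnel-validation logic
        return rcvd_bytes


class ConnectResponseHolder:
    """Shared place a middleware could later read the bytes from."""
    raw = b''


class CapturingEndpoint(DummyEndpoint):
    def processProxyResponse(self, rcvd_bytes):
        # stash the raw CONNECT response before delegating to the parent
        ConnectResponseHolder.raw = rcvd_bytes
        return super().processProxyResponse(rcvd_bytes)
```

The working solution in the question applies exactly this shape to the real Scrapy classes.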

answered May 15 '26 08:05 by mdaniel

