
"Error while receiving a control message (SocketClosed): empty socket content" in Tor's stem controller

I'm working on a scraper which uses Tor, of which a simplified version is in this example project: https://github.com/khpeek/scraper-compose. The project has the following (simplified) structure:

.
├── docker-compose.yml
├── privoxy
│   ├── config
│   └── Dockerfile
├── scraper
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── tutorial
│   │   ├── scrapy.cfg
│   │   └── tutorial
│   │       ├── extensions.py
│   │       ├── __init__.py
│   │       ├── items.py
│   │       ├── middlewares.py
│   │       ├── pipelines.py
│   │       ├── settings.py
│   │       ├── spiders
│   │       │   ├── __init__.py
│   │       │   └── quotes_spider.py
│   │       └── tor_controller.py
│   └── wait-for
│       └── wait-for
└── tor
    ├── Dockerfile
    └── torrc

The spider, defined in quotes_spider.py, is a very simple one based on the Scrapy tutorial:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/{n}/'.format(n=n) for n in range(1, 3)]

    custom_settings = {
                       'TOR_RENEW_IDENTITY_ENABLED': True,
                       'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 5
                       }

    download_delay = 2    # Wait 2 seconds (actually a random time between 1 and 3 seconds) between downloading pages


    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

In settings.py, I've activated a Scrapy extension with the lines

EXTENSIONS = {
   'tutorial.extensions.TorRenewIdentity': 1,
}

where extensions.py is

import logging
import random
from scrapy import signals
from scrapy.exceptions import NotConfigured

import tutorial.tor_controller as tor_controller

logger = logging.getLogger(__name__)

class TorRenewIdentity(object):

    def __init__(self, crawler, item_count):
        self.crawler = crawler
        self.item_count = self.randomize(item_count)    # Randomize the item count to confound traffic analysis
        self._item_count = item_count                   # Also remember the given item count for future randomizations
        self.items_scraped = 0

        # Connect the extension object to signals
        self.crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @staticmethod
    def randomize(item_count, min_factor=0.5, max_factor=1.5):
        '''Randomize the number of items scraped before changing identity. (A similar technique is applied to Scrapy's DOWNLOAD_DELAY setting).'''
        randomized_item_count = random.randint(int(min_factor*item_count), int(max_factor*item_count))
        logger.info("The crawler will scrape the following (randomized) number of items before changing identity (again): {}".format(randomized_item_count))
        return randomized_item_count

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('TOR_RENEW_IDENTITY_ENABLED'):
            raise NotConfigured

        item_count = crawler.settings.getint('TOR_ITEMS_TO_SCRAPE_PER_IDENTITY', 50)

        return cls(crawler=crawler, item_count=item_count)          # Instantiate the extension object

    def item_scraped(self, item, spider):
        '''When item_count items are scraped, pause the engine and change IP address.'''
        self.items_scraped += 1
        if self.items_scraped == self.item_count:
            logger.info("Scraped {item_count} items. Pausing engine while changing identity...".format(item_count=self.item_count))

            self.crawler.engine.pause()

            tor_controller.change_identity()                        # Change IP address (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor)
            self.items_scraped = 0                                  # Reset the counter
            self.item_count = self.randomize(self._item_count)      # Generate a new random number of items to scrape before changing identity again

            self.crawler.engine.unpause()

and tor_controller.py is

import logging
import sys
import socket
import time
import requests
import stem
import stem.control

# Tor settings
TOR_ADDRESS = socket.gethostbyname("tor")           # The Docker-Compose service in which this code is running should be linked to the "tor" service.
TOR_CONTROL_PORT = 9051         # This is configured in /etc/tor/torrc by the line "ControlPort 9051" (or by launching Tor with "tor -controlport 9051")
TOR_PASSWORD = "foo"            # The Tor password is written in the docker-compose.yml file. (It is passed as a build argument to the 'tor' service).

# Privoxy settings
PRIVOXY_ADDRESS = "privoxy"     # This assumes this code is running in a Docker-Compose service linked to the "privoxy" service
PRIVOXY_PORT = 8118             # This is determined by the "listen-address" in Privoxy's "config" file
HTTP_PROXY = 'http://{address}:{port}'.format(address=PRIVOXY_ADDRESS, port=PRIVOXY_PORT)

logger = logging.getLogger(__name__)


class TorController(object):
    def __init__(self):
        self.controller = stem.control.Controller.from_port(address=TOR_ADDRESS, port=TOR_CONTROL_PORT)
        self.controller.authenticate(password=TOR_PASSWORD)
        self.session = requests.Session()
        self.session.proxies = {'http': HTTP_PROXY}

    def request_ip_change(self):
        self.controller.signal(stem.Signal.NEWNYM)

    def get_ip(self):
        '''Check what the current IP address is (as seen by IPEcho).'''
        return self.session.get('http://ipecho.net/plain').text

    def change_ip(self):
        '''Signal a change of IP address and wait for confirmation from IPEcho.net'''
        current_ip = self.get_ip()
        logger.debug("Initializing change of identity from the current IP address, {current_ip}".format(current_ip=current_ip))
        self.request_ip_change()
        while True:
            new_ip = self.get_ip()
            if new_ip == current_ip:
                logger.debug("The IP address is still the same. Waiting for 1 second before checking again...")
                time.sleep(1)
            else:
                break
        logger.debug("The IP address has been changed from {old_ip} to {new_ip}".format(old_ip=current_ip, new_ip=new_ip))
        return new_ip

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.controller.close()


def change_identity():
    with TorController() as tor_controller:
        tor_controller.change_ip()

If I start crawling using docker-compose build followed by docker-compose up, the extension by and large works: according to the logs, it successfully changes IP addresses and continues scraping.

What irks me, however, is that during the period the engine is paused, I see error messages such as

scraper_1  | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): empty socket content

followed by

scraper_1  | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"

What is causing these errors? Since they have the INFO level, perhaps I can just ignore them? (I've looked a bit at Stem's source code at https://gitweb.torproject.org/stem.git/, but so far haven't been able to get a handle on what is happening).

asked May 12 '17 by Kurt Peek

3 Answers

You can get Stem's Logger instance and turn it off like this:

from stem.util.log import get_logger

logger = get_logger()
logger.propagate = False
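Since Stem's get_logger() appears to return the standard-library logger named "stem", the same silencing can be sketched with plain logging, without stem installed (a minimal demonstration; the list-capturing handler is just there to show that nothing propagates):

```python
import logging

# Stem emits its messages through the standard-library logger named "stem".
# Setting propagate = False keeps those records from reaching the root
# logger's handlers, which is what silences the console output.
stem_logger = logging.getLogger('stem')
stem_logger.setLevel(logging.INFO)
stem_logger.propagate = False

# Capture whatever reaches the root logger, to show nothing gets through.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

logging.getLogger().addHandler(ListHandler())

stem_logger.info("Error while receiving a control message (SocketClosed)")
print(captured)  # [] -- the stem record never propagated to the root handler
```

Note that this leaves the "stem" logger itself intact, so you could still attach your own handler to it if you wanted those messages in a separate log file.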
answered Sep 30 '22 by asvatov


I don't know whether you've already reached a conclusion on this issue.

I was actually getting the same log message as you. My Scrapy project performed well, and the IP rotation using Tor and Privoxy worked successfully as well. I just kept getting the log message INFO : [stem] Error while receiving a control message (SocketClosed): empty socket content, and it was bugging the hell out of me.

I spent some time digging around to find out what causes it, and whether it is okay to ignore (after all, it is just an info message, not an error message).

The bottom line is that I don't know what causes it, but I felt it was safe enough to ignore.

As the log says, the socket content (which is actually Stem's control file, containing the relevant information about the socket connection) is empty. When the control file is empty, the socket connection gets closed (according to the Python socket documentation). I am not sure what causes the control file to become empty and the connection to close. However, even if the socket connection really is closed, it apparently gets reopened successfully, since my Scrapy crawl and IP rotation keep working. Though I couldn't find the true cause, I can only guess at a few: (1) the Tor network is unstable, or (2) when your code runs controller.signal(Signal.NEWNYM), the socket is temporarily closed and opened again, or some other reason that I cannot think of at the moment.

answered Sep 30 '22 by btaek


I am also in a similar situation and realized that the log messages had no impact on IP rotation or crawling. However, these log messages are annoying and can be silenced using a decorator:

import logging

def silence(func):
    '''Temporarily disable the 'stem' logger while the wrapped function runs.'''
    def wrapper(*args, **kwargs):
        logger = logging.getLogger('stem')
        logger.disabled = True
        try:
            return func(*args, **kwargs)
        finally:
            logger.disabled = False    # Logger objects have no 'enabled' attribute; reset 'disabled' instead
    return wrapper
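A minimal self-contained sketch of the decorator in use (noisy_call is a hypothetical stand-in for something like tor_controller.change_identity(), which triggers Stem's INFO messages):

```python
import logging

def silence(func):
    '''Temporarily disable the 'stem' logger while the wrapped function runs.'''
    def wrapper(*args, **kwargs):
        logger = logging.getLogger('stem')
        logger.disabled = True
        try:
            return func(*args, **kwargs)
        finally:
            logger.disabled = False  # re-enable the logger afterwards, even on exceptions
    return wrapper

@silence
def noisy_call():
    # Stand-in for the identity change, which would emit stem's INFO messages
    logging.getLogger('stem').info("Error while receiving a control message (SocketClosed)")
    return "identity changed"

result = noisy_call()  # the stem INFO message is suppressed during the call
```

Using try/finally ensures the logger is re-enabled even if the wrapped call raises, so a failed identity change doesn't leave Stem's logging permanently off.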
answered Sep 30 '22 by James Taylor