I'm working on a scraper that uses Tor; a simplified version of it is in this example project: https://github.com/khpeek/scraper-compose. The project has the following (simplified) structure:
.
├── docker-compose.yml
├── privoxy
│   ├── config
│   └── Dockerfile
├── scraper
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── tutorial
│   │   ├── scrapy.cfg
│   │   └── tutorial
│   │       ├── extensions.py
│   │       ├── __init__.py
│   │       ├── items.py
│   │       ├── middlewares.py
│   │       ├── pipelines.py
│   │       ├── settings.py
│   │       ├── spiders
│   │       │   ├── __init__.py
│   │       │   └── quotes_spider.py
│   │       └── tor_controller.py
│   └── wait-for
│       └── wait-for
└── tor
    ├── Dockerfile
    └── torrc
The spider, defined in quotes_spider.py, is a very simple one based on the Scrapy Tutorial:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/{n}/'.format(n=n) for n in range(1, 3)]

    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 5
    }

    download_delay = 2    # Wait 2 seconds (actually a random time between 1 and 3 seconds) between downloading pages

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
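The QuoteItem class imported above lives in tutorial/items.py, which the listing omits. For completeness, a minimal definition consistent with the fields filled in parse() would look like the following sketch (my reconstruction, not necessarily the project's actual items.py):

import scrapy

class QuoteItem(scrapy.Item):
    # Fields matching those populated in QuotesSpider.parse()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()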
In settings.py, I've activated a Scrapy extension with the lines
EXTENSIONS = {
    'tutorial.extensions.TorRenewIdentity': 1,
}
where extensions.py is
import logging
import random

from scrapy import signals
from scrapy.exceptions import NotConfigured

import tutorial.tor_controller as tor_controller

logger = logging.getLogger(__name__)

class TorRenewIdentity(object):

    def __init__(self, crawler, item_count):
        self.crawler = crawler
        self.item_count = self.randomize(item_count)    # Randomize the item count to confound traffic analysis
        self._item_count = item_count                   # Also remember the given item count for future randomizations
        self.items_scraped = 0

        # Connect the extension object to signals
        self.crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @staticmethod
    def randomize(item_count, min_factor=0.5, max_factor=1.5):
        '''Randomize the number of items scraped before changing identity. (A similar technique is applied to Scrapy's DOWNLOAD_DELAY setting).'''
        randomized_item_count = random.randint(int(min_factor*item_count), int(max_factor*item_count))
        logger.info("The crawler will scrape the following (randomized) number of items before changing identity (again): {}".format(randomized_item_count))
        return randomized_item_count

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('TOR_RENEW_IDENTITY_ENABLED'):
            raise NotConfigured

        item_count = crawler.settings.getint('TOR_ITEMS_TO_SCRAPE_PER_IDENTITY', 50)
        return cls(crawler=crawler, item_count=item_count)      # Instantiate the extension object

    def item_scraped(self, item, spider):
        '''When item_count items are scraped, pause the engine and change IP address.'''
        self.items_scraped += 1

        if self.items_scraped == self.item_count:
            logger.info("Scraped {item_count} items. Pausing engine while changing identity...".format(item_count=self.item_count))

            self.crawler.engine.pause()
            tor_controller.change_identity()                        # Change IP address (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor)
            self.items_scraped = 0                                  # Reset the counter
            self.item_count = self.randomize(self._item_count)      # Generate a new random number of items to scrape before changing identity again
            self.crawler.engine.unpause()
and tor_controller.py is
import logging
import sys
import socket
import time

import requests
import stem
import stem.control

# Tor settings
TOR_ADDRESS = socket.gethostbyname("tor")    # The Docker-Compose service in which this code is running should be linked to the "tor" service.
TOR_CONTROL_PORT = 9051                      # This is configured in /etc/tor/torrc by the line "ControlPort 9051" (or by launching Tor with "tor -controlport 9051")
TOR_PASSWORD = "foo"                         # The Tor password is written in the docker-compose.yml file. (It is passed as a build argument to the 'tor' service).

# Privoxy settings
PRIVOXY_ADDRESS = "privoxy"                  # This assumes this code is running in a Docker-Compose service linked to the "privoxy" service
PRIVOXY_PORT = 8118                          # This is determined by the "listen-address" in Privoxy's "config" file
HTTP_PROXY = 'http://{address}:{port}'.format(address=PRIVOXY_ADDRESS, port=PRIVOXY_PORT)

logger = logging.getLogger(__name__)

class TorController(object):
    def __init__(self):
        self.controller = stem.control.Controller.from_port(address=TOR_ADDRESS, port=TOR_CONTROL_PORT)
        self.controller.authenticate(password=TOR_PASSWORD)
        self.session = requests.Session()
        self.session.proxies = {'http': HTTP_PROXY}

    def request_ip_change(self):
        self.controller.signal(stem.Signal.NEWNYM)

    def get_ip(self):
        '''Check what the current IP address is (as seen by IPEcho).'''
        return self.session.get('http://ipecho.net/plain').text

    def change_ip(self):
        '''Signal a change of IP address and wait for confirmation from IPEcho.net'''
        current_ip = self.get_ip()
        logger.debug("Initializing change of identity from the current IP address, {current_ip}".format(current_ip=current_ip))
        self.request_ip_change()
        while True:
            new_ip = self.get_ip()
            if new_ip == current_ip:
                logger.debug("The IP address is still the same. Waiting for 1 second before checking again...")
                time.sleep(1)
            else:
                break
        logger.debug("The IP address has been changed from {old_ip} to {new_ip}".format(old_ip=current_ip, new_ip=new_ip))
        return new_ip

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.controller.close()

def change_identity():
    with TorController() as tor_controller:
        tor_controller.change_ip()
If I start crawling using docker-compose build followed by docker-compose up, by and large the extension works: according to the logs it successfully changes IP address and continues scraping.
What irks me, however, is that during the period the engine is paused, I see error messages such as
scraper_1 | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): empty socket content
followed by
scraper_1 | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
What is causing these errors? Since they have the INFO level, perhaps I can just ignore them? (I've looked a bit at Stem's source code at https://gitweb.torproject.org/stem.git/, but so far haven't been able to get a handle on what is happening).
You can get stem's logger instance and turn it off like this:
from stem.util.log import get_logger

logger = get_logger()
logger.propagate = False    # stop stem's log records from propagating to the root logger
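Alternatively (my own suggestion, not part of the answer above), you can leave the logger wired up but raise its threshold so that only warnings and errors get through; get_logger() simply returns the standard logging.Logger named 'stem':

import logging
from stem.util.log import get_logger

# Keep stem's logger enabled but hide INFO-level chatter such as the SocketClosed messages
get_logger().setLevel(logging.WARNING)
# equivalently, via the standard logging module:
logging.getLogger('stem').setLevel(logging.WARNING)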
I don't know whether you've since reached a conclusion about your issue.
I was getting the same log message as you. My Scrapy project performed well, and the IP rotation using Tor and Privoxy worked successfully as well. I just kept getting the log message INFO: [stem] Error while receiving a control message (SocketClosed): empty socket content, and that was bugging the hell out of me.
I spent some time digging around to find out what causes it and whether it is okay to ignore it (after all, it is just an info message, not an error message).
The bottom line is that I don't know what causes it, but I felt it was safe enough to ignore it.
As the log says, the socket content (which is actually stem's control_file, containing the relevant information about the socket connection) is empty. When control_file is empty, the socket connection gets closed (according to the Python socket documentation). I am not sure what causes control_file to become empty so that the socket connection is closed. However, even if the socket connection really is closed, it seems to get reopened successfully, since my Scrapy crawling job and IP rotation keep working. Though I couldn't find the true cause, I can only guess at a few: (1) the Tor network is unstable, or (2) when your code runs controller.signal(Signal.NEWNYM), the socket is temporarily closed and opened again, or some other reason I can't think of at the moment.
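If the second guess is right and the SocketClosed messages are a side effect of sending NEWNYM, one thing worth checking is whether Tor is even ready to hand out a new identity before signalling it. Below is a minimal sketch of that idea (my addition, using stem's is_newnym_available() and get_newnym_wait(); it reuses the TOR_ADDRESS, TOR_CONTROL_PORT and TOR_PASSWORD constants from tor_controller.py above):

import time
import stem
import stem.control

def request_new_identity(address, port, password):
    '''Send NEWNYM only once Tor reports that it will honour it.'''
    with stem.control.Controller.from_port(address=address, port=port) as controller:
        controller.authenticate(password=password)
        if not controller.is_newnym_available():
            # Tor rate-limits NEWNYM; wait out the remaining cooldown first
            time.sleep(controller.get_newnym_wait())
        controller.signal(stem.Signal.NEWNYM)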
I was in a similar situation and realized that the log messages had no impact on IP rotation or crawling. However, they are annoying and can be silenced using a decorator:
import logging

def silence(func):
    '''Disable stem's logger for the duration of the decorated call.'''
    def wrapper(*args, **kwargs):
        logger = logging.getLogger('stem')
        logger.disabled = True
        try:
            return func(*args, **kwargs)
        finally:
            logger.disabled = False    # note: 'logger.enabled' is not a logging attribute; re-enable this way
    return wrapper
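For example (the placement is my own illustration, not part of the answer), you could wrap the call that triggers the identity change, so stem's logger is only disabled while NEWNYM is being signalled:

import tutorial.tor_controller as tor_controller

@silence
def change_identity_quietly():
    # stem's logger is disabled only for the duration of this call
    tor_controller.change_identity()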