
scrapy not printing out stacktrace on exception

Is there a special mechanism to force Scrapy to print out every Python exception and its stack trace?

I made a simple mistake of getting a list attribute wrong, which raised an AttributeError that did not show up in full in the logs. What showed up was:

2015-11-15 22:13:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40342,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 15, 22, 13, 50, 860480),
 'log_count/CRITICAL': 1,
 'log_count/DEBUG': 1,
 'log_count/INFO': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2015, 11, 15, 22, 13, 49, 222371)}

So it showed an AttributeError count of 1 but didn't tell me where or how it happened. I had to manually place ipdb.set_trace() in the code to find out where the error occurred. Scrapy itself carried on with its other threads without printing anything.

ipdb>
AttributeError: "'list' object has no attribute 'match'"
> /Users/username/Programming/regent/regentscraper/spiders/regent_spider.py(139)request_listing_detail_pages_from_listing_id_list()
    138             volatile_props = ListingScanVolatilePropertiesItem()
--> 139             volatile_props['position_in_search'] = list_of_listing_ids.match(listing_id) + rank_of_first_item_in_page
    140

Scrapy settings:

# -*- coding: utf-8 -*-

# Scrapy settings for regentscraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

import sys
import os
import django
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__name__), os.pardir)))

print sys.path

os.environ['DJANGO_SETTINGS_MODULE'] = 'regent.settings'
django.setup()  #new for Django 1.8



BOT_NAME = 'regentscraper'

SPIDER_MODULES = ['regentscraper.spiders']
NEWSPIDER_MODULE = 'regentscraper.spiders'


ITEM_PIPELINES = {
   'regentscraper.pipelines.ListingScanPipeline': 300,
}
asked Feb 24 '26 00:02 by dowjones123


1 Answer

I ran into the same problem as described above. I am using the following versions in my environment:

  • Django (1.11.4)
  • Scrapy (1.4.0)
  • scrapy-djangoitem (1.1.1)

I solved the problem by adding "LOGGING_CONFIG = None" to the Django settings that Scrapy loads. To do this, I created a new Django settings file, settings_scrapy, with the following contents:

mysite.settings_scrapy

try:
    from mysite.settings import *
    LOGGING_CONFIG = None
except ImportError:
    pass

Then I load this settings file from Scrapy's settings file:

import sys
import os
import django
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings_scrapy'
django.setup()

After that, stack traces for exceptions in the spider and the pipeline appeared.
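A possible explanation, offered as an assumption rather than something I verified against the asker's project: by default django.setup() applies the LOGGING setting through logging.config.dictConfig, and dictConfig disables loggers that already exist unless disable_existing_loggers is set to False. Setting LOGGING_CONFIG = None makes Django skip that step. The sketch below uses only the standard library to show the dictConfig behaviour; the function name and the logger name "demo.spider" are hypothetical stand-ins for a logger created before django.setup() runs.

```python
import logging
import logging.config


def disabled_after_dictconfig(logger_name, config):
    """Create a logger, apply dictConfig, and report whether it got disabled."""
    logger = logging.getLogger(logger_name)  # the logger exists before configuration
    logger.disabled = False                  # start from a known state
    logging.config.dictConfig(config)
    return logger.disabled


# Hypothetical logger name; it stands in for a logger created before django.setup().
# Config without the key: disable_existing_loggers defaults to True.
print(disabled_after_dictconfig("demo.spider", {"version": 1}))

# Config that opts out explicitly.
print(disabled_after_dictconfig(
    "demo.spider", {"version": 1, "disable_existing_loggers": False}
))
```

The first call should report True and the second False. If this is what happens in your project, adding 'disable_existing_loggers': False to the Django LOGGING dict may be an alternative to turning Django's logging configuration off entirely, though I have not tested that with Scrapy.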

Reference

https://docs.djangoproject.com/en/1.11/topics/logging/#disabling-logging-configuration

answered Feb 25 '26 14:02 by Akihisa Oishi