
Scrapy: UNFORMATTABLE OBJECT WRITTEN TO LOG

Tags: python, scrapy

I've been stuck on this log for 3 days now:

2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-03 11:32:54-0700 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilterFieldsPipeline
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Spider opened
2014-06-03 11:32:54-0700 [NefsakLaptopSpider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-03 11:32:54-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-03 11:32:56-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Crawled (%(status)s) %(request)s (referer: %(referer)s)%(flags)s', MESSAGE LOST
2014-06-03 11:33:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-06-03 11:34:54-0700 [NefsakLaptopSpider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
More lines like the last one follow... forever, and very slowly.

The offending line, fourth from the bottom, appears only when I set Scrapy's logging level to DEBUG.

Here's the header of my spider:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request, FormRequest

class ScrapyCrawler(CrawlSpider):
  name = "ScrapyCrawler"

  def __init__(self, spiderPath, spiderID, name="ScrapyCrawler", *args, **kwargs):
    super(ScrapyCrawler, self).__init__()
    self.name = name
    self.path = spiderPath  # list of URL/Form steps to crawl, in order
    self.id = spiderID
    self.path_index = 0
    # the favicon for the scraped site will be added to the first item
    self.favicon_required = kwargs.get("downloadFavicon", True)
    self.favicon_item = None

  def start_requests(self):
    start_path = self.path.pop(0)
    # determine the callback based on the next step in the path
    callback = self.parse_intermediate if type(self.path[0]) == URL \
          else self.parse_item_pages
    if type(start_path) == URL:
      start_url = start_path
      request = Request(start_path, callback=callback)
    elif type(start_path) == Form:
      start_url = start_path.url
      request = FormRequest(start_path.url, formdata=start_path.data,
                            callback=callback)

    return [request]

  def parse_intermediate(self, response):
     ...

  def parse_item_pages(self, response):
     ...
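
Note: URL and Form are custom path-step classes from the project, not Scrapy types; the URL class appears in the answer below. A hypothetical minimal stand-in for Form, just so the snippet above reads self-contained:

class Form(object):
  # hypothetical stand-in: a form-submission step in the crawl path
  def __init__(self, url, data):
    self.url = url    # the form's action URL
    self.data = data  # dict of form fields, passed to FormRequest as formdata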

The thing is, none of the callbacks are called after start_requests().

Here's a hint: the first request out of start_requests() is to a page like http://www.example.com. If I change http to https, this causes a redirect in Scrapy and the log changes to this:

2014-06-03 12:00:51-0700 [NefsakLaptopSpider] UNFORMATTABLE OBJECT WRITTEN TO LOG with fmt 'DEBUG: Redirecting (%(reason)s) to %(redirected)s from %(request)s', MESSAGE LOST
2014-06-03 12:00:51-0700 [NefsakLaptopSpider] DEBUG: Redirecting (302) to <GET http://www.nefsak.com/home.php?cat=58> from <GET http://www.nefsak.com/home.php?cat=58&xid_be279=248933808671e852497b0b1b33333a8b>
2014-06-03 12:00:52-0700 [NefsakLaptopSpider] DEBUG: Redirecting (301) to <GET http://www.nefsak.com/15-17-Screen/> from <GET http://www.nefsak.com/home.php?cat=58>
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/> (referer: None)
2014-06-03 12:00:54-0700 [NefsakLaptopSpider] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.nefsak.com/15-17-Screen/>
2014-06-03 12:00:56-0700 [NefsakLaptopSpider] DEBUG: Crawled (200) <GET http://www.nefsak.com/15-17-Screen/?page=4> (referer: http://www.nefsak.com/15-17-Screen/)
More extracted links and more errors like the above follow; then the crawl finishes, unlike with the earlier log.

As you can see from the last line, the spider has actually gone and extracted a navigation page, all by itself! (There is navigation-extraction code in the project, but it never gets called: the debugger breakpoints are never reached.)

Unfortunately, I couldn't reproduce the error outside the project: a similar spider just works. Inside the project, it doesn't.

I'll provide more code if requested.

Thanks, and sorry for the long post.

Asked by MadeOfAir


1 Answer

Well, I had a URL class derived from the built-in str. It was coded like this:

import urlparse

class URL(str):

  def canonicalize(self, parentURL):
    parsed_self = urlparse.urlparse(self)
    if parsed_self.scheme:
      return self[:] # already absolute; slicing returns a plain string copy
    else:
      parsed_parent = urlparse.urlparse(parentURL)
      return urlparse.urljoin(parsed_parent.scheme + "://" + parsed_parent.netloc, self)

  def __str__(self):
    return "<URL : {0} >".format(self) # BUG: format() ends up calling __str__ again

The __str__ method caused infinite recursion whenever the object was printed or logged: format() calls __format__, which for a str subclass falls back to str(self), which calls __str__ again, and so on. The exception was somehow swallowed by Twisted, so the error only surfaced when the response was logged.
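
That swallowing is Twisted's doing: Scrapy's logging at the time went through twisted.python.log, which guards message formatting and falls back to progressively cruder strings instead of raising. A simplified sketch of the idea (modeled on twisted.python.log._safeFormat; not the verbatim source, and details vary by Twisted version):

def _safe_format(fmt, event_dict):
  try:
    return fmt % event_dict  # normal path: calls str() on each value
  except Exception:
    try:
      # first fallback: at least report the format string and the
      # event dict, which calls repr() on the dict's values...
      return ('Invalid format string or unformattable object '
              'in log message: %r, %s' % (fmt, event_dict))
    except Exception:
      # ...but formatting the dict hits the same broken __str__ again
      # (old Scrapy Requests delegate __repr__ to __str__), so this
      # fails too and the message is dropped entirely
      return ('UNFORMATTABLE OBJECT WRITTEN TO LOG '
              'with fmt %r, MESSAGE LOST' % (fmt,))

Once you see that, the fix is a one-line change: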

def __str__(self):
  return "<URL : " + self + " >" # plain concatenation never re-enters __str__; or use super(URL, self).__str__()

:-)

Answered by MadeOfAir


