
MongoDB InvalidDocument: Cannot encode object


I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so the obvious conclusion was that the data is not in the right encoding. So now, before persisting an object, my MongoPipeline checks that every field encodes as strict UTF-8, and only then tries to persist the object to MongoDB. BUT I still get InvalidDocument exceptions, which is annoying.

This is the code of my MongoPipeline object that persists items to MongoDB:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')
            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)
                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item
```

And I still get the following exception:

```
au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
 'date': u'2012-06-27T23:21:25+00:00',
 'domain': 'reussir-sa-boite.fr',
 'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
 'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
Traceback (most recent call last):
  File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
    self.db[self.collection_name].insert(dict(item))
  File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
    gen(), check_keys, self.uuid_subtype, client)
InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
 'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
 'date': '2012-06-27T23:21:25+00:00'}
2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 252396,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}
```

Another funny thing: following the comment of @eLRuLL, I tried the following:

```python
>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
```

So my question is: if this text cannot be encoded, why doesn't the try/except in my MongoPipeline catch the exception? Only objects that raise no exception should ever be appended to item['comments'], shouldn't they?

Asked Nov 04 '15 by Codious-JR


1 Answer

Finally I figured it out. The problem was not with encoding; it was with the structure of the documents.

That's because I based my pipeline on the standard MongoPipeline example, which does not deal with nested Scrapy items.

What I am doing is:

```python
BlogItem:
    "url"
    ...
    comments = [CommentItem]
```

So my BlogItem has a list of CommentItems. Now here is where the problem came from; to persist the object in the database, I do:

```python
self.db[self.collection_name].insert(dict(item))
```

So here I am converting the BlogItem to a dict, but I am not converting the list of CommentItems inside it. And because the traceback displays the CommentItem rather like a dict, it did not occur to me that the problematic object is not a dict!
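The shallow behaviour of dict() is easy to reproduce with a minimal stand-in for a dict-like item (the Item class below is illustrative only, not Scrapy's real one):

```python
class Item(dict):
    """Minimal stand-in for a dict-like scrapy Item (illustrative only)."""
    pass

blog = Item(url='http://example.com',
            comments=[Item(author='A'), Item(author='B')])

flat = dict(blog)  # converts only the top level

print(type(flat))                 # <class 'dict'>
print(type(flat['comments'][0]))  # still Item: dict() does not recurse
```

The top-level object becomes a plain dict, but every element of the comments list keeps its original type, which is exactly what the BSON encoder then chokes on.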

So finally, the way to fix this problem is to change the line that appends the comment to the comments list, as such:

```python
item['comments'].append(dict(comment))
```

Now MongoDB considers it a valid document.
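An alternative to converting each comment at append time is a small recursive helper run once, just before the insert. This is a sketch under the assumption that every nested value is either a mapping, a list, or a BSON-encodable scalar (the to_plain name is mine, not a pymongo or Scrapy API):

```python
from collections.abc import Mapping  # in Python 2 this was collections.Mapping


def to_plain(value):
    """Recursively convert mapping-like objects (e.g. scrapy Items) to plain dicts."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, list):
        return [to_plain(elem) for elem in value]
    return value

# In the pipeline one could then write:
#     self.db[self.collection_name].insert(to_plain(item))
# which stays correct however deeply the items are nested.
```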

Lastly, for the last part, where I asked why I get an exception on the Python console but not in the script:

The reason is that on the Python 2 console I pasted the text as a byte str that was already UTF-8 encoded. Calling .encode('utf-8') on a byte str makes Python 2 first implicitly decode it with the default ascii codec, and that implicit decode is what raised the UnicodeDecodeError. In the spider, the fields arrived as unicode objects, so encoding them succeeded.
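The console errors can be reproduced in Python 3 terms, where the step Python 2 performed implicitly has to be written out: the pasted text is already UTF-8 bytes, and decoding those bytes as ascii is what fails.

```python
raw = b'Tellement vrai\xe2\x80\xa6'  # already UTF-8 encoded bytes

# Decoding with the right codec works:
text = raw.decode('utf-8')
print(text)  # Tellement vrai…

# Decoding as ascii fails on the first non-ASCII byte, exactly like the
# implicit ascii decode inside Python 2's str.encode('utf-8'):
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```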

Answered Oct 09 '22 by Codious-JR