scrapy text encoding



Here is my spider

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')),'parse_start_url',follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are unicode and I want to encode them to utf-8. I don't know the best way to do this. I tried several approaches without success.

Thank you in advance!

Scrapy 1.2.0 introduced a new setting, FEED_EXPORT_ENCODING. When it is set to utf-8, the JSON output is written with non-ASCII characters unescaped.

That is, set FEED_EXPORT_ENCODING = 'utf-8' in your settings.py.
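As a minimal sketch, assuming a standard Scrapy project layout where settings.py sits in the project package, the line would look like this:

```python
# settings.py (sketch): ask the feed exporters to write UTF-8
# instead of ASCII-escaped output. Requires Scrapy 1.2.0 or later.
FEED_EXPORT_ENCODING = 'utf-8'
```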

Scrapy returns strings in unicode, not ascii. To encode all strings to utf-8, you can write:

vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]
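To see what that encode call does in isolation, here is a small sketch without Scrapy. It assumes Python 3, where the unicode type from this thread is simply str and encoding produces a bytes object; the sample word is made up for illustration.

```python
# Hypothetical sample value standing in for one extracted string.
text = "γιατρός"

# Encoding turns the text into UTF-8 bytes.
encoded = text.encode("utf-8")

# Decoding with the same codec gives the original text back.
decoded = encoded.decode("utf-8")

print(type(encoded).__name__)
print(decoded == text)
```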

But I think you expect a different result. Your code returns one item containing all the search results. To return one item per result:

hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit
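The pairing done by zip above can be sketched on plain lists. The names and addresses below are invented placeholders, and plain dicts stand in for VriskoItem. One thing to keep in mind: zip stops at the shorter list, so if a result has no address the later pairs may shift or be dropped.

```python
# Hypothetical extracted values; in the spider these come from the two XPath queries.
names = ["Name A", "Name B", "Name C"]
addresses = ["Address 1", "Address 2"]

# One dict per result, pairing the i-th name with the i-th address.
items = [{"eponimia": n, "address": a} for n, a in zip(names, addresses)]

for item in items:
    print(item)
```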


The JSON exporter writes unicode characters escaped (e.g. \u03a4) by default, because not all streams can handle unicode. json.dumps has an option to write them unescaped, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.

So if you want exported items written in utf-8, e.g. to read them in a text editor, you can write a custom item pipeline.
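The difference that ensure_ascii makes can be checked with the standard json module alone. This is a small sketch assuming Python 3; the item dict is a made-up example.

```python
import json

# Hypothetical item with a single Greek capital tau (U+03A4).
item = {"eponimia": "\u03a4"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the character is written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```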


import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

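    def process_item(self, item, spider):
        # Serialize without ASCII escaping so non-ASCII text stays readable.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on pipelines when the spider finishes.
        self.file.close()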

Don't forget to add this pipeline to settings.py:

 ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
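Note that newer Scrapy versions expect ITEM_PIPELINES to be a dict that maps each pipeline path to an integer order rather than a list. A sketch of the equivalent setting follows; the value 300 is an arbitrary choice for illustration, not something from the original answer.

```python
# settings.py (sketch): dict form of ITEM_PIPELINES.
# Lower numbers run earlier; 300 is an arbitrary illustrative value.
ITEM_PIPELINES = {
    'vrisko.pipelines.JsonWithEncodingPipeline': 300,
}
```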

You can customize the pipeline to write data in a more human-readable format, e.g. to generate a formatted report. JsonWithEncodingPipeline is just a basic example.

Try adding the following line to the config file for Scrapy (i.e. settings.py):

I had a lot of problems with encoding in Python and Scrapy. To be sure of avoiding encoding and decoding problems, the best thing to do is to write:

