How can I get an output in UTF-8 encoded unicode from Scrapy?

Tags:

scrapy

Bear with me. I'm writing every detail because so many parts of the toolchain do not handle Unicode gracefully and it's not clear what is failing.

PRELUDE

We first set up and use a recent Scrapy.

source ~/.scrapy_1.1.2/bin/activate

Since the terminal's default encoding is ASCII rather than UTF-8, we set the locale:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Since Python also defaults to ASCII for its standard I/O streams here, we override the I/O encoding:

export PYTHONIOENCODING="utf_8"

Now we're ready to start a Scrapy project.

scrapy startproject myproject
cd myproject
scrapy genspider dorf PLACEHOLDER

We're told we now have a spider.

Created spider 'dorf' using template 'basic' in module:
  myproject.spiders.dorf

We modify myproject/items.py to be:

# -*- coding: utf-8 -*-
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()

ATTEMPT 1

Now we write the spider, relying on urllib.unquote:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'en.sistercity.info']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = urllib.unquote(
            response.xpath('//title').extract_first().encode('ascii')
        ).decode('utf8')
        return item

Finally, we use a custom item exporter (from all the way back in Oct 2011):

# -*- coding: utf-8 -*-
import json
from scrapy.exporters import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

and add

FEED_EXPORTERS = {
    'json': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}

to myproject/settings.py.

When we run

~/myproject> scrapy crawl dorf -o dorf.json -t json

we get

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128)
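
A note on where this error likely originates: position 25 matches the index of ü in the extracted title string (the same title that appears in the Attempt 2 output), which suggests the exception is raised by the .encode('ascii') call in the spider rather than by the exporter. Below is a minimal sketch that reproduces only that step. It is written to run on Python 3 as well as Python 2, and the variable names are illustrative, not part of the project.

```python
# -*- coding: utf-8 -*-
# Title string as it appears in the Attempt 2 output, written with escapes
# so the snippet itself stays pure ASCII.
title = u"<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"

failing_index = None
failing_char = None
try:
    title.encode("ascii")  # the same kind of call the Attempt 1 spider makes
except UnicodeEncodeError as exc:
    failing_index = exc.start             # index of the first non-ASCII character
    failing_char = exc.object[exc.start]  # the character itself

print(failing_index, repr(failing_char))
```

If that is indeed the cause, encoding with 'utf8' instead of 'ascii' before unquoting avoids the exception, which is what the to_write helper in the second answer does.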

ATTEMPT 2

Another solution (the candidate solution for Scrapy 1.2?) is to use the spider

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'en.sistercity.info']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = response.xpath('//title')[0].extract()
        return item

and the custom item exporter

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)

with

FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}

in myproject/settings.py.

We get the following JSON file.

[
{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}
]

The non-ASCII characters are written as \uXXXX escape sequences instead of as UTF-8 encoded characters. Although this is a trivial problem for a couple of characters, it becomes a serious issue if the entire output is in a foreign language.
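
For reference, the two forms differ only in how the JSON encoder writes non-ASCII characters; both decode to the same text. Here is a small sketch using only the standard library json module (not Scrapy's exporter), with an illustrative record:

```python
# -*- coding: utf-8 -*-
import json

record = {"title": u"D\u00fcsseldorf"}

escaped = json.dumps(record)                  # default: ensure_ascii=True
raw = json.dumps(record, ensure_ascii=False)  # keeps the character itself

# Bytes that would end up in the file when the text is encoded as UTF-8.
escaped_bytes = escaped.encode("utf-8")
raw_bytes = raw.encode("utf-8")

# Both forms are valid JSON and decode to the same value.
same = json.loads(escaped) == json.loads(raw) == record
```

The escaped form is pure ASCII and therefore also valid UTF-8; it is just harder to read, which is the concern here.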

How can I get UTF-8 encoded output?

Calaf asked Nov 30 '22 23:11


2 Answers

In Scrapy 1.2+ there is a FEED_EXPORT_ENCODING option. When FEED_EXPORT_ENCODING = "utf-8", escaping of non-ASCII characters in JSON output is turned off.
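
In practice this is a one-line change in myproject/settings.py. A minimal sketch, assuming Scrapy 1.2 or later and the project layout from the question:

```python
# myproject/settings.py (fragment)

# Write feed exports as UTF-8; for JSON output this also turns off
# the escaping of non-ASCII characters.
FEED_EXPORT_ENCODING = "utf-8"
```

The same setting can also be passed for a single run with Scrapy's -s option, for example scrapy crawl dorf -o dorf.json -s FEED_EXPORT_ENCODING=utf-8. With it set, the custom exporter and the FEED_EXPORTERS override from Attempt 2 should no longer be needed.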

Mikhail Korobov answered Dec 25 '22 10:12


Please try this on your Attempt 1 and let me know if it works (I've tested it without setting all those environment variables):

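# -*- coding: utf-8 -*-
import urllib

import scrapy


class SimpleItem(scrapy.Item):
    # Assumed item definition; the original answer does not show it.
    title = scrapy.Field()
    url = scrapy.Field()


def to_write(uni_str):
    # Encode to UTF-8 bytes, undo percent-encoding, then decode back to unicode.
    return urllib.unquote(uni_str.encode('utf8')).decode('utf8')


class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        for i in range(2):
            item = SimpleItem()
            item['title'] = to_write(response.xpath('//title').extract_first())
            item['url'] = to_write(response.url)
            yield item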

The range(2) is only there to test the JSON exporter. To get a list of dicts, you can use this exporter instead:

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder

class UnicodeJsonLinesItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(ensure_ascii=False, **kwargs)
        self.first_item = True

Wilfredo answered Dec 25 '22 10:12