
Scrapy convert from unicode to utf-8

I've written a simple script to extract data from a site. The script works as expected, but I'm not pleased with the output format.
Here is my code

import urlparse

from scrapy import Spider, Request
from scrapy.loader import ItemLoader

from myproject.items import Article  # the item definition lives in items.py

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",  # trailing comma: without it this is a string, not a tuple
    )

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                          callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # extract the title of each article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()

The output looks something like this:

[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}

I think I need to use a custom ItemLoader class, but I don't know how. I need your help.

TL;DR I need to convert text scraped by Scrapy from Unicode to UTF-8.

asked Apr 29 '16 by GriMel

1 Answer

As you can see below, this isn't really a Scrapy issue but a Python one, and only marginally an issue at all :)

$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya

In [7]: print response.xpath('//h1/text()').extract_first()
 "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"

In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'

What you see is two different representations of the same thing - a unicode string.
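The same effect can be reproduced outside Scrapy. A minimal sketch (Python 3 syntax, where ascii() shows the escaped form that Python 2 prints as the repr of a unicode string):

```python
# One string, two representations: printing renders it for the terminal,
# while ascii() shows the escaped code points, like Scrapy's DEBUG output.
title = '\xa0"СВОБОДА ЗАКАНЧИВАЕТСЯ"'

print(title)         # human-readable Cyrillic text
print(ascii(title))  # '\xa0"\u0421\u0412\u041e...' - same string, escaped
```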

What I would suggest is running crawls with -L INFO or adding LOG_LEVEL='INFO' to your settings.py, so that this output isn't shown in the console.
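In settings.py that is a single line (the "Scraped from" messages are logged at DEBUG level, so raising the threshold hides them):

```python
# settings.py - only log INFO and above; the DEBUG "Scraped from ..." lines disappear
LOG_LEVEL = 'INFO'
```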

One annoying thing is that when you save as JSON, you get escaped-unicode JSON, e.g.

$ scrapy crawl example -L INFO -o a.jl

gives you:

$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}

This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.
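You can see both behaviours with the plain json module, independently of Scrapy (a minimal sketch; the sample title is shortened to one word):

```python
import json

item = {"title": "СВОБОДА"}

escaped = json.dumps(item)                      # ensure_ascii=True is the default
compact = json.dumps(item, ensure_ascii=False)  # keep the Cyrillic characters as-is

print(escaped)  # {"title": "\u0421\u0412\u041e\u0411\u041e\u0414\u0410"}
print(compact)  # {"title": "СВОБОДА"}

# Both are valid JSON and parse back to the same dict,
# but the escaped form is considerably longer.
assert json.loads(escaped) == json.loads(compact)
assert len(escaped) > len(compact)
```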

Adding a few lines in your settings.py can change this behaviour:

from scrapy.exporters import JsonLinesItemExporter
class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}

Essentially, we are just setting ensure_ascii=False for the default JSON item exporters. This prevents escaping. I wish there were an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments around here. Anyway, now your JSON file has:

$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}

which is better-looking, equally valid and more compact.

answered Oct 19 '22 by neverlastn