
scrapy text encoding

Tags:

scrapy

Here is my spider

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')),'parse_start_url',follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are Unicode, and I want to encode them to UTF-8. I don't know the best way to do this. I've tried several approaches without success.

Thank you in advance!

asked Feb 07 '12 by mindcast

4 Answers

Since Scrapy 1.2.0 there is a new setting, FEED_EXPORT_ENCODING. Setting it to utf-8 makes the JSON feed exporter write non-ASCII characters literally instead of as \uXXXX escapes.

Just add the following to your settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
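To see what "escaped" means here, a quick sketch with plain json.dumps (which is what the JSON feed exporter uses under the hood; the Greek sample item is made up):

```python
import json

item = {"eponimia": "Γιατρός"}  # hypothetical scraped item with Greek text

# Default behaviour: non-ASCII characters come out as \uXXXX escapes.
print(json.dumps(item))
# {"eponimia": "\u0393\u03b9\u03b1\u03c4\u03c1\u03cc\u03c2"}

# What FEED_EXPORT_ENCODING = 'utf-8' gives you: literal UTF-8 characters.
print(json.dumps(item, ensure_ascii=False))
# {"eponimia": "Γιατρός"}
```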
answered Nov 04 '22 by Lacek


Scrapy returns strings as Unicode, not ASCII. To encode all strings to UTF-8, you can write:

vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]

But I think you expect a different result: your code returns a single item containing all search results. To yield one item per result:

hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit

Update

By default, the JSON exporter writes Unicode symbols escaped (e.g. \u03a4), because not all streams can handle Unicode. json.dumps has an option to write them unescaped, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.

So if you want the exported items written in UTF-8 encoding, e.g. so you can read them in a text editor, you can write a custom item pipeline.

pipelines.py:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):  # the item-pipeline hook is close_spider, not spider_closed
        self.file.close()

Don't forget to add this pipeline to settings.py:

ITEM_PIPELINES = {'vrisko.pipelines.JsonWithEncodingPipeline': 300}

You can customize the pipeline to write data in a more human-readable format, e.g. to generate a formatted report. JsonWithEncodingPipeline is just a basic example.
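Since the pipeline uses only the standard library, you can sanity-check it outside Scrapy by feeding it a plain dict as a stand-in for a scraped item. This is a compact copy of the pipeline above, with the hook named close_spider (what Scrapy's item-pipeline API actually calls) and the output path parameterised so the demo doesn't write into the working directory:

```python
import codecs
import json
import os
import tempfile

class JsonWithEncodingPipeline(object):
    def __init__(self, path='scraped_data_utf8.json'):
        self.file = codecs.open(path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one line of unescaped UTF-8 JSON.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

path = os.path.join(tempfile.gettempdir(), 'scraped_data_utf8.json')
pipeline = JsonWithEncodingPipeline(path)
# The spider argument is unused here, so None is fine for the demo.
pipeline.process_item({'eponimia': 'Γιατρός', 'address': 'Κορδελιό'}, spider=None)
pipeline.close_spider(None)

with codecs.open(path, encoding='utf-8') as f:
    print(f.read())
# {"eponimia": "Γιατρός", "address": "Κορδελιό"}
```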

answered Nov 04 '22 by reclosedev


Try adding the following line to the config file for Scrapy (i.e. settings.py):

FEED_EXPORT_ENCODING = 'utf-8'
answered Nov 04 '22 by FreeCat


I had a lot of problems due to encoding with Python and Scrapy. To be sure to avoid any encoding/decoding problems, the best thing to do is to write:

response.body.decode(response.encoding).encode('utf-8')
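The round trip is just decode-then-encode: response.body.decode(response.encoding) already yields a Unicode string, so no extra unicode() wrapper is needed. A minimal sketch, where the ISO-8859-7 body is a made-up stand-in for whatever encoding the site declares:

```python
# Hypothetical raw response body in the site's declared encoding
# (standing in for response.body / response.encoding).
body = "Γιατρός".encode("iso-8859-7")

text = body.decode("iso-8859-7")  # bytes -> str: already Unicode at this point
utf8 = text.encode("utf-8")       # str -> UTF-8 bytes for storage

print(text)                  # Γιατρός
print(utf8.decode("utf-8"))  # Γιατρός
```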
answered Nov 04 '22 by mikeulkeul