Here is my spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from vrisko.items import VriskoItem

    class vriskoSpider(CrawlSpider):
        name = 'vrisko'
        allowed_domains = ['vrisko.gr']
        start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
        rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')), 'parse_start_url', follow=True),)

        def parse_start_url(self, response):
            hxs = HtmlXPathSelector(response)
            vriskoit = VriskoItem()
            vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
            vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
            return vriskoit
My problem is that the returned strings are Unicode and I want to encode them to UTF-8. I don't know which is the best way to do this; I have tried several approaches without result.
Thank you in advance!
Since Scrapy 1.2.0, a new setting, FEED_EXPORT_ENCODING, is available. By setting it to utf-8, JSON output will not be escaped. That is, add the following to your settings.py:

    FEED_EXPORT_ENCODING = 'utf-8'
Scrapy returns strings as Unicode, not ASCII. To encode all the strings to UTF-8, you can write:

    vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]
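To see what this list comprehension does in isolation, here is a minimal standalone sketch; the sample strings stand in for whatever `extract()` would return and are invented for illustration:

```python
# -*- coding: utf-8 -*-
# Hypothetical extracted values, standing in for
# hxs.select("//a[@itemprop='name']/text()").extract()
extracted = [u'Γιατρός', u'Κορδελιό']

# Encode each Unicode string to a UTF-8 byte string
encoded = [s.encode('utf-8') for s in extracted]

# Decoding an element back recovers the original text
assert encoded[0].decode('utf-8') == u'Γιατρός'
```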
But I think you expect a different result. Your code returns one item containing all the search results. To yield one item per result:

    hxs = HtmlXPathSelector(response)
    for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                                 hxs.select("//div[@class='results_address_class']/text()").extract()):
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = eponimia.encode('utf-8')
        vriskoit['address'] = address.encode('utf-8')
        yield vriskoit
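The zip pairing above can be illustrated on its own; the list contents here are invented for the example:

```python
# Hypothetical parallel lists, as the two extract() calls would return them
names = [u'Clinic A', u'Clinic B']
addresses = [u'Street 1', u'Street 2']

# zip pairs the i-th name with the i-th address, so each loop
# iteration builds one complete item instead of a single item
# holding every result
pairs = list(zip(names, addresses))
# pairs == [(u'Clinic A', u'Street 1'), (u'Clinic B', u'Street 2')]
```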
Update

The JSON exporter writes Unicode symbols escaped (e.g. \u03a4) by default, because not all streams can handle Unicode. There is an option to write them unescaped, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
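The effect of ensure_ascii can be seen directly with json.dumps:

```python
# -*- coding: utf-8 -*-
import json

item = {'eponimia': u'\u03a4'}  # Greek capital tau

# Default behaviour: non-ASCII characters are escaped
escaped = json.dumps(item)
# escaped == '{"eponimia": "\\u03a4"}'

# With ensure_ascii=False the character is written as-is
unescaped = json.dumps(item, ensure_ascii=False)
# unescaped == u'{"eponimia": "Τ"}'
```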
So if you want the exported items written in UTF-8 encoding, e.g. to read them in a text editor, you can write a custom item pipeline.
pipelines.py:

    import json
    import codecs

    class JsonWithEncodingPipeline(object):

        def __init__(self):
            self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item

        def spider_closed(self, spider):
            self.file.close()
Don't forget to add this pipeline to settings.py:
ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
You can customize the pipeline to write data in a more human-readable format, e.g. to generate some formatted report. JsonWithEncodingPipeline is just a basic example.
Try adding the following line to the config file for Scrapy (i.e. settings.py):
    FEED_EXPORT_ENCODING = 'utf-8'
I had a lot of problems due to encoding with Python and Scrapy. To be sure to avoid any encoding/decoding problems, the best thing to do is to write:

    response.body.decode(response.encoding).encode('utf-8')
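Here is a standalone sketch of this decode-then-encode round trip; the byte string and codec are invented stand-ins for response.body and response.encoding:

```python
# -*- coding: utf-8 -*-
# Pretend this is response.body from a Greek page served as
# ISO-8859-7 (both values are made up for the example)
body = u'γιατρός'.encode('iso-8859-7')
encoding = 'iso-8859-7'

# Decode the raw bytes with the page's declared encoding,
# then re-encode the resulting Unicode text as UTF-8
utf8_body = body.decode(encoding).encode('utf-8')

assert utf8_body.decode('utf-8') == u'γιατρός'
```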