
Scrapy output feed international unicode characters (e.g. Japanese chars)

I'm a newbie to Python and Scrapy, and I'm following the dmoz tutorial. As a minor variant on the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site and noticed that the feed export I eventually get shows the Unicode numeric values instead of the actual Japanese characters.

It seems like I need to use TextResponse somehow, but I'm not sure how to make my spider use that object instead of the base Response object.

  1. How should I modify my code to show the Japanese chars in my output?
  2. How do I get rid of the square brackets, the single quotes, and the 'u' that's wrapping my output values?

Ultimately, I want to have an output of say

オンラインショップ (these are Japanese chars)

instead of the current output of

[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (the Unicode escape sequences)

If you look at my screenshot, it corresponds to cell C7, one of the text titles.

Here's my spider (identical to the one in the tutorial, except for different start_url):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/World/Japanese/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

settings.py:

FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'
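As an aside for later readers: newer Scrapy releases (1.2 and up) also expose a feed export encoding setting, so if your installed version supports it, a `settings.py` like the following should write the CSV as UTF-8 directly:

```python
# settings.py — assumes Scrapy 1.2+ for FEED_EXPORT_ENCODING
FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'

# Write feed output as UTF-8 text instead of escaped Unicode
FEED_EXPORT_ENCODING = 'utf-8'
```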

output screenshot: http://i55.tinypic.com/eplwlj.png (sorry I don't have enough SO points yet to post images)

asked May 31 '11 by fortuneRice

1 Answer

When you scrape text from the page, it is stored as Unicode.

What you want to do is encode it into something like UTF-8:

unicode_string.encode('utf-8')

Also, when you extract text with a selector, `extract()` returns a list even when there is only one match, so you need to pick the first element.
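Putting both fixes together — a minimal sketch that uses a plain Python list standing in for a real `extract()` result, since `extract()` returns a list of Unicode strings:

```python
# -*- coding: utf-8 -*-
# Simulated extract() result: a list holding one Unicode string
titles = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']

# 1. Take the first element — this removes the brackets, quotes and u'' wrapper
title = titles[0]

# 2. Encode the Unicode string as UTF-8 bytes for output
encoded = title.encode('utf-8')

print(title)  # オンラインショップ
```

In the spider itself this would look something like `item['title'] = site.select('a/text()').extract()[0].encode('utf-8')` — note that `extract()` can return an empty list when nothing matches, so in real code you'd want to check the length first.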

answered Oct 16 '22 by Acorn