I'm new to Python and Scrapy, and I'm following the dmoz tutorial. As a minor variant on the tutorial's suggested start URL, I chose a Japanese category from the dmoz sample site, and I noticed that the feed export I eventually get shows the escaped Unicode code points instead of the actual Japanese characters.
It seems like I need to use TextResponse somehow, but I'm not sure how to make my spider use that object instead of the base Response object.
Ultimately, I want the output to be, say,

オンラインショップ (the actual Japanese characters)

instead of the current

[u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'] (the escaped code points)
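As a sanity check, the escaped sequence does decode to exactly the characters I want; the brackets are just the repr of the one-element list I get back:

```python
# The escape sequence is the same text as the Japanese characters;
# the bracketed form is just the repr of a one-element list.
title = u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7'
print(title)  # オンラインショップ
```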
If you look at my screenshot, it corresponds to cell C7, one of the text titles.
Here's my spider (identical to the one in the tutorial, except for a different start_urls):
```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/World/Japanese/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
```
settings.py:

```python
FEED_URI = 'items.csv'
FEED_FORMAT = 'csv'
```
output screenshot: http://i55.tinypic.com/eplwlj.png (sorry I don't have enough SO points yet to post images)
When you scrape text from a page, it is stored as a Unicode string. What you want to do is encode it into a byte encoding such as UTF-8:

unicode_string.encode('utf-8')

Also, when you extract text with your selector, the result is stored in a list even if there is only one match, so you need to pick the first element.
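A minimal sketch of both fixes together (the list value here is hypothetical, standing in for what extract() returns for your title XPath):

```python
# Hypothetical extracted value: extract() returns a list of unicode
# strings, even when the XPath matches only one node.
extracted = [u'\u30aa\u30f3\u30e9\u30a4\u30f3\u30b7\u30e7\u30c3\u30d7']

title = extracted[0]             # take the first (and only) element
encoded = title.encode('utf-8')  # encode the unicode string to UTF-8 bytes
```

In your spider, that would mean something like item['title'] = site.select('a/text()').extract()[0].encode('utf-8') instead of storing the raw list.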