
Scrapy: how to write JSON output in UTF-8 encoding

I've written the following code to scrape data from a site.

import scrapy
from porua_scrapper.items import Category
from porua_scrapper.config import SITE_URL


class CategoriesSpider(scrapy.Spider):
    name = "categories"
    start_urls = []
    # Build the URLs for pages 1 and 2 of the category listing.
    for page in range(1, 3):
        url = SITE_URL + "book/categories?page=" + str(page)
        start_urls.append(url)

    print(start_urls)


    def parse(self, response):
        # print(response.css('ul.categoryList li div.pFIrstCatCaroItem a').extract_first())

        for category in response.css('ul.categoryList li'):
            categoryObj = Category()

            categoryObj['name'] = category.css('div.bookSubjectCaption h2::text').extract_first()
            categoryObj['url'] = category.css('a::attr(href)').extract_first()

            yield categoryObj

When I run the command scrapy crawl categories -o categories.json, it creates a categories.json file in the desired output format. The problem is that some of my content contains Bengali text, so the generated output file contains entries like:

{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}

How am I supposed to encode the content in UTF-8? As I'm new to Scrapy, I haven't managed to find a suitable solution for my scenario.

Thanks in advance!

Emu asked Jan 04 '17 07:01


3 Answers

First of all, {"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"} is valid JSON data:

>>> import json
>>> d = json.loads(r'''{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}''')
>>> print(d['name'])
নাটক

and any program interpreting this data should understand (i.e. decode) the characters just fine. Python's json module controls this escaping through its ensure_ascii parameter:

If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only.

This is what Scrapy's feed exporter uses by default for JSON output.
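To make the difference concrete, here is a minimal sketch using only the standard json module (not Scrapy itself). The item dictionary is a stand-in for one scraped category, reusing the values from the question.

```python
import json

# Stand-in for one scraped item, using the values from the question.
item = {"url": "/book/category/271/নাটক", "name": "নাটক"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the Bengali characters as they are.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == item
```

So the escaped form loses no data; it is only harder for a human to read.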

But if you need the output JSON file to use another encoding, such as UTF-8, you can use Scrapy's FEED_EXPORT_ENCODING setting.

FEED_EXPORT_ENCODING = 'utf-8'

paul trmbrth answered Oct 10 '22 01:10


In settings.py, add the following line: FEED_EXPORT_ENCODING = 'utf-8'
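As a rough sketch, the relevant part of settings.py could look like the fragment below. The FEEDS entry is an optional per-feed alternative and assumes Scrapy 2.1 or later, where the FEEDS setting is available; the file name categories.json is simply the one used in the question.

```python
# settings.py (fragment, a sketch rather than a complete settings file)

# Project-wide: write feed exports as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"

# Optional per-feed alternative (assumption: Scrapy 2.1 or later).
# With this in place the feed is written without passing -o on the command line.
FEEDS = {
    "categories.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```

Either setting on its own should be enough; the project-wide one is the simpler choice when every feed should be UTF-8.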

tae ha answered Oct 10 '22 02:10


To set this from the command line instead, pass the option "--set FEED_EXPORT_ENCODING=utf-8":

scrapy runspider --set FEED_EXPORT_ENCODING=utf-8 .\TheScrapyScript.py -o TheOutputFile.json
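If a file with \uXXXX escapes has already been generated and re-running the crawl is not convenient, it can also be rewritten afterwards. This is a minimal sketch using only the standard library; the function name rewrite_json_as_utf8 and the output file name are hypothetical, and it assumes the whole file fits in memory.

```python
import json
from pathlib import Path


def rewrite_json_as_utf8(src, dst):
    """Load a JSON file and write it back with non-ASCII characters unescaped."""
    # The escaped file is pure ASCII, so reading it as UTF-8 is safe.
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    # ensure_ascii=False keeps characters such as Bengali text readable.
    Path(dst).write_text(
        json.dumps(data, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return data


# Example call (file names are placeholders):
# rewrite_json_as_utf8("categories.json", "categories_utf8.json")
```

The data is unchanged by this step; only its on-disk representation differs.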

Thiago Dias answered Oct 10 '22 02:10