I've written the following code to scrape data from a site.
import scrapy

from porua_scrapper.items import Category
from porua_scrapper.config import SITE_URL


class CategoriesSpider(scrapy.Spider):
    name = "categories"
    start_urls = []
    for i in range(2):
        url = SITE_URL + "book/categories?page=" + str(i + 1)
        start_urls.append(url)
    print(start_urls)

    def parse(self, response):
        # print(response.css('ul.categoryList li div.pFIrstCatCaroItem a').extract_first())
        for category in response.css('ul.categoryList li'):
            categoryObj = Category()
            categoryObj['name'] = category.css('div.bookSubjectCaption h2::text').extract_first()
            categoryObj['url'] = category.css('a::attr(href)').extract_first()
            yield categoryObj
When I run the command scrapy crawl categories -o categories.json, it creates a categories.json file in the desired output format. The problem is that some of my content contains Bengali text, so in the generated output file I get entries like:
{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}
How am I supposed to encode the content in UTF-8? As I'm new to Scrapy, I couldn't find a suitable solution for my scenario.
Thanks in advance!
First of all, {"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}
is valid JSON data
>>> import json
>>> d = json.loads('''{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}''')
>>> print(d['name'])
নাটক
and any program interpreting this data should understand (i.e. decode) the characters just fine. Python's json module controls this escaping through its ensure_ascii parameter:
If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only.
This is the behaviour Scrapy's feed exporter uses by default for JSON output.
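To make the effect of ensure_ascii concrete, here is a minimal sketch using only the standard json module (it does not involve Scrapy). The item dictionary is copied from the output in the question, and the variable names are illustrative assumptions.

```python
import json

# Sample item copied from the output shown in the question.
item = {"url": "/book/category/271/নাটক", "name": "নাটক"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
# The bytes that would end up in a UTF-8 encoded file.
print(readable.encode("utf-8"))

# Both forms are valid JSON and decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == item
```

Either form is equally valid for any JSON consumer; the unescaped one is simply easier for a person to read in a text editor.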
But if you need the output JSON file to use another encoding, such as UTF-8, you can use Scrapy's FEED_EXPORT_ENCODING
setting.
FEED_EXPORT_ENCODING = 'utf-8'
In settings.py, add the following line:
FEED_EXPORT_ENCODING = 'utf-8'
To set it from the command line instead, use the option "--set FEED_EXPORT_ENCODING=utf-8":
scrapy runspider --set FEED_EXPORT_ENCODING=utf-8 .\TheScrapyScript.py -o TheOutputFile.json
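If a file has already been exported with escaped characters, re-running the crawl is not strictly necessary. The sketch below is an alternative not taken from the answers above: it assumes the export is a single valid JSON document and rewrites it as UTF-8 with the standard library. The function name reencode_json_export and the file names are hypothetical, and the demo runs on a throwaway file rather than a real Scrapy export.

```python
import json
import tempfile
from pathlib import Path


def reencode_json_export(src, dst):
    """Rewrite a JSON file so non-ASCII text is stored directly as UTF-8."""
    # Escaped JSON is plain ASCII, so reading it as UTF-8 is safe.
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    # ensure_ascii=False stops json from escaping non-ASCII characters again.
    Path(dst).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Demo on a temporary file that mimics an escaped export (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "categories.json"
    dst = Path(tmp) / "categories_utf8.json"
    src.write_text('[{"name": "\\u09a8\\u09be\\u099f\\u0995"}]', encoding="ascii")
    reencode_json_export(src, dst)
    converted = dst.read_text(encoding="utf-8")
```

After the call, converted holds the same data with the Bengali text written out directly instead of as escape sequences.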