
How to handle utf-8 text with Python 3?

I need to parse various text sources and then print / store it somewhere.

Every time a non-ASCII character is encountered, I can't print it correctly because it gets converted to bytes, and I have no idea how to view the correct characters.

(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)

The following is a code example:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')

print(title)

file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()

I'd like to print the RSS title (BBC Japanese - ホーム) and write it to a file, but instead the result is this:

b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'

This happens both on screen and in the file. Is there a proper way to do this?

Omiod asked Dec 15 '22 05:12



1 Answer

In Python 3, bytes and str are two different types. str represents text (any Unicode string), while bytes holds raw encoded data. When you call encode() on a str, you convert it from its str representation to its bytes representation in a specific encoding.
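To make the distinction concrete, here is a minimal sketch of the round trip between the two types. The title is hard-coded here as a stand-in for the value coming from the feed, and the variable names are only illustrative.

```python
# str holds Unicode text; bytes holds raw encoded data.
text = "BBC Japanese - ホーム"

# encode() turns a str into bytes for one specific encoding.
data = text.encode("utf-8")

# decode() reverses it, turning the bytes back into a str.
restored = data.decode("utf-8")

print(type(text).__name__, type(data).__name__)
print(data)      # shows the b'...' form with \x escapes
print(restored)  # shows the readable text again
```

Printing the bytes object is what produces the b'...\xe3\x83...' output from the question, while printing the str shows the characters themselves (provided the terminal can display them).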

In your case, to get the decoded string, you just need to remove the encode('utf-8') part:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
# the title is already a str here, so no encode() call is needed
title = feeds['feed'].get('title')

print(title)

# the with block closes the file automatically, even if write() fails
with codecs.open("test.txt", "w", encoding="utf-8") as file:
    file.write(title)
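As a side note, if you ever do end up holding bytes (for example from a socket or a file opened in binary mode), the sketch below shows why str(title) produced the b'...' text in the original code and how to get the real characters back. The raw value and the temporary file path are assumptions made for illustration only, and it uses the built-in open(), which accepts an encoding argument in Python 3.

```python
import os
import tempfile

# Stand-in for bytes received from somewhere else (assumed for illustration).
raw = "BBC Japanese - ホーム".encode("utf-8")

# str() on bytes without an encoding returns the repr, i.e. the literal b'...' text.
as_repr = str(raw)

# decode() (or str() with an explicit encoding) recovers the actual text.
as_text = raw.decode("utf-8")
also_text = str(raw, "utf-8")

# Write the text to a temporary file as UTF-8 and read it back.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(as_text)
with open(path, encoding="utf-8") as f:
    read_back = f.read()

print(as_repr)
print(read_back)
```

In the question's code, file.write(str(title)) wrote that repr string to the file, which is why the b'...' form showed up there as well as on screen.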
Dean Fenster answered Dec 19 '22 11:12
