I need to parse various text sources and then print or store the results somewhere.
Every time a non-ASCII character is encountered, I can't print it correctly because it gets converted to bytes, and I have no idea how to view the correct characters.
(I'm quite new to Python; I come from PHP, where I never had any UTF-8 issues.)
The following is a code example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')
print(title)
file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()
I'd like to print the RSS title (BBC Japanese - ホーム) and write it to a file, but instead the result is this:
b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'
This happens both on screen and in the file. Is there a proper way to do this?
In Python 3, bytes and str are two different types. str represents any text string (including Unicode), and when you encode() one, you convert it from its str representation to its bytes representation in a specific encoding.
In your case, to get the decoded strings, you just need to remove the encode('utf-8') part (and the str() call around the title when writing):
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import feedparser
url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
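title = feeds['feed'].get('title')  # already a str; no encode() needed
print(title)
# codecs encodes the str to UTF-8 on write; the with block closes the file.
with codecs.open("test.txt", "w", encoding="utf-8") as f:
    f.write(title)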
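To see why the original code produced b'...' in the file, here is a minimal sketch of the str/bytes round trip. It hard-codes the title from the question instead of fetching the feed, and the variable names (encoded, as_repr, decoded) are only illustrative:

```python
title = "BBC Japanese - ホーム"    # str: a sequence of Unicode characters
encoded = title.encode("utf-8")    # bytes: the UTF-8 representation of that text

# str() on a bytes object returns its repr, not the decoded text.
# That is why the question's file contained the literal b'...' form.
as_repr = str(encoded)

# decode() reverses encode() and gives the original str back.
decoded = encoded.decode("utf-8")

print(type(title).__name__, type(encoded).__name__)
print(as_repr)
print(decoded == title)
```

So if a library ever hands you bytes, call decode() with the right encoding rather than str(); if it hands you a str, as feedparser does here, use it directly.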