
Printing a utf-8 encoded string

I'm using BeautifulSoup to extract some text from an HTML page, but I just can't figure out how to print it properly to the screen (or to a file, for that matter).

Here's what my class containing the text looks like:

    class Thread(object):
        def __init__(self, title, author, date, content = u""):
            self.title = title
            self.author = author
            self.date = date
            self.content = content
            self.replies = []

        def __unicode__(self):
            s = u""

            for k, v in self.__dict__.items():
                s += u"%s = %s " % (k, v)

            return s

        def __repr__(self):
            return repr(unicode(self))

        __str__ = __repr__

When I try to print an instance of Thread, here's what I see on the console:

    ~/python-tests $ python test.py
    u'date = 21:01 03/02/11 content =  author = \u05d3"\u05e8 \u05d9\u05d5\u05e0\u05d9 \u05e1\u05d8\u05d0\u05e0\u05e6\'\u05e1\u05e7\u05d5 replies = [] title = \u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df '

Whatever I try I cannot get the output I'd like (the above text should be Hebrew). My end goal is to serialize Thread to a file (using json or pickle) and be able to read it back.

I'm running this with Python 2.6.6 on Ubuntu 10.10.

daniel asked Mar 05 '11 10:03



1 Answer

To output a Unicode string to a file (or the console), you need to choose a text encoding. In Python 2 the default encoding is ASCII, which cannot represent Hebrew characters, so you need to use a different encoding, such as UTF-8:

    s = unicode(your_object).encode('utf8')
    f.write(s)
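As a rough illustration of the same idea applied to the stated goal of writing the data to a file and reading it back, here is a minimal sketch. The helper names `write_utf8` and `read_utf8`, the temporary file paths, and the plain dict standing in for a Thread's attributes are all assumptions made for this example and are not part of the original code. It is written to run on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile


def write_utf8(path, text):
    """Encode a string as UTF-8 and write the resulting bytes to path."""
    with io.open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_utf8(path):
    """Read raw bytes from path and decode them back into a Unicode string."""
    with io.open(path, "rb") as f:
        return f.read().decode("utf-8")


# The Hebrew title taken from the console output in the question.
title = u"\u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df"

# Hypothetical scratch location, used only for this example.
tmp_dir = tempfile.mkdtemp()

# Plain-text round trip: encode on the way out, decode on the way in.
text_path = os.path.join(tmp_dir, "thread.txt")
write_utf8(text_path, title)
restored_title = read_utf8(text_path)

# JSON round trip. A dict stands in for the Thread's attributes (assumption).
# By default json.dumps escapes non-ASCII characters as \uXXXX sequences,
# so the serialized text is pure ASCII and json.loads restores the Hebrew.
record = {u"title": title, u"replies": []}
json_path = os.path.join(tmp_dir, "thread.json")
write_utf8(json_path, json.dumps(record))
restored_record = json.loads(read_utf8(json_path))
```

Note that in the question's class `__str__` is bound to `__repr__`, which returns `repr(unicode(self))`, so `print` shows the escaped `u'...'` form. Encoding the result of `unicode(your_object)` explicitly, as above, sidesteps that.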
Mark Byers answered Oct 07 '22 07:10