I'm using BeautifulSoup to extract some text from an HTML page, but I just can't figure out how to print it properly to the screen (or to a file, for that matter).
Here's what my class containing the text looks like:
    class Thread(object):
        def __init__(self, title, author, date, content = u""):
            self.title = title
            self.author = author
            self.date = date
            self.content = content
            self.replies = []

        def __unicode__(self):
            s = u""
            for k, v in self.__dict__.items():
                s += u"%s = %s " % (k, v)
            return s

        def __repr__(self):
            return repr(unicode(self))

        __str__ = __repr__
When I try to print an instance of Thread, here's what I see on the console:
    ~/python-tests $ python test.py
    u'date = 21:01 03/02/11 content = author = \u05d3"\u05e8 \u05d9\u05d5\u05e0\u05d9 \u05e1\u05d8\u05d0\u05e0\u05e6\'\u05e1\u05e7\u05d5 replies = [] title = \u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df '
Whatever I try, I cannot get the output I'd like (the above text should be Hebrew). My end goal is to serialize Thread to a file (using json or pickle) and be able to read it back.
I'm running this with Python 2.6.6 on Ubuntu 10.10.
In Java, a String is converted to UTF-8 with the getBytes() method, which encodes the String into a sequence of bytes and returns a byte array. Its charsetName argument names the charset used to encode the String into the array of bytes.
UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names: in UTF-8, the smallest binary representation of a character is one byte, or eight bits, while in UTF-16 it is two bytes, or sixteen bits.
UTF-8 is an 8-bit variable-width encoding. The first 128 Unicode characters, when encoded in UTF-8, have the same representation as the corresponding characters in ASCII.
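The byte counts described above can be checked directly. The following is a minimal sketch, not part of the original answer; the sample characters and the `encoded_lengths` helper are chosen only for illustration, and the code is written to run on both Python 2.6+ and Python 3.3+.

```python
# Sample characters from different Unicode ranges (illustrative choices).
samples = [
    (u"A", "Latin capital A (ASCII)"),
    (u"\u05d3", "Hebrew letter dalet"),
    (u"\u20ac", "euro sign"),
    (u"\U0001F600", "emoji outside the Basic Multilingual Plane"),
]


def encoded_lengths(ch):
    """Return (utf8_len, utf16_len) in bytes for a single character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, label in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print("%-45s UTF-8: %d  UTF-16: %d" % (label, utf8_len, utf16_len))

# ASCII-only text produces identical bytes in ASCII and UTF-8.
assert u"hello".encode("ascii") == u"hello".encode("utf-8")
```

The Hebrew letters in the question fall in the two-byte UTF-8 range, which is why they cannot be written with the one-byte ASCII codec.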
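To output a Unicode string to a file (or the console) you need to choose a text encoding. The \uXXXX escapes in your output appear because __str__ is aliased to __repr__, which returns repr() of the Unicode string. In Python 2 the default encoding is ASCII, which cannot represent Hebrew characters, so you need to use a different encoding, such as UTF-8: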
    s = unicode(your_object).encode('utf8')
    f.write(s)
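For the stated end goal of writing the data to a file and reading it back, one possible approach is to let `io.open` do the encoding and to store the fields as JSON. This is only a sketch under assumptions: `thread_data`, `save_thread`, `load_thread`, and the temporary file path are hypothetical names, and the dict merely stands in for the attributes of a Thread instance. It is written to run on both Python 2.6+ and Python 3.

```python
import io
import json
import os
import tempfile

# Hypothetical stand-in for the fields of a Thread instance.
thread_data = {
    u"title": u"\u05de\u05d1\u05e0\u05d4 \u05d4\u05de\u05d1\u05d7\u05df",
    u"author": u"\u05d3\"\u05e8 \u05d9\u05d5\u05e0\u05d9",
    u"date": u"21:01 03/02/11",
    u"content": u"",
    u"replies": [],
}


def save_thread(data, path):
    """Write the dict to `path` as UTF-8 encoded JSON."""
    # ensure_ascii=False keeps the Hebrew letters instead of \uXXXX escapes.
    text = json.dumps(data, ensure_ascii=False)
    # On Python 2, json.dumps can return a byte string; normalize to Unicode.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # io.open encodes the Unicode text to UTF-8 while writing.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_thread(path):
    """Read the UTF-8 encoded JSON file back into a dict."""
    with io.open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read())


path = os.path.join(tempfile.mkdtemp(), "thread.json")
save_thread(thread_data, path)
restored = load_thread(path)
assert restored == thread_data
```

Opening the file with an explicit encoding avoids the manual `.encode('utf8')` call before each write and makes the round trip symmetric, since the same `io.open` call decodes the bytes when reading.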