Test whether a string is Unicode, which UTF encoding it uses, and get its length in bytes?

I need to test whether a string is Unicode, and then whether it's UTF-8. After that, I need to get the string's length in bytes, including the BOM if it ever uses one. How can this be done in Python?

Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.

Later edit: pprint does that pretty well.

How many bytes is a Unicode character?

Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32, built on 8-bit, 16-bit, and 32-bit code units respectively. The number of bytes per character therefore depends on the encoding form; in UTF-16, for example, most commonly used characters take a single 16-bit code unit (2 bytes). Code points themselves are usually written as U+hhhh, where hhhh is the hexadecimal value of the code point.

How do I test Unicode?

To test whether a program is fully Unicode compliant, feed it text that mixes languages written in different directions and characters with diacritics, for example Persian text. Also try decomposed characters, such as {e, U+0301} (the decomposed form of é, U+00E9).
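The decomposed-character case can be tried directly in Python. The following is a minimal sketch, assuming Python 3 and the standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as the single code point U+00E9
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalizing to NFC composes; normalizing to NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# Their UTF-8 byte lengths differ as well.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3
```

A program that compares or measures text without normalizing first will treat these two forms as different strings.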

How many bytes is a UTF-16 character?

UTF-16 is based on 16-bit code units. Each character is therefore encoded as either one code unit (16 bits, 2 bytes) or two code units, called a surrogate pair (32 bits, 4 bytes).
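To see the one-unit and two-unit cases, here is a small sketch, assuming Python 3. `utf16_code_units` is a hypothetical helper, and it uses the `utf-16-le` codec because the plain `utf-16` codec prepends a 2-byte BOM that would inflate the count.

```python
def utf16_code_units(ch):
    """Return how many 16-bit code units UTF-16 needs for ch (hypothetical helper)."""
    # "utf-16-le" writes no BOM, so only the character's own bytes are counted.
    return len(ch.encode("utf-16-le")) // 2

for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    units = utf16_code_units(ch)
    print("U+%04X: %d code unit(s), %d bytes" % (ord(ch), units, units * 2))
```

Only the last sample (U+1F600, outside the Basic Multilingual Plane) should need two code units.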

How many bytes is a string in UTF-8?

UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
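The 1 to 4 byte range follows from the code point value. As a rough sketch, assuming Python 3 (`utf8_length_for` is a hypothetical helper, not a library function), the expected length can be computed from the code point and checked against what `str.encode` produces:

```python
def utf8_length_for(code_point):
    """Return the number of bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point < 0x80:
        return 1      # ASCII range
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4          # supplementary planes, up to U+10FFFF

for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    actual = len(ch.encode("utf-8"))
    print("U+%04X: %d byte(s)" % (ord(ch), actual))
    assert actual == utf8_length_for(ord(ch))
```

The total byte length of a string is the sum over its characters, which is what `len(text.encode("utf-8"))` returns.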

try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"

In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
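For comparison, here is a rough Python 3 counterpart of the check above, which also shows what the bytes of a UTF-8 string look like as a list. It is a sketch under the assumption that the input is a `bytes` object; `describe_utf8` is a hypothetical helper name.

```python
def describe_utf8(data):
    """Describe data (bytes) if it is valid UTF-8, else return None (hypothetical helper)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "valid UTF-8: %d characters, %d bytes" % (len(text), len(data))

encoded = "\u00e9".encode("utf-8")   # the character é
print(encoded)                       # b'\xc3\xa9'
print(list(encoded))                 # [195, 169]
print(describe_utf8(encoded))        # valid UTF-8: 1 characters, 2 bytes
print(describe_utf8(b"\xe9"))        # None, a lone 0xE9 byte is not valid UTF-8
```

In Python 3, iterating over a `bytes` object yields integers, so `list(encoded)` is the byte list representation asked about in the question.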

To Check if Unicode

>>> a = u'F'
>>> isinstance(a, unicode)
True

To Check if it is UTF-8 or ASCII

>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
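chardet is a third-party package that guesses an encoding. If the only question is whether some bytes are valid ASCII or valid UTF-8, a stdlib-only check may be enough. The sketch below assumes Python 3; `classify_bytes` is a hypothetical helper, and it validates rather than detects, so bytes in another encoding can still happen to pass as UTF-8.

```python
def classify_bytes(data):
    """Return 'ascii', 'utf-8', or 'unknown' for a bytes object (hypothetical helper)."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(classify_bytes(b"AA"))                      # ascii
print(classify_bytes("\u00e9".encode("utf-8")))   # utf-8
print(classify_bytes(b"\xff\xfe"))                # unknown
```

ASCII is tried first because every ASCII byte sequence is also valid UTF-8.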

I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.

For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:

print len(my_unicode_string.encode('utf-8'))

Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.
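As for the BOM part of the question, here is a minimal Python 3 sketch. It assumes the standard `utf-8-sig` codec, which prepends the 3-byte UTF-8 BOM when encoding, and `codecs.BOM_UTF8` for detection; the two function names are hypothetical.

```python
import codecs

def utf8_byte_length(text, with_bom=False):
    """Return the UTF-8 byte length of text, optionally counting a BOM (hypothetical helper)."""
    codec = "utf-8-sig" if with_bom else "utf-8"
    return len(text.encode(codec))

def has_utf8_bom(data):
    """Return True if the bytes start with the UTF-8 BOM (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)

print(utf8_byte_length("\u00e9"))                   # 2
print(utf8_byte_length("\u00e9", with_bom=True))    # 5
print(has_utf8_bom("\u00e9".encode("utf-8-sig")))   # True
print(has_utf8_bom("\u00e9".encode("utf-8")))       # False
```

If the data is already a byte string, its length in bytes is simply `len(data)`, with any BOM it contains already included.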

Tags: python, string, unicode, utf-8, python-2.5

Eduard Florinescu

ecatmur

Rakesh

thebjorn
