Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java modified UTF-8 strings in Python

Tags:

java

python

utf-8

I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF() which is not supported by Python (yet at least)

Can anybody point me in the right direction to construct Java modified UTF-8 strings in Python?

Update #1: To see a little more about the Java modified UTF-8, check out the readUTF() method from the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app which is using this modified UTF-8 format to read in strings via POST requests by calling DataInputStream.readUTF() (sorry for any confusion regarding normal Java UTF-8 string operation).

like image 991
QAZ Avatar asked Sep 08 '09 09:09

QAZ


People also ask

What is modified UTF-8?

Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented. All characters in the range \u0001 to \u007F are represented by a single byte, as follows: 0xxxxxxx.

What does encoding =' UTF-8 do in Python?

UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.

Are Java strings UTF-8?

String objects in Java are encoded in UTF-16. Java Platform is required to support other character encodings or charsets such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data.


1 Answers

You can ignore Modified UTF-8 Encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can just handle it like this,

  1. Convert the string into normal UTF-8 and stores bytes in a buffer.
  2. Write the 2-byte buffer length (not the string length) as binary in big-endian.
  3. Write the whole buffer.

I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

MUTF-8 is mainly used for JNI and other systems with null-terminated strings. The only difference from normal UTF-8 is how U+0000 is encoded. Normal UTF-8 use 1 byte encoding (0x00) and MUTF-8 uses 2 bytes (0xC0 0x80). First of all, you shouldn't have U+0000 (an invalid codepoint) in any Unicode text. Secondly, DataInputStream.readUTF() doesn't enforce the encoding so it happily accepts either one.

EDIT: The Python code should look like this,

def writeUTF(data, str):
    utf8 = str.encode('utf-8')
    length = len(utf8)
    data.append(struct.pack('!H', length))
    format = '!' + str(length) + 's'
    data.append(struct.pack(format, utf8))
like image 81
ZZ Coder Avatar answered Oct 12 '22 14:10

ZZ Coder