Keeping Java String Offsets With Unicode Consistent in Python

We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.

As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:

0,6
7,14
15,20

Where 0,6 are the offsets corresponding to the word "lovely", 7,14 to the word "weather", and 15,20 to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.

All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.

For example, given the string "I feel šŸ™‚ today", the Java program will output:

0,1
2,6
7,9
10,15

On the Python side, these translate to:

0,1    "I"
2,6    "feel"
7,9    "šŸ™‚ "
10,15  "oday"

Where the last index is technically invalid. Java sees "šŸ™‚" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.

Presumably this occurs because Java encodes strings internally as UTF-16, and all string operations act on UTF-16 code units. Python 3 strings, on the other hand, operate on actual Unicode characters (code points). So when a character falls outside the BMP, Java sees it as length 2, whereas Python sees it as length 1.
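The difference is easy to see from the Python side (a quick check, not from the original question):

```python
# Python's len() counts Unicode code points; Java's String.length()
# counts UTF-16 code units. Encoding to UTF-16 exposes the difference.
emoji = "\U0001F642"  # šŸ™‚, a character outside the BMP

code_points = len(emoji)                           # what Python counts
utf16_units = len(emoji.encode("utf-16-le")) // 2  # what Java counts

print(code_points, utf16_units)  # 1 2
```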

So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?

NanoWizard asked May 23 '19 17:05



1 Answer

You could convert the string to a bytearray in UTF-16 encoding, then use the offsets (multiplied by 2, since each UTF-16 code unit is two bytes) to index that array:

x = "I feel šŸ™‚ today"
y = bytearray(x, "UTF-16LE")

offsets = [(0,1),(2,6),(7,9),(10,15)]

for start, end in offsets:
  print(str(y[start*2:end*2], "UTF-16LE"))

Output:

I
feel
šŸ™‚
today

Alternatively, you could encode each Python character in the string individually to UTF-16 and count how many code units it takes. That lets you build a map from indices in terms of code units (what Java reports) to indices in terms of Python characters:

from itertools import accumulate

x = "I feel šŸ™‚ today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program

# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE")) // 2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf-16 indices to python indices
index_map = {u16: i for i, u16 in enumerate(utf16indices)}

# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]

# now you can just use those indices as normal
for word in offsets:
  print(x[word[0]:word[1]])

Output:

I
feel
šŸ™‚
today

The above code is messy and can probably be made clearer, but you get the idea.
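Wrapping the mapping in a function is one way to tidy it up; here is a possible sketch (the helper name `utf16_to_python_offsets` is mine, not from the answer, and it assumes every offset falls on a code-point boundary):

```python
from itertools import accumulate

def utf16_to_python_offsets(text, utf16_offsets):
    """Map (start, end) offsets given in UTF-16 code units to Python
    string indices. Assumes offsets land on code-point boundaries."""
    # UTF-16 code units consumed by each Python character.
    lengths = (len(ch.encode("utf-16-le")) // 2 for ch in text)
    # Cumulative code-unit position before each character (plus the end),
    # reversed into a map from UTF-16 index to Python index.
    boundaries = {u16: i for i, u16 in enumerate([0, *accumulate(lengths)])}
    return [(boundaries[s], boundaries[e]) for s, e in utf16_offsets]

x = "I feel šŸ™‚ today"
for start, end in utf16_to_python_offsets(x, [(0,1), (2,6), (7,9), (10,15)]):
    print(x[start:end])
```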

Blorgbeard answered Oct 29 '22 13:10