We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:
0,6
7,14
15,20
Where 0,6 are the offsets corresponding to the word "lovely", 7,14 are the offsets corresponding to the word "weather", and 15,20 are the offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
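For illustration, the Python side does something along these lines (simplified; the function name and hard-coded inputs are placeholders for the real code, which reads the offsets from the Java program's output):
def extract_tokens(text, offset_lines):
    # each line from the Java program looks like "start,end"
    tokens = []
    for line in offset_lines:
        start, end = (int(part) for part in line.split(","))
        tokens.append(text[start:end])
    return tokens

print(extract_tokens("lovely weather today", ["0,6", "7,14", "15,20"]))
# ['lovely', 'weather', 'today']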
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel 🙂 today" (where "🙂" is a character outside the BMP), the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 "š "
10,15 "oday"
Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16-like way, and all string operations act on those UTF-16 code units. Python strings, on the other hand, operate on the actual Unicode characters (code points). So when a character falls outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
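The mismatch is easy to demonstrate in Python itself; here "🙂" simply stands in for any character outside the BMP:
s = "🙂"                                # one character outside the BMP
print(len(s))                           # 1 - Python counts code points
print(len(s.encode("utf-16-le")) // 2)  # 2 - Java counts UTF-16 code units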
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?
Since Python 3.3, the str type is a sequence of Unicode code points. Unicode characters have no byte representation by themselves; that is what a character encoding provides - a mapping from Unicode characters to bytes. A good practice is to decode your bytes with UTF-8 (or whatever encoding was used to create them) as soon as they are loaded from a file, run your processing on Unicode code points throughout your Python code, and only encode back to bytes with UTF-8 at the end when writing output. This is called the Unicode sandwich.
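As a minimal sketch of that pattern (the file names here are made up for the example):
# decode once at the input boundary...
with open("input.txt", "rb") as f:
    text = f.read().decode("utf-8")

# ...work on str (code points) in the middle...
words = text.split()

# ...and encode once on the way back out.
with open("output.txt", "wb") as f:
    f.write(" ".join(words).encode("utf-8"))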
You could convert the string to a bytearray in UTF16 encoding, then use the offsets (multiplied by 2 since there are two bytes per UTF-16 code-unit) to index that array:
x = "I feel š today"
y = bytearray(x, "UTF-16LE")
offsets = [(0,1),(2,6),(7,9),(10,15)]
for word in offsets:
print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))
Output:
I
feel
🙂
today
Alternatively, you could convert every Python character in the string individually to UTF-16 and count the number of code units it takes. This lets you map the indices in terms of code units (from Java) to indices in terms of Python characters:
from itertools import accumulate
x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)]  # from java program
# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((u, i) for i, u in enumerate(utf16indices))
# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]
# now you can just use those indices as normal
for word in offsets:
    print(x[word[0]:word[1]])
Output:
I
feel
🙂
today
The above code is messy and can probably be made clearer, but you get the idea.
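For instance, the same mapping could be wrapped in a small helper (the function and variable names below are my own, not anything the Java program dictates):
from itertools import accumulate

def utf16_offsets_to_python(text, utf16_offsets):
    # number of UTF-16 code units each Python character occupies (1 or 2)
    unit_lengths = (len(ch.encode("utf-16-le")) // 2 for ch in text)
    # utf16_index[i] == number of code units before character i
    utf16_index = [0] + list(accumulate(unit_lengths))
    index_map = {u: i for i, u in enumerate(utf16_index)}
    return [(index_map[start], index_map[end]) for start, end in utf16_offsets]

x = "I feel 🙂 today"
for start, end in utf16_offsets_to_python(x, [(0,1),(2,6),(7,9),(10,15)]):
    print(x[start:end])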