We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:
0,6
7,14
15,20
Where 0,6 are the offsets corresponding to the word "lovely", 7,14 are the offsets corresponding to the word "weather", and 15,20 are the offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
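For illustration, the Python side does something along these lines (simplified; the function name and hard-coded inputs are placeholders for the real code, which reads the offsets from the Java program's output):
def extract_tokens(text, offset_lines):
    # each line from the Java program looks like "start,end"
    tokens = []
    for line in offset_lines:
        start, end = (int(part) for part in line.split(","))
        tokens.append(text[start:end])
    return tokens

print(extract_tokens("lovely weather today", ["0,6", "7,14", "15,20"]))
# ['lovely', 'weather', 'today']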
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel 🙂 today" (where "🙂" is a character outside the BMP), the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 "š "
10,15 "oday"
Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16-like way, and all string operations act on those UTF-16 code units. Python strings, on the other hand, operate on the actual Unicode characters (code points). So when a character falls outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
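The mismatch is easy to demonstrate in Python itself; here "🙂" simply stands in for any character outside the BMP:
s = "🙂"                                # one character outside the BMP
print(len(s))                           # 1 - Python counts code points
print(len(s.encode("utf-16-le")) // 2)  # 2 - Java counts UTF-16 code units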
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?
Since Python 3.3, the str type is a sequence of Unicode code points. Unicode characters have no byte representation by themselves; that is what a character encoding provides - a mapping from Unicode characters to bytes. A good practice is to decode your bytes with UTF-8 (or whatever encoding was used to create them) as soon as they are loaded from a file, run your processing on Unicode code points throughout your Python code, and only encode back to bytes with UTF-8 at the end when writing output. This is called the Unicode sandwich.
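As a minimal sketch of that pattern (the file names here are made up for the example):
# decode once at the input boundary...
with open("input.txt", "rb") as f:
    text = f.read().decode("utf-8")

# ...work on str (code points) in the middle...
words = text.split()

# ...and encode once on the way back out.
with open("output.txt", "wb") as f:
    f.write(" ".join(words).encode("utf-8"))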
You could convert the string to a bytearray in UTF16 encoding, then use the offsets (multiplied by 2 since there are two bytes per UTF-16 code-unit) to index that array:
x = "I feel š today"
y = bytearray(x, "UTF-16LE")
offsets = [(0,1),(2,6),(7,9),(10,15)]
for word in offsets:
print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))
Output:
I
feel
🙂
today
Alternatively, you could convert every Python character in the string individually to UTF-16 and count the number of code units it takes. This lets you map the indices in terms of code units (from Java) to indices in terms of Python characters:
from itertools import accumulate
x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)]  # from java program
# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((u, i) for i, u in enumerate(utf16indices))
# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]
# now you can just use those indices as normal
for word in offsets:
    print(x[word[0]:word[1]])
Output:
I
feel
🙂
today
The above code is messy and can probably be made clearer, but you get the idea.
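For instance, the same mapping could be wrapped in a small helper (the function and variable names below are my own, not anything the Java program dictates):
from itertools import accumulate

def utf16_offsets_to_python(text, utf16_offsets):
    # number of UTF-16 code units each Python character occupies (1 or 2)
    unit_lengths = (len(ch.encode("utf-16-le")) // 2 for ch in text)
    # utf16_index[i] == number of code units before character i
    utf16_index = [0] + list(accumulate(unit_lengths))
    index_map = {u: i for i, u in enumerate(utf16_index)}
    return [(index_map[start], index_map[end]) for start, end in utf16_offsets]

x = "I feel 🙂 today"
for start, end in utf16_offsets_to_python(x, [(0,1),(2,6),(7,9),(10,15)]):
    print(x[start:end])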