Java's Normalizer already lets me take accented characters and output their non-accented equivalents. It does not, however, seem to deal with composite characters (Œ, Æ) at all well. Is there a way for Java to deal with these characters natively? I'd like to avoid having to keep a Map of these characters (as that was the reason we moved to using Normalizer in the first place). For example, an input of "Œ" should return "OE", in much the same way it already neatly decomposes characters such as "½" into "1/2".

How to properly Normalize a String with composite characters?

1 Answer

TL;DR: No, there is no way to handle these uniformly with native Java.

Long Answer

As noted in the question "Separating Unicode ligature characters", Java's Normalizer implementation does not decompose all of the ligatures that exist in written language.

The reason is that Unicode itself does not define decompositions for all of the ligatures that exist in written language. Æ (U+00C6) and Œ (U+0152), for example, have no decomposition mapping, so the Normalizer leaves them unchanged. Ligatures are a debated subject when it comes to storing written language: one can argue that they are unimportant from a data viewpoint, and equally that they are important from a layout viewpoint.
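To make the difference concrete, here is a minimal sketch that runs a few characters through compatibility decomposition (NFKD). The class name NormalizerDemo and the helper nfkd are made up for this illustration; only java.text.Normalizer comes from the JDK.

```java
import java.text.Normalizer;

// Hypothetical demo class; only Normalizer itself is part of the JDK.
final class NormalizerDemo {
  private NormalizerDemo () {}

  // Applies compatibility decomposition (NFKD) to the input.
  static String nfkd ( String input ) {
    return Normalizer.normalize ( input, Normalizer.Form.NFKD );
  }

  public static void main ( String[] args ) {
    // U+FB01 (fi ligature) has a compatibility mapping, so it becomes "fi".
    System.out.println ( nfkd ( "\uFB01" ) );
    // U+00BD (one half) becomes '1', U+2044 (fraction slash), '2'.
    System.out.println ( nfkd ( "\u00BD" ) );
    // U+0152 (OE) and U+00C6 (AE) have no mapping and come back unchanged.
    System.out.println ( nfkd ( "\u0152" ).equals ( "\u0152" ) ); // true
    System.out.println ( nfkd ( "\u00C6" ).equals ( "\u00C6" ) ); // true
  }
}
```

Note that the slash produced for "½" is the fraction slash U+2044 rather than the ASCII "/", which may matter if the output is compared against plain ASCII text.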

The data viewpoint holds that decomposing a ligature loses no information, so it makes more sense to store only the decomposed forms, and the composed forms should not be in Unicode at all.

The layout viewpoint holds that the composed ligature represents the proper layout of the written form of the language, and so should be represented in the data with its own code point.

Possible Solution

I would suggest creating a service with an interface that handles ligatures only. Supply a concrete implementation that handles everything you currently need. If new ligatures are needed in the future, they can be added without modifying the original code, simply by adding a new JAR that supplies them to the program's class path.

The skeletal implementation may look like this.

Please note I have omitted the code that actually uses the ServiceLoader to locate the LigatureDecoder and LigatureEncoder implementations.

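final class Ligatures {
  private Ligatures () {} // static utility class, not meant to be instantiated

  public static CharSequence compose ( CharSequence decomposedCharacters ) {
    // ServiceLoader lookup of LigatureEncoder implementations omitted
    throw new UnsupportedOperationException ( "not implemented in this skeleton" );
  }

  public static CharSequence decompose ( CharSequence composedCharacters ) {
    // ServiceLoader lookup of LigatureDecoder implementations omitted
    throw new UnsupportedOperationException ( "not implemented in this skeleton" );
  }
}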

interface LigatureDecoder {
  CharSequence decompose ( CharSequence composedCharacters );
}

interface LigatureEncoder {
  CharSequence compose ( CharSequence decomposedCharacters );
}
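As a rough idea of how the omitted part could be filled in, here is a sketch of the decompose half only. The LatinLigatureDecoder class and its four-entry table are assumptions made for this example, not part of the answer above, and the LigatureDecoder interface is repeated so the sketch compiles on its own. The ServiceLoader lookup is one possible wiring, not necessarily what the author had in mind.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Repeated from the skeleton so this sketch compiles on its own.
interface LigatureDecoder {
  CharSequence decompose ( CharSequence composedCharacters );
}

// Hypothetical provider covering four Latin ligatures; extend the table as needed.
// In a real JAR this would be a public class with a public no-argument constructor.
final class LatinLigatureDecoder implements LigatureDecoder {
  private static final Map<Character, String> TABLE = new HashMap<> ();
  static {
    TABLE.put ( '\u00C6', "AE" ); // U+00C6
    TABLE.put ( '\u00E6', "ae" ); // U+00E6
    TABLE.put ( '\u0152', "OE" ); // U+0152
    TABLE.put ( '\u0153', "oe" ); // U+0153
  }

  @Override
  public CharSequence decompose ( CharSequence composedCharacters ) {
    StringBuilder result = new StringBuilder ( composedCharacters.length () );
    for ( int i = 0; i < composedCharacters.length (); i++ ) {
      char c = composedCharacters.charAt ( i );
      String replacement = TABLE.get ( c );
      // Characters without a table entry are copied through unchanged.
      if ( replacement != null ) {
        result.append ( replacement );
      } else {
        result.append ( c );
      }
    }
    return result;
  }
}

final class Ligatures {
  private Ligatures () {}

  // Passes the input through every LigatureDecoder that ServiceLoader finds.
  // With no providers registered, the input is returned unchanged.
  public static CharSequence decompose ( CharSequence composedCharacters ) {
    CharSequence current = composedCharacters;
    for ( LigatureDecoder decoder : ServiceLoader.load ( LigatureDecoder.class ) ) {
      current = decoder.decompose ( current );
    }
    return current;
  }
}
```

For ServiceLoader to find a provider on the class path, its JAR needs a file under META-INF/services/ named after the fully qualified interface name, listing one provider class name per line. Calling the decoder directly, new LatinLigatureDecoder ().decompose ( "\u0152uvre" ) yields "OEuvre". The table still has to live somewhere, but this keeps it behind the interface and out of the calling code.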

answered Oct 22 '22 12:10

Zixradoom


Weckar E.
