
Java string searching ignoring accents - part II

Tags:

java

unicode

This question is a continuation of Java string searching ignoring accents.

The answer to the original question shows how to remove diacritics from strings, so that, for instance, köln becomes koln. But łódź becomes łodz: the l with stroke is left untouched.
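For reference, the behaviour described above can be reproduced with a minimal sketch. It assumes the original answer used the common approach of `java.text.Normalizer` with NFD decomposition followed by removal of combining marks; the class and method names here are made up for illustration, and the strings are written as Unicode escapes so the source file stays ASCII-only.

```java
import java.text.Normalizer;

class StripAccentsDemo {
    // Decompose to NFD, then drop every combining mark (Unicode category M).
    // Assumption: this mirrors the technique from the original question.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "koeln" spelled with o-diaeresis (U+00F6): the diaeresis is removed.
        System.out.println(stripAccents("k\u00F6ln"));
        // l-with-stroke (U+0142), o-acute (U+00F3), d, z-acute (U+017A):
        // the two acute accents are removed, but U+0142 survives unchanged.
        System.out.println(stripAccents("\u0142\u00F3d\u017A"));
    }
}
```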

My question is: how can I remove the stroke as well, so that łódź becomes lodz?

Thanks.

mark asked Mar 03 '26 06:03


2 Answers

You cannot, at least not trivially for all such letters. Apart from its appearance and its Unicode name, the letter ł is not linked to l at all in Unicode: it has no decomposition into l plus a combining mark. (Linguistically that's a different matter.)

Your only option might be a conversion table for your use case, filled with all the characters you need to convert.
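A conversion table along these lines could be sketched as follows. This is only an illustration: the class name `FoldToAscii`, the method `fold`, and the contents of the table are assumptions, and the table is deliberately incomplete, so you would extend it with whatever characters your data actually contains.

```java
import java.text.Normalizer;
import java.util.Map;

class FoldToAscii {
    // Hypothetical, incomplete table of letters that have no Unicode
    // decomposition and therefore survive NFD-based accent stripping.
    private static final Map<Character, String> EXTRA = Map.of(
        '\u0142', "l",   // U+0142 small l with stroke
        '\u0141', "L",   // U+0141 capital L with stroke
        '\u00F8', "o",   // U+00F8 small o with stroke
        '\u00D8', "O",   // U+00D8 capital O with stroke
        '\u0111', "d",   // U+0111 small d with stroke
        '\u0110', "D",   // U+0110 capital D with stroke
        '\u00DF', "ss"   // U+00DF sharp s
    );

    static String fold(String s) {
        // Step 1: strip ordinary diacritics via decomposition.
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}+", "");
        // Step 2: replace the remaining special letters from the table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String mapped = EXTRA.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0142 U+00F3 d U+017A folds to "lodz"
        System.out.println(fold("\u0142\u00F3d\u017A"));
    }
}
```

For accent-insensitive searching, fold both the text and the search term before comparing, for example `fold(text).contains(fold(term))`.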

Joey answered Mar 05 '26 18:03


As tchrist suggested, I tried ICU (version 50.1): it did not treat ł as derived from l either. The l with stroke seems to be a special case in Unicode. See http://bugs.mysql.com/bug.php?id=11369, which says that in Unicode 4.0 it was not connected to L, while in Unicode 4.1 it is. I wonder whether anyone has tested the problem with a Java library based on Unicode 4.1.

Leo141 answered Mar 05 '26 20:03


