
Java string searching ignoring accents

I am trying to write a filter function for my application that will take an input string and filter out all objects that don't match the given input in some way. The easiest way to do this would be to use String's contains method, i.e. just check if the object (the String variable in the object) contains the string specified in the filter, but this won't account for accents.

The objects in question are basically Persons, and the strings I am trying to match are names. So for example, if someone searches for Joao I would expect Joáo to be included in the result set. I have already used the Collator class in my application to sort by name, and it works well because its compare method orders accented characters sensibly: using the UK Locale, á comes after a but before b. But obviously it doesn't return 0 if you compare a and á, because they are not equal.

So does anyone have any idea how I might be able to do this?

DaveJohnston asked Mar 07 '10 20:03


2 Answers

Make use of java.text.Normalizer and a shot of regex to get rid of the diacritics.

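import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decompose to NFD, then strip the combining marks that are left behind.
public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}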

Which you can use as follows:

String value = "Joáo";
String comparisonMaterial = removeDiacriticalMarks(value); // Joao
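
To tie this back to the filter in the question, here is a minimal sketch of an accent-insensitive contains check built on the same normalization. The class name AccentFilter and the methods fold, containsIgnoringAccents and filter are hypothetical names chosen for illustration, and the lower-casing step is an added assumption (the question does not say whether case should matter).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; names are illustrative, not from the original answer.
class AccentFilter {

    // Decompose to NFD, strip combining diacritical marks, then lower-case.
    // Lower-casing is an assumption: drop it if the filter should be case sensitive.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                         .toLowerCase(Locale.ROOT);
    }

    // True if the folded candidate contains the folded query.
    static boolean containsIgnoringAccents(String candidate, String query) {
        return fold(candidate).contains(fold(query));
    }

    // Keeps only the names that match the query, preserving their order.
    static List<String> filter(List<String> names, String query) {
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (containsIgnoringAccents(name, query)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Joáo Silva", "John Smith", "JOÃO Costa");
        System.out.println(filter(names, "joao")); // [Joáo Silva, JOÃO Costa]
    }
}
```

If the list of names is large, it is probably worth folding each name once and storing the result rather than re-normalizing on every search.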

BalusC answered Sep 26 '22 09:09


Collator does return 0 for a and á, if you configure it to ignore diacritics:

import java.text.Collator;

public boolean isSame(String a, String b) {
    Collator insensitiveStringComparator = Collator.getInstance();
    // PRIMARY ignores both accents and case; SECONDARY would ignore case
    // but still treat a and á as different.
    insensitiveStringComparator.setStrength(Collator.PRIMARY);
    return insensitiveStringComparator.compare(a, b) == 0;
}

isSame("a", "á") now yields true.
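
To see how the strength setting changes the outcome, here is a small sketch that runs the same comparison at each level. The class name CollatorStrengthDemo and the three-argument isSame are hypothetical, and pinning the locale to Locale.UK (the one mentioned in the question) is an assumption made so that the result does not depend on the JVM default.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of the original answer.
class CollatorStrengthDemo {

    // Compares two strings with a Collator for an explicit locale and strength.
    static boolean isSame(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.UK);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY: only base letters count, so accents and case are ignored.
        System.out.println(isSame("a", "á", Collator.PRIMARY));   // true
        System.out.println(isSame("a", "A", Collator.PRIMARY));   // true

        // SECONDARY: accents count, case still does not.
        System.out.println(isSame("a", "á", Collator.SECONDARY)); // false
        System.out.println(isSame("a", "A", Collator.SECONDARY)); // true

        // TERTIARY: case counts as well.
        System.out.println(isSame("a", "A", Collator.TERTIARY));  // false
    }
}
```

Note that compare only tests whole strings for equality, so on its own it does not give you a contains-style substring match.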

Benny Bottema answered Sep 25 '22 09:09