Searching a string with mixed accented and unaccented characters does not work in Java

Question

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

If I enter Ca, the Cá in the string above should be highlighted, but nothing is highlighted.

Below is what I tried.

Pattern mPattern;
String filterTerm; // the text typed into the filter input, e.g. Ca
String regex = createFilterRegex(filterTerm);
mPattern = Pattern.compile(regex);

private String createFilterRegex(String filterTerm) {
    filterTerm = Normalizer.normalize(filterTerm, Normalizer.Form.NFD);
    filterTerm = filterTerm.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return filterTerm;
}

public Pattern getPattern() {
    return mPattern;
}

In another class,

// nodeText is the entire text of the list being displayed.
private SpannableStringBuilder createHighlightedString(String nodeText, int highlightColor) {
    SpannableStringBuilder returnValue = new SpannableStringBuilder(nodeText);
    String lowercaseNodeText = nodeText;
    Matcher matcher = mFilter.getPattern().matcher(createFilterRegex(lowercaseNodeText));
    while (matcher.find()) {
        returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
    }

    return returnValue;
}

viewHolder.mTextView.setText(createHighlightedString((node.getText()), mHighlightColor));

But this is the output I get:

If I type the single letter o, it is highlighted, but if I type two or more letters, e.g. Ca, nothing is highlighted or displayed. I can't figure out what I am doing wrong.

WhatsApp, however, manages to do this.

When I type Co there, it recognizes and highlights the accented characters in the sentence.


Shadow · Accepted Answer

As you said,

String text = "Cámélan discovered ônte red aleŕt Como se extingue la deuda";

So whenever input arrives, take its first character and compare on that.

E.g. if you enter Ca:

if (StringUtils.isNotEmpty(substring)) { // substring is the search text
    substring = substring.substring(0, 1); // now you have C alone
}

So whatever you type, the content is filtered by the first character. Then:

SpannableString builder = higlightString(yourContent.getText(), mHighlightColor);
viewHolder.mTextView.setText(builder);

private SpannableString higlightString(String entireContent, int highlightColor) {
    SpannableString returnValue = new SpannableString(entireContent);
    String lowercaseNodeText = entireContent;
    try {
        // Match against the lower-cased, accent-stripped copy of the content.
        Matcher matcher = mFilter.getPattern().matcher(diacritical(lowercaseNodeText.toLowerCase()));
        while (matcher.find()) {
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}

// Decomposes the text (NFD) and removes the combining diacritical marks.
private String diacritical(String original) {
    String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Test case:

When you enter Ca, the search first collects all content containing C, then normalizes that content and filters it, so the term matches accented characters too and the matches are displayed highlighted.
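
The normalizing step above can be tried outside Android with a small plain-Java sketch. The class name FoldDemo and the helper fold are made up for this illustration, and the sketch assumes the search term is folded the same way as the content before the two are compared. Accented letters are written as \u escapes only to keep the source ASCII.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class, not part of the answer's code.
class FoldDemo {
    // Decompose to NFD, drop the combining marks, then lower-case.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Camelan discovered onte red alert" with the accents from the question.
        String text = "C\u00e1m\u00e9lan discovered \u00f4nte red ale\u0155t";
        System.out.println(fold(text));
        // The folded search term is found in the folded content.
        System.out.println(fold(text).contains(fold("Ca")));
    }
}
```

Folding both sides is what lets the plain input Ca meet the accented Cá in the content.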

Joop Eggen · Answer

You already got:

private String convertToBasicLatin(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
}

In order to have one unaccented basic Latin char match one Unicode code point of an accented letter, one should normalize the text to the composed form:

private String convertToComposedCodePoints(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
}

In general one might make the assumption that the Unicode code point is 1 char long too.

The search key uses convertToBasicLatin(sought)
The text view's content uses convertToComposedCodePoints(content)
The text content for matching uses convertToBasicLatin(content)

Now the matcher's start and end index positions are correct. I explicitly normalized line endings (regex \R) such as \r\n or \u0085 to a single \n. One cannot normalize to lowercase or uppercase, because the number of chars might change: German lowercase ß corresponds to uppercase SS.
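
The point about index positions can be checked with a short sketch. The class name LengthCheck and its helpers are hypothetical and only mirror the two conversions above, without the line-ending handling. It compares string lengths for a Cá that was typed with a separate combining accent.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class for comparing the lengths of the normalized forms.
class LengthCheck {
    static String toBasicLatin(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'C', 'a', then the combining acute accent U+0301: three chars.
        String decomposed = "Ca\u0301";
        System.out.println(decomposed.length());               // 3
        System.out.println(toComposed(decomposed).length());   // 2
        System.out.println(toBasicLatin(decomposed).length()); // 2
        // Case mapping can change the length: sharp s becomes "SS".
        System.out.println("\u00df".toUpperCase(Locale.ROOT).length()); // 2
    }
}
```

Because the composed and the stripped strings have the same length here, a match position found in one can be applied to the other. That only holds while every accented letter in the text has a precomposed code point, which is an assumption worth checking for the languages you support.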

String sought = ...;
String content = ...;

sought = convertToBasicLatin(sought);
String latinContent = convertToBasicLatin(content);
String composedContent = convertToComposedCodePoints(content);

Matcher m = Pattern.compile(sought, Pattern.CASE_INSENSITIVE
        | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS
        | Pattern.UNIX_LINES)
    .matcher(latinContent);
while (m.find()) {
    ... // One can apply `m.start()` and `m.end()` to composedContent of the view too.
}
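
As a rough, self-contained version of the snippet above, the following sketch collects the match ranges in plain Java. The class AccentSearch and the method findRanges are names invented for this example, and Pattern.quote is an addition that is not in the answer: it is there on the assumption that the search term comes from user input and should be matched literally rather than as a regex.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, a sketch rather than the answer's exact code.
class AccentSearch {
    static String toBasicLatin(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
    }

    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).replaceAll("\\R", "\n");
    }

    // Returns {start, end} pairs meant to be used as indices into toComposed(content).
    static List<int[]> findRanges(String sought, String content) {
        List<int[]> ranges = new ArrayList<>();
        String key = toBasicLatin(sought);
        if (key.isEmpty()) {
            return ranges; // an empty key would match at every position
        }
        Pattern pattern = Pattern.compile(Pattern.quote(key),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = pattern.matcher(toBasicLatin(content));
        while (m.find()) {
            ranges.add(new int[] {m.start(), m.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        String content = "C\u00e1m\u00e9lan discovered \u00f4nte red ale\u0155t\n"
                + "Como se extingue la deuda";
        String composed = toComposed(content);
        for (int[] r : findRanges("Co", content)) {
            System.out.println(r[0] + "-" + r[1] + ": " + composed.substring(r[0], r[1]));
        }
    }
}
```

On Android, each returned pair could be passed to setSpan on a SpannableString built from the composed content, in the same way the question's loop uses matcher.start(0) and matcher.end(0).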
