
Searching a string with mixed accented and plain characters is not working in Java

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

If I give the input Ca, it should highlight Cá in the given string, but it does not.

Below is what I tried.

Pattern mPattern;
String filterTerm; // the input typed into the search filter, e.g. "Ca"
String regex = createFilterRegex(filterTerm);
mPattern = Pattern.compile(regex);

private String createFilterRegex(String filterTerm) {
    filterTerm = Normalizer.normalize(filterTerm, Normalizer.Form.NFD);
    filterTerm = filterTerm.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return filterTerm;
}

public Pattern getPattern() {
    return mPattern;
}

In another class:

// nodeText is the full text of the list entry being displayed.
private SpannableStringBuilder createHighlightedString(String nodeText, int highlightColor) {
        SpannableStringBuilder returnValue = new SpannableStringBuilder(nodeText);
        String lowercaseNodeText = nodeText;
        Matcher matcher = mFilter.getPattern().matcher((createFilterRegex(lowercaseNodeText)));
        while (matcher.find()) {
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }

        return returnValue;
    }

viewHolder.mTextView.setText(createHighlightedString((node.getText()), mHighlightColor));

But this is the output I get:

If I type the single letter o, it is highlighted, but if I type two or more letters, e.g. Ca, nothing is highlighted or displayed. I can't figure out what I am doing wrong.

WhatsApp, however, manages this.

When I type Co there, it recognizes and highlights the accented characters in the sentence.

Star Avatar asked Oct 16 '18 12:10



2 Answers

As you said,

String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";

Whenever the user types input, take its first character and compare against that.

For example, if you type Ca, then:

if (StringUtils.isNotEmpty(substring)) { // substring is the search text
    substring = substring.substring(0, 1); // now you have "C" alone
}

So whatever you type, the content is filtered by its first character only. Then highlight:

SpannableString builder = highlightString(yourContent.getText(), mHighlightColor);
viewHolder.mTextView.setText(builder);

private SpannableString highlightString(String entireContent, int highlightColor) {
    SpannableString returnValue = new SpannableString(entireContent);
    try {
        // Match against the lower-cased, accent-stripped copy of the content.
        Matcher matcher = mFilter.getPattern().matcher(diacritical(entireContent.toLowerCase()));
        while (matcher.find()) {
            returnValue.setSpan(new ForegroundColorSpan(highlightColor), matcher.start(0),
                    matcher.end(0), Spannable.SPAN_EXCLUSIVE_INCLUSIVE);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}



private String diacritical(String original) {
    // Decompose accented letters, then drop the combining marks.
    String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Test case:

When you type Ca, the filter first selects all content matching C. That content is then normalized, so the pattern also matches the accented characters, and the matches are displayed highlighted.
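The normalizing step can be checked on its own, outside Android. The following is a minimal sketch, assuming plain Java with `java.text.Normalizer`; the class name `DiacriticsDemo` and the method name `stripDiacritics` are made up for illustration and mirror the `diacritical` helper above.

```java
import java.text.Normalizer;

class DiacriticsDemo {

    // Decompose to NFD so each accent becomes a separate combining mark,
    // then delete every character in the Combining Diacritical Marks block.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String text = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";
        // Prints the sample text with all accents removed.
        System.out.println(stripDiacritics(text));
    }
}
```

With the accents removed, a plain search term such as Ca can be found in the stripped copy of Cámélan.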

Shadow Avatar answered Oct 17 '22 08:10



You already have:

private String convertToBasicLatin(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
}

In order to have one unaccented basic Latin char match one Unicode code point of an accented letter, one should normalize the displayed text to the composed form:

private String convertToComposedCodePoints(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
}

In general one may assume that such a composed Unicode code point is 1 char long too.
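To see why the composed form matters for index positions, the lengths of the different forms can be compared. This is only a sketch: the class `NormalizationLengths` and its helper names are hypothetical, and it assumes every accented letter in the word has a precomposed code point in the Basic Multilingual Plane, which holds for á and é.

```java
import java.text.Normalizer;

class NormalizationLengths {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Same idea as convertToBasicLatin, without the line-ending handling.
    static String stripMarks(String s) {
        return nfd(s).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String word = "C\u00e1m\u00e9lan"; // "Cámélan" with precomposed á and é
        System.out.println(nfc(word).length());        // 7: one char per letter
        System.out.println(nfd(word).length());        // 9: two extra combining marks
        System.out.println(stripMarks(word).length()); // 7: same length as the NFC form
    }
}
```

Because the stripped text and the composed text have the same length here, a match position found in one can be applied to the other.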

  • The search key uses convertToBasicLatin(sought)
  • The text view's content uses convertToComposedCodePoints(content)
  • The text content for matching uses convertToBasicLatin(content)

Now the matcher's start and end index positions are correct. I explicitly normalized line endings (regex \R) such as \r\n or \u0085 to a single \n. One cannot normalize to lowercase or uppercase, because the number of chars may change: German lowercase ß corresponds to uppercase SS.

String sought = ...;
String content = ...;

sought = convertToBasicLatin(sought);
String latinContent = convertToBasicLatin(content);
String composedContent = convertToComposedCodePoints(content);

Matcher m = Pattern.compile(sought, Pattern.CASE_INSENSITIVE
        | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS
        | Pattern.UNIX_LINES)
    .matcher(latinContent);
while (m.find()) {
    ... // One can apply `m.start()` and `m.end()` to composedContent of the view too.
}
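
The pieces above can be put together into something runnable outside Android. The following is a sketch under a few assumptions, not a definitive implementation: the class `AccentInsensitiveSearch` and the method `findRanges` are hypothetical names, `Pattern.LITERAL` is added so that regex metacharacters in the search term are matched as plain text, and the returned ranges only line up with the composed text when each accented letter composes to a single char.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccentInsensitiveSearch {

    // Accent-free copy used for matching; line endings unified to \n.
    static String convertToBasicLatin(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "").replaceAll("\\R", "\n");
    }

    // Composed copy intended for display; line endings unified to \n.
    static String convertToComposedCodePoints(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC).replaceAll("\\R", "\n");
    }

    // Returns {start, end} pairs meant to index into the composed content.
    static List<int[]> findRanges(String sought, String content) {
        List<int[]> ranges = new ArrayList<>();
        String key = convertToBasicLatin(sought);
        if (key.isEmpty()) {
            return ranges; // an empty key would match at every position
        }
        String latinContent = convertToBasicLatin(content);
        // Case-insensitive matching instead of toLowerCase(), which may change length.
        Matcher m = Pattern.compile(key, Pattern.LITERAL
                | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher(latinContent);
        while (m.find()) {
            ranges.add(new int[] {m.start(), m.end()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        String content = "Cámélan discovered ônte red aleŕt \n Como se extingue la deuda";
        String composed = convertToComposedCodePoints(content);
        for (int[] r : findRanges("Ca", content)) {
            System.out.println(r[0] + "-" + r[1] + ": " + composed.substring(r[0], r[1]));
        }
    }
}
```

On Android, each returned pair could be passed to `setSpan` on a `SpannableString` built from the composed content, in place of the `System.out.println` call.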
Joop Eggen Avatar answered Oct 17 '22 06:10
