
HTML ASCII Case Insensitive ICU Collator

Tags: java, icu, icu4j

I need to create a Collator which corresponds to https://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive/ i.e. one that ignores case differences between the ASCII letters A-Z and a-z when making comparisons.

I have attempted this with the following ICU4j RuleBasedCollator:

final RuleBasedCollator collator =
        new RuleBasedCollator("&a=A, b=B, c=C, d=D, e=E, f=F, g=G, h=H, "
                + "i=I, j=J, k=K, l=L, m=M, n=N, o=O, p=P, q=Q, r=R, s=S, t=T, "
                + "u=U, v=V, u=U, v=V, w=W, x=X, y=Y, z=Z").freeze();

However, the following comparison seems to fail, where I would expect it to succeed (i.e. return true):

final SearchIterator searchIterator = new StringSearch(
        "pu", new StringCharacterIterator("iNPut"), collator);
return searchIterator.first() >= 0;

What am I missing in my rules?

adamretter Avatar asked Oct 30 '17 19:10


2 Answers

  1. This W3C "collation" does not look like a Collator in the usual sense. It's an ASCII-case-insensitive matcher without ordering. I suspect that it is usually implemented with low-level code that matches ASCII letters case-insensitively and everything else precisely. See https://www.w3.org/TR/xpath-functions-31/#html-ascii-case-insensitive-collation

  2. The Collator rules probably don't do what you think they do. The comma is old syntax for a tertiary difference, so &a=A, b=B, c=C is the same as &a=A<<<b=B<<<c=C. I think you were intending something like &a=A &b=B &c=C etc.

    • http://userguide.icu-project.org/collation/customization
    • http://demo.icu-project.org/icu-bin/collation.html
    • http://www.unicode.org/reports/tr35/tr35-collation.html#Rules
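As an illustration of point 1, here is a minimal sketch of such a low-level matcher in plain Java, without ICU4J. The class and method names are made up for this example, and it reflects one reading of the W3C definition (fold only ASCII A-Z, compare everything else code unit by code unit); it is not a conformance-tested implementation.

```java
// Hypothetical helper: ASCII-only case folding, exact matching for everything else.
final class AsciiCaseInsensitive {
    private AsciiCaseInsensitive() {}

    // Maps 'A'..'Z' to 'a'..'z'; any other char is returned unchanged.
    static char foldAscii(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? (char) (ch + ('a' - 'A')) : ch;
    }

    // True if both strings are equal after folding ASCII letters only.
    static boolean equalsIgnoreAsciiCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (foldAscii(a.charAt(i)) != foldAscii(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Index of the first occurrence of pattern in text under that comparison, or -1.
    static int indexOf(String text, String pattern) {
        for (int start = 0; start + pattern.length() <= text.length(); start++) {
            int i = 0;
            while (i < pattern.length()
                    && foldAscii(text.charAt(start + i)) == foldAscii(pattern.charAt(i))) {
                i++;
            }
            if (i == pattern.length()) {
                return start;
            }
        }
        return -1;
    }
}
```

With this sketch, AsciiCaseInsensitive.indexOf("iNPut", "pu") finds the match at index 2, while "é" and "É" are treated as different because they are outside ASCII.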
Markus Scherer Avatar answered Oct 16 '22 16:10


com.ibm.icu.text.RuleBasedCollator#compare

Returns an integer value. Value is less than zero if source is less than target, value is zero if source and target are equal, value is greater than zero if source is greater than target.

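// Imports this fragment needs:
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Locale;

String a = "Pu";
String b = "pu";

// c1: default en_US collator; c2: tailored so that p and P compare as equal.
// Note: the RuleBasedCollator(String) constructor declares a checked Exception.
RuleBasedCollator c1 = (RuleBasedCollator) Collator.getInstance(new Locale("en", "US", ""));
RuleBasedCollator c2 = new RuleBasedCollator("& p=P");
System.out.println(c1.compare(a, b) == 0);
System.out.println(c2.compare(a, b) == 0);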

Output
======
false
true

It appears that the rules are not where the problem lies; something seems to be wrong with the SearchIterator code.
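The single-letter rule "& p=P" used above can be extended to the whole ASCII alphabet by writing one reset (&) per letter. A rough sketch that builds such a rule string in plain Java follows; the class and method names are hypothetical, and whether StringSearch then finds the match with the resulting collator is not verified here.

```java
// Hypothetical helper that builds the rule string "&a=A &b=B ... &z=Z".
final class AsciiCaseRules {
    private AsciiCaseRules() {}

    static String build() {
        StringBuilder rules = new StringBuilder();
        for (char lower = 'a'; lower <= 'z'; lower++) {
            if (rules.length() > 0) {
                rules.append(' ');
            }
            char upper = (char) (lower - 'a' + 'A');
            // Each letter gets its own reset, so only the two cases of that letter are made equal.
            rules.append('&').append(lower).append('=').append(upper);
        }
        return rules.toString();
    }
}
```

The returned string would then be passed to the constructor in the same way as the "& p=P" example, i.e. new RuleBasedCollator(AsciiCaseRules.build()).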


If you don't have to use the SearchIterator then perhaps you could write your own 'contains' method. Maybe something like this:

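// Returns true if some substring of a, with the same length as b,
// compares as equal to b under the collator c.
boolean contains(String a, String b, RuleBasedCollator c) {
  // Slide a window of b.length() chars across a.
  for (int start = 0; start + b.length() <= a.length(); start++) {
    if (c.compare(a.substring(start, start + b.length()), b) == 0) {
      return true;
    }
  }
  return false;
}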

Perhaps not the best code in the world, but you get the idea.

ParallelNoob Avatar answered Oct 16 '22 18:10
