I need to create a Collator that corresponds to https://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive/, i.e. one that ignores case for the ASCII characters A-Z and a-z when making comparisons.
I have attempted this with the following ICU4J RuleBasedCollator:
final RuleBasedCollator collator =
    new RuleBasedCollator("&a=A, b=B, c=C, d=D, e=E, f=F, g=G, h=H, "
        + "i=I, j=J, k=K, l=L, m=M, n=N, o=O, p=P, q=Q, r=R, s=S, t=T, "
        + "u=U, v=V, u=U, v=V, w=W, x=X, y=Y, z=Z").freeze();
However, the following comparison fails where I would expect it to succeed (i.e. return true):
final SearchIterator searchIterator = new StringSearch(
    "pu", new StringCharacterIterator("iNPut"), collator);
return searchIterator.first() >= 0;
What am I missing in my rules?
This W3C "collation" does not look like a Collator in the usual sense. It's an ASCII-case-insensitive matcher without ordering. I suspect that it is usually implemented with low-level code that matches ASCII letters case-insensitively and everything else precisely. See https://www.w3.org/TR/xpath-functions-31/#html-ascii-case-insensitive-collation
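As a rough illustration of the "low-level code" idea, here is a minimal sketch in plain Java with no ICU4J involved. The class and method names (AsciiCaseInsensitive, foldAscii, contains) are made up for this example, and it only covers substring matching, not everything the W3C collation is used for.

```java
// Hypothetical helper, not part of ICU4J or any W3C reference implementation.
final class AsciiCaseInsensitive {
    private AsciiCaseInsensitive() {}

    // Folds ASCII A-Z to a-z; every other UTF-16 code unit is returned unchanged.
    static char foldAscii(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? (char) (ch + ('a' - 'A')) : ch;
    }

    // True if needle occurs somewhere in haystack, comparing ASCII letters
    // case-insensitively and all other characters exactly.
    static boolean contains(String haystack, String needle) {
        for (int start = 0; start + needle.length() <= haystack.length(); start++) {
            if (matchesAt(haystack, start, needle)) {
                return true;
            }
        }
        return false;
    }

    // Compares needle against haystack at the given offset, code unit by code unit.
    private static boolean matchesAt(String haystack, int start, String needle) {
        for (int i = 0; i < needle.length(); i++) {
            if (foldAscii(haystack.charAt(start + i)) != foldAscii(needle.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this, AsciiCaseInsensitive.contains("iNPut", "pu") is true, while non-ASCII letters such as "É" and "é" are still treated as different, which is the behaviour the W3C definition describes.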
The Collator rules probably don't do what you think they do. The comma is old syntax for a tertiary difference, so &a=A, b=B, c=C is the same as &a=A<<<b=B<<<c=C. I think you were intending something like &a=A &b=B &c=C etc.
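Rather than typing out all 26 resets by hand, the rule string in that form could be generated. This is only a sketch: AsciiCaseRules and build are names invented for the example, and the block just produces the string, it does not touch ICU4J.

```java
// Hypothetical helper that builds the rule string "&a=A &b=B ... &z=Z".
final class AsciiCaseRules {
    private AsciiCaseRules() {}

    // Emits one reset per letter, each declaring the uppercase form
    // identical (=) to the lowercase one, separated by single spaces.
    static String build() {
        StringBuilder rules = new StringBuilder();
        for (char lower = 'a'; lower <= 'z'; lower++) {
            if (rules.length() > 0) {
                rules.append(' ');
            }
            rules.append('&').append(lower).append('=').append(Character.toUpperCase(lower));
        }
        return rules.toString();
    }
}
```

The result would then be passed to new RuleBasedCollator(AsciiCaseRules.build()). Whether StringSearch finds the match with these rules is not something this sketch verifies.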
com.ibm.icu.text.RuleBasedCollator#compare
Returns an integer value. Value is less than zero if source is less than target, value is zero if source and target are equal, value is greater than zero if source is greater than target
String a = "Pu";
String b = "pu";
RuleBasedCollator c1 = (RuleBasedCollator) Collator.getInstance(new Locale("en", "US", ""));
RuleBasedCollator c2 = new RuleBasedCollator("& p=P");
System.out.println(c1.compare(a, b) == 0);
System.out.println(c2.compare(a, b) == 0);
Output
======
false
true
It appears that the rules are not where the problem lies; something seems to be wrong with the SearchIterator code.
If you don't have to use the SearchIterator then perhaps you could write your own 'contains' method. Maybe something like this:
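// Returns true if some substring of a, of the same length as b,
// compares equal to b under the collator c.
boolean contains(String a, String b, RuleBasedCollator c) {
    // Slide a window of b.length() characters across a.
    for (int start = 0; start + b.length() <= a.length(); start++) {
        if (c.compare(a.substring(start, start + b.length()), b) == 0) {
            return true;
        }
    }
    return false;
}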
Perhaps not the best code in the world, but you get the idea.