In chapter 3, item 8:
public final class CaseInsensitiveString {
    private final String s;

    public CaseInsensitiveString(String s) {
        if (s == null)
            throw new NullPointerException();
        this.s = s;
    }

    @Override public boolean equals(Object o) {
        return o instanceof CaseInsensitiveString &&
            ((CaseInsensitiveString) o).s.equalsIgnoreCase(s);
    }
    // remainder omitted
}
After describing the issues surrounding the equals() method, he goes on to talk about this class in the context of comparing fields.
For some classes, such as CaseInsensitiveString above, field comparisons are more complex than simple equality tests. If this is the case, you may want to store a canonical form of the field, so the equals() method can do cheap exact comparisons on these canonical forms rather than more costly inexact comparisons. This technique is most appropriate for immutable classes; if the object can change, you must keep the canonical form up-to-date.
So my question (and I double-checked what 'canonical' means): what is Bloch talking about? What would the canonical form be? I'm ready to be told that the answer is very simple (presumably otherwise his editor would have told him to add more) but I want to see other people say so.
He also mentions the same thing for hashCode() in the next item, item 9.
To give it in context, he also discusses a bad version of the equals() method for CaseInsensitiveString:
// Broken - violates symmetry
@Override public boolean equals(Object o) {
    if (o instanceof CaseInsensitiveString)
        return s.equalsIgnoreCase(
            ((CaseInsensitiveString) o).s);
    if (o instanceof String) // one-way interoperability!
        return s.equalsIgnoreCase((String) o);
    return false;
}
In this context, a canonical form is a representation such that every object has a unique representation (with canonicalization being the process through which a representation is put into its canonical form). Thus, the equality of two objects can easily be tested by testing the equality of their canonical forms.
A canonical form means that values of a particular type of resource can be described or represented in multiple ways, and one of those ways is chosen as the favored canonical form.
Canonical representation vulnerabilities can occur when a data conversion process does not convert the data to its simplest form resulting in the possible misrepresentation of the data. The application may behave in an unexpected manner when acting on input that has not been sanitized or normalized.
Defining a baseline set of functionality for objects: I call this object idiom "canonical" because it represents the simplest form an object should take. The idea behind this idiom is to suggest a baseline functionality that you give by default to any object you design.
You should add another final field and store the value s.toUpperCase() in it. This new field will be the canonical representation of the s field. The new implementation of equals() (see the code below) will be cheaper, because it does an exact comparison instead of a case-insensitive one. This approach works as-is only for immutable classes; a mutable class would have to keep the canonical field up to date whenever s changes.
Another point: do not forget to override hashCode() if you override equals().
public final class CaseInsensitiveString {
    private final String s;
    private final String sForEquals; // canonical form, added to simplify equals() and hashCode()

    public CaseInsensitiveString(String s) {
        if (s == null) {
            // this answer prefers IllegalArgumentException; the book's version throws NullPointerException
            throw new IllegalArgumentException();
        }
        this.s = s;
        this.sForEquals = s.toUpperCase();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CaseInsensitiveString &&
            ((CaseInsensitiveString) o).sForEquals.equals(this.sForEquals);
    }

    @Override
    public int hashCode() {
        return sForEquals.hashCode();
    }
    // remainder omitted
}
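One caveat with the class above: the no-argument toUpperCase() uses the JVM's default locale, so the canonical form can differ between machines. Upper-casing a whole string is also not exactly the same relation as equalsIgnoreCase() (for example, "ß" upper-cases to "SS"). The following is only a sketch of a locale-independent variant; the class names and the choice of Locale.ROOT are my assumptions, not something the book prescribes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

// Hypothetical variant of the class above (name and Locale.ROOT are assumptions).
final class CanonicalCaseInsensitiveString {
    private final String original;   // the text as supplied by the caller
    private final String canonical;  // canonical form, computed once in the constructor

    CanonicalCaseInsensitiveString(String s) {
        this.original = Objects.requireNonNull(s, "s");
        // Locale.ROOT keeps the canonical form independent of the default locale.
        this.canonical = s.toUpperCase(Locale.ROOT);
    }

    @Override public boolean equals(Object o) {
        // Cheap exact comparison on the canonical forms.
        return o instanceof CanonicalCaseInsensitiveString
                && ((CanonicalCaseInsensitiveString) o).canonical.equals(canonical);
    }

    @Override public int hashCode() {
        // Must be derived from the same canonical form that equals() compares.
        return canonical.hashCode();
    }

    @Override public String toString() {
        return original;
    }
}

final class CanonicalFormDemo {
    public static void main(String[] args) {
        Set<CanonicalCaseInsensitiveString> set = new HashSet<>();
        set.add(new CanonicalCaseInsensitiveString("Polish"));
        set.add(new CanonicalCaseInsensitiveString("polish"));
        // Both strings share the canonical form "POLISH", so the set holds one element.
        System.out.println(set.size());
    }
}
```

Because the field is final and computed once, equals() and hashCode() never have to redo the case conversion, which is the saving the quoted passage describes.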
The term canonical has several usages. It applies to values that have several representations (or to several varying values that are considered equal). Often one specific representation (or value) is then chosen as the canonical one.
Example: sets of integers, with { 2, 3, 5 } as the canonical form: { 2, 3, 5 } = { 3, 5, 2 } = { 2, 2, 5, 3 } = ...
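As a rough illustration of the set example, one way to canonicalize is to drop duplicates and sort, so that every spelling of the same set ends up as the same list. The class and method names here are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: the names are illustrative only.
final class IntSetCanonicalizer {
    // Returns the canonical form of a set of integers: duplicates removed, ascending order.
    static List<Integer> canonicalize(List<Integer> values) {
        return new ArrayList<>(new TreeSet<>(values));
    }

    public static void main(String[] args) {
        List<Integer> a = canonicalize(Arrays.asList(3, 5, 2));
        List<Integer> b = canonicalize(Arrays.asList(2, 2, 5, 3));
        // Both inputs describe the same set, so their canonical forms are equal.
        System.out.println(a.equals(b));
    }
}
```

Once both inputs are in canonical form, a plain equals() is enough to decide whether they denote the same set.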
For the plain Java String there is an issue too. The same text in Unicode can be represented differently: ĉ can be written either as one code point, "\u0109" (LATIN SMALL LETTER C WITH CIRCUMFLEX), or as two code points, c (LATIN SMALL LETTER C) followed by a zero-width ^ (COMBINING CIRCUMFLEX ACCENT), i.e. "\u0063\u0302".
So even a plain String should be canonicalized in some cases:
// requires: import java.text.Normalizer;
String s = "...";
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
This uses Normalizer to decompose the string. The advantage is that when sorting, "c" and "ĉ" stay together. One could also remove the combining diacritical marks with a regex to get an ASCII version.
In fact, different operating systems handle Unicode file names differently, and version control systems do not always apply a cross-platform canonicalisation.
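The regex idea can be sketched as follows. This is only one possible approach: it decomposes with NFD and then removes every character in the Unicode "Mark" category via \p{M}. The class and method names are hypothetical, and letters that have no decomposition (such as "ø") are left untouched.

```java
import java.text.Normalizer;

// Hypothetical helper: the names are illustrative only.
final class DiacriticStripper {
    // Decomposes the text, then deletes the combining marks that were split off.
    static String stripMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "\u0109" decomposes to "c" + U+0302; removing the mark leaves "c".
        System.out.println(stripMarks("\u0109"));
    }
}
```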
Only after a Normalizer.normalize does a comparison with String.equals actually indicate Unicode text equality.
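A small sketch of that comparison, using the two spellings of ĉ from above. I use NFC here as an assumption; any normalization form works as long as both sides use the same one. The class and method names are invented for the example.

```java
import java.text.Normalizer;

// Hypothetical helper: the names are illustrative only.
final class UnicodeEqualityDemo {
    // Compares two strings after bringing both into the same canonical (NFC) form.
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u0109";     // one code point
        String decomposed = "c\u0302";  // base letter plus combining circumflex
        System.out.println(composed.equals(decomposed));    // raw comparison: the code points differ
        System.out.println(sameText(composed, decomposed)); // comparison of the canonical forms
    }
}
```

The raw equals() call sees two different code point sequences, while the normalized comparison treats them as the same text.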