
What is a 'canonical representation' of a field meant to be for equals() method (Joshua Bloch)

Tags: java, equals

In chapter 3, item 8 of Joshua Bloch's Effective Java, this class appears:

public final class CaseInsensitiveString {
    private final String s;

    public CaseInsensitiveString(String s) {
        if (s == null)
            throw new NullPointerException();
        this.s = s;
    }

    @Override public boolean equals(Object o) {
        return o instanceof CaseInsensitiveString &&
            ((CaseInsensitiveString) o).s.equalsIgnoreCase(s);
    }
    // remainder omitted
}

After describing issues surrounding the equals() method, he goes on to talk about this class in the context of comparing fields.

For some classes, such as CaseInsensitiveString above, field comparisons are more complex than simple equality tests. If this is the case, you may want to store a canonical form of the field, so the equals() method can do cheap exact comparisons on these canonical forms rather than more costly inexact comparisons. This technique is most appropriate for immutable classes; if the object can change, you must keep the canonical form up-to-date.

So my question (and I double-checked what 'canonical' means): what is Bloch talking about? What would the canonical form be? I'm ready to be told that the answer is very simple (presumably otherwise his editor would have told him to add more) but I want to see other people say so.

He also mentions the same thing for hashCode() in the next item, item 9.

To give it in context, he also discusses a bad version of the equals() method for CaseInsensitiveString:

// Broken - violates symmetry
@Override public boolean equals(Object o) {
    if (o instanceof CaseInsensitiveString)
        return s.equalsIgnoreCase(
            ((CaseInsensitiveString) o).s);
    if (o instanceof String) // one-way interoperability!
        return s.equalsIgnoreCase((String) o);
    return false;
}
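
A minimal sketch of how the asymmetry shows up in practice. The names SymmetryDemo and BrokenCaseInsensitiveString are made up for this illustration, and the class body is the broken version quoted above, nested so that the example compiles on its own:

```java
class SymmetryDemo {
    // Hypothetical name; the body mirrors the broken equals() quoted above.
    static final class BrokenCaseInsensitiveString {
        private final String s;

        BrokenCaseInsensitiveString(String s) {
            if (s == null)
                throw new NullPointerException();
            this.s = s;
        }

        // Broken - violates symmetry
        @Override public boolean equals(Object o) {
            if (o instanceof BrokenCaseInsensitiveString)
                return s.equalsIgnoreCase(((BrokenCaseInsensitiveString) o).s);
            if (o instanceof String) // one-way interoperability!
                return s.equalsIgnoreCase((String) o);
            return false;
        }
    }

    public static void main(String[] args) {
        BrokenCaseInsensitiveString cis = new BrokenCaseInsensitiveString("Polish");
        String s = "polish";

        // The wrapper knows about String, so this direction succeeds.
        System.out.println(cis.equals(s)); // true

        // String.equals() knows nothing about the wrapper, so this direction fails.
        System.out.println(s.equals(cis)); // false
    }
}
```

The two calls disagree, which is exactly the symmetry violation the comment in the book's code warns about.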
Adam asked Jun 26 '14 10:06




2 Answers

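You should add another final field and store the value s.toUpperCase() in it. This new field is the canonical representation of the s field. The new implementation of equals() (see the code below) is cheaper, because it does an exact comparison on the precomputed canonical forms. The approach is simplest for immutable classes; if the object could change, the canonical field would have to be kept up to date.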

Another point: do not forget to override hashCode() whenever you override equals().

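import java.util.Locale;

public final class CaseInsensitiveString {

  private final String s;
  private final String sForEquals; // canonical form of s, computed once to simplify equals()

  public CaseInsensitiveString(String s) {
      if (s == null) {
          throw new IllegalArgumentException(); // the book's version throws NullPointerException here
      }
      this.s = s;
      // Locale.ROOT keeps the result independent of the default locale.
      // Note: not identical to equalsIgnoreCase() for every character ("ß" uppercases to "SS").
      this.sForEquals = s.toUpperCase(Locale.ROOT);
  }

  @Override
  public boolean equals(Object o) {
      // cheap exact comparison on the canonical forms
      return o instanceof CaseInsensitiveString &&
          ((CaseInsensitiveString) o).sForEquals.equals(this.sForEquals);
  }

  @Override
  public int hashCode() {
      // derived from the same canonical form, so equal objects get equal hash codes
      return sForEquals.hashCode();
  }
  // remainder omitted
}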
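
A hedged usage sketch of the class above, showing why equals() and hashCode() should both be derived from the canonical field. The wrapper class CanonicalFieldDemo, the field name canonical, the toString() method, and the use of Locale.ROOT are assumptions added for this illustration:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CanonicalFieldDemo {
    static final class CaseInsensitiveString {
        private final String s;
        private final String canonical; // canonical form of s, computed once

        CaseInsensitiveString(String s) {
            if (s == null)
                throw new IllegalArgumentException("s must not be null");
            this.s = s;
            this.canonical = s.toUpperCase(Locale.ROOT);
        }

        @Override public boolean equals(Object o) {
            return o instanceof CaseInsensitiveString
                && ((CaseInsensitiveString) o).canonical.equals(canonical);
        }

        @Override public int hashCode() {
            return canonical.hashCode();
        }

        @Override public String toString() {
            return s; // the original spelling is preserved
        }
    }

    public static void main(String[] args) {
        Set<CaseInsensitiveString> set = new HashSet<>();
        set.add(new CaseInsensitiveString("Polish"));

        // Lookup works regardless of case because hash and equality use the canonical form.
        System.out.println(set.contains(new CaseInsensitiveString("POLISH")));   // true
        System.out.println(set.contains(new CaseInsensitiveString("polished"))); // false
        System.out.println(set.iterator().next());                               // Polish
    }
}
```

The original string is kept for display, while all comparisons go through the precomputed canonical string.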
Sergey Morozov answered Oct 19 '22 23:10


The term canonical has several usages. It applies to values that have several representations (or to several differing values that count as equal). Often one specific representation (or value) is then chosen as the canonical one.

Example: for sets of integers, { 2, 3, 5 } = { 3, 5, 2 } = { 2, 2, 5, 3 } = ..., and { 2, 3, 5 } can be chosen as the canonical form.
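
The set example can be sketched in Java as follows. This is only an illustration: the class CanonicalSetDemo and the methods canonicalize and sameSet are hypothetical names, and "ascending order without duplicates" is just one possible choice of canonical form:

```java
import java.util.Arrays;

class CanonicalSetDemo {
    // Canonical form chosen here: ascending order, duplicates removed.
    static int[] canonicalize(int... elements) {
        return Arrays.stream(elements).distinct().sorted().toArray();
    }

    // Two listings denote the same set if their canonical forms are exactly equal.
    static boolean sameSet(int[] a, int[] b) {
        return Arrays.equals(canonicalize(a), canonicalize(b));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(canonicalize(2, 2, 5, 3)));            // [2, 3, 5]
        System.out.println(sameSet(new int[] {3, 5, 2}, new int[] {2, 2, 5, 3})); // true
        System.out.println(sameSet(new int[] {3, 5, 2}, new int[] {2, 5}));       // false
    }
}
```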

Plain Java Strings have the same issue. The same text can be represented in Unicode in different ways: ĉ is either the single code point "\u0109" (LATIN SMALL LETTER C WITH CIRCUMFLEX), or two code points, c (LATIN SMALL LETTER C) followed by the zero-width combining mark ^ (COMBINING CIRCUMFLEX ACCENT), written "\u0063\u0302".

So even a plain String should be canonicalized in some cases:

import java.text.Normalizer;
String s = "...";
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD); // NFKD = compatibility decomposition

This uses Normalizer to decompose the string. The advantage is that when sorting, "c" and "ĉ" stay together. One could also remove the combining diacritical marks with a regex to obtain an ASCII version.
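
A minimal sketch of the regex idea, assuming the canonical decomposition NFD is sufficient and that every character in the Unicode category M (marks) should be dropped. StripAccentsDemo and stripAccents are hypothetical names:

```java
import java.text.Normalizer;

class StripAccentsDemo {
    // Decompose first, so that accents become separate combining marks,
    // then delete every character in the Unicode "Mark" category.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u0109apelo"));  // capelo
        System.out.println(stripAccents("c\u0302apelo")); // capelo
    }
}
```

This only yields ASCII for letters that actually decompose into a base letter plus marks; a letter without a decomposition, such as ø, passes through unchanged.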

In fact, different operating systems handle Unicode file names differently, and version control systems do not always apply a cross-platform canonicalisation.

Only after Normalizer.normalize does a comparison with String.equals really indicate Unicode text equality.
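
A small sketch of that point, using the two spellings of ĉ from above. It normalizes to NFC rather than the NFKD form used earlier; either works for this comparison as long as both sides use the same form. UnicodeEqualityDemo and sameText are hypothetical names:

```java
import java.text.Normalizer;

class UnicodeEqualityDemo {
    // Compare two strings by their canonical (NFC) forms.
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u0109";    // one code point
        String decomposed = "c\u0302"; // base letter plus combining circumflex

        System.out.println(composed.equals(decomposed));   // false
        System.out.println(sameText(composed, decomposed)); // true
    }
}
```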

Joop Eggen answered Oct 19 '22 21:10