In OCaml, polymorphic comparison is implemented by walking the runtime representation of values which consists of immediates and pointers to blocks.
According to Real World Ocaml, a polymorphic variant with no parameters is just stored as an unboxed integer. Excerpt reproduced here for convenience.
A polymorphic variant without any parameters is stored as an unboxed integer and so only takes up one word of memory, just like a normal variant. This integer value is determined by applying a hash function to the name of the variant. The hash function isn't exposed directly by the compiler, but the type_conv library from Core provides an alternative implementation: ...
However, polymorphic comparison does not appear to operate on the value of the integer and appears to respect the lexicographic ordering of the name of the polymorphic variant (in the top-level at least).
# List.sort Pervasives.compare
[ `L ; `K ; `J ; `I ; `H ; `G ; `F ; `E ; `D; `C ; `B; `A ];;
[`A; `B; `C; `D; `E; `F; `G; `H; `I; `J; `K; `L]
There is one small wrinkle: the length of the representation seems to be weighted most in the ordering.
# List.sort compare [ `BBBB; `AAAA; `AAA; `ABA; `BB; `ZZ; `AA ];;
[`AA; `BB; `ZZ; `AAA; `ABA; `AAAA; `BBBB]
How is OCaml pulling this off? How is the information OCaml needs to sort the variants lexicographically still present at runtime? Shouldn't a polymorphic variant without any arguments be indistinguishable from an ordinary integer?
Did the OCaml implementers pick a hash function that, by coincidence/design, has this behavior for short variant names?
Due to its construction the hash function preserves the order for short strings. But this is not a general property.
# List.sort compare [`AAAAAAA; `BAAAAAA; `CAAAAAA];;
- : [> `AAAAAAA | `BAAAAAA | `CAAAAAA ] list =
[`BAAAAAA; `CAAAAAA; `AAAAAAA]
#
The hashing code looks like this for OCaml 4.06.0:
CAMLexport value caml_hash_variant(char const * tag)
{
value accu;
for (accu = Val_int(0); *tag != 0; tag++)
accu = Val_int(223 * Int_val(accu) + *((unsigned char *) tag));
#ifdef ARCH_SIXTYFOUR
accu = accu & Val_long(0x7FFFFFFFL);
#endif
/* Force sign extension of bit 31 for compatibility between 32 and 64-bit
platforms */
return (int32_t) accu;
}
It seems to me that for short strings of characters with codes less than 223 this will tend to preserve lexical order.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With