Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the recommended implementation for hashing OLE Variants?

OLE Variants, as used by older versions of Visual Basic and pervasively in COM Automation, can store lots of different types: basic types like integers and floats, more complicated types like strings and arrays, and all the way up to IDispatch implementations and pointers in the form of ByRef variants.

Variants are also weakly typed: they convert the value to another type without warning depending on which operator you apply and what the current types are of the values passed to the operator. For example, comparing two variants, one containing the integer 1 and another containing the string "1", for equality will return True.

So assuming that I'm working with variants at the underlying data level (e.g. VARIANT in C++ or TVarData in Delphi - i.e. the big union of different possible values), how should I hash variants consistently so that they obey the right rules?

Rules:

  • Variants that hash unequally should compare as unequal, both in sorting and direct equality
  • Variants that compare as equal for both sorting and direct equality should hash as equal

It's OK if I have to use different sorting and direct comparison rules in order to make the hashing fit.

The way I'm currently working is I'm normalizing the variants to strings (if they fit), and treating them as strings, otherwise I'm working with the variant data as if it was an opaque blob, and hashing and comparing its raw bytes. That has some limitations, of course: numbers 1..10 sort as [1, 10, 2, ... 9] etc. This is mildly annoying, but it is consistent and it is very little work. However, I do wonder if there is an accepted practice for this problem.

like image 804
Barry Kelly Avatar asked Mar 19 '10 18:03

Barry Kelly


People also ask

What are hash tables used for?

A hash table is a data structure that you can use to store data in key-value format with direct access to its items in constant time. Hash tables are said to be associative, which means that for each key, data occurs at most once.

What is hashing in programming?

Hashing is the process of transforming any given key or a string of characters into another value. This is usually represented by a shorter, fixed-length value or key that represents and makes it easier to find or employ the original string. The most popular use for hashing is the implementation of hash tables.


1 Answers

There's a built in tension in your question between the use of a hash function and the stated requirements, which are to validated against the input of the hash. I'd suggest we keep in mind a few properties of hashes in general: information is lost during the hashing process, and hash collisions are to be expected. It is possible to construct a perfect hash without collisions, but it would be problematic (or impossible?) to construct a perfect hash function if the domain of the function is any possible OLE Variant. On the other hand, if we're not talking about a perfect hash, then your first rule is violated.

I don't know the larger context of what you're trying to accomplish, but I must push back on one of your assumptions: is a hash function really what you want? Your requirements could be met in a fairly straightforward way if you develop a system that encodes, not hashes, all of the possible OLE Variant attributes so that they can be recalled later and compared against other Variant images.

Your baseline implementation of converting the Variant to a string representation is moving in this direction. As you are no doubt aware, a Variant can contain pointers, double pointers, and arrays, so you'll have to develop consistent string representation of these data types. I question whether this approach could really be classified as a hash. Aren't you just persisting data attributes?

like image 184
Paul Keister Avatar answered Sep 27 '22 18:09

Paul Keister