Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving set comparisons by hashing them (being overly clever..?)

After my last, failed, attempt at asking a question here I'm trying a more precise question this time:

What I have:

  1. A huge dataset (finite, but I wasted days of multi-core processing time to compute it before...) of ISet<Point>.
  2. A list of input values between 0 to 2n, n≤17

What I need:

3) A table of [1], [2] where I map every value of [2] to a value of [1]

The processing:

For this computation I have a formula, that takes a bit value (from [2]) and a set of positions (from [1]) and creates a new ISet<Point>. I need to find out which of the original set is equal to the resulting set (i.e. The "cell" in the table at "A7" might point to "B").

The naive way:

Compute the new ISet<Point> and use .Contains(mySet) or something similar on the list of values from [1]. I did that in previous versions of this proof of concept/pet project and it was dead slow when I started feeding huge numbers. Yes, I used a profiler. No, this wasn't the only slow part of the system, but I wasted a considerable amount of time in this naive lookup/mapping.

The question, finally:

Since I basically just need to remap to the input, I thought about creating a List of hashed values for the list of ISet<Point>, doing the same for my result from the processing and therefor avoiding to compare whole sets.

Is this a good idea? Would you call this premature optimization (I know that the naive way above is too slow, but should I implement something less clever first? Performance is really important here, think days of runtime again)? Any other suggestions to ease the burdon here or ideas what I should read up on?


Update: Sorry for not providing a better explanation or a sample right away.

Sample for [1] (Note: These are real possible datapoints, but obviously it's limited) :

new List<ISet<Point>>() {
  new HashSet() {new Point(0,0) },
  new HashSet() {new Point(0,0), new Point(2,1) },
  new HashSet() {new Point(0,1), new Point(3,1) }
}

[2] is just a boolean vector of the length n. For n = 2 it's

  • 0,0
  • 0,1
  • 1,0
  • 1,1

I can do that one by using an int or long, basically.

Now I have a function that takes an vector and an ISet<Point> and returns a new ISet<Point>. It's not a 1:1 transformation: An set of 5 might result in a set of 11 or whatever. The resulting ISet<Point> is however guaranteed to be part of the input.

Using letters for a set of points and numbers for the bit vectors, I'm starting with this

  A  B  C  D  E  F
1
2
3
4
5
6
7

What I need to have at the end is

  A  B  C  D  E  F
1 -  C  A  E  -  -
2 B  C  E  F  A  -
3 ................
4 ................
5 ................
6 F  C  B  A  -  - 
7 E  -  C  A  -  D

There are several costly operations in these project, one is the preparation of the sets of point ([1]). But this question is about the matching now: I can easily (more or less, not that important now) compute a target ISet for a given bit vector and a source ISet. Now I need to match/find that in the original set.

The whole beast is going to be a state machine, where the set of points is a valid state. Later I don't care about the individual states, I can actually refer to them by anything (a letter, an index, whatever). I just need to keep the associations:

1, B => C


Update: Eric asked if a HashSet would be possible. The answer is yes, but only if the dataset stays small enough. My question (hashing) is: Might it be possible/a good idea to employ a hashing algorithm for this hashset? My idea is this:

  • Walk the (lazily generated) list/sequence of ISet<Point> (I could change this type, I just want to stress that it is a mathematical set of points, no duplicates).

    • Create a simpler representation of the input (a hash?) and store it (in a hashset?)
    • Compute all target sets for this input, but only store again a simple representation (see above)
    • Discard the set
  • Fix up the mapping (equal hash = equal state)

Good idea? Problems with this? One that I could come up with is a collision (how probable is that?) - and I wouldn't know a good hashing function to begin with..

like image 467
Benjamin Podszun Avatar asked Mar 06 '10 20:03

Benjamin Podszun


1 Answers

OK, I think I understand the problem at least now. Let me see if I can rephrase.

Let's start by leaving sets out of it. We'll keep it abstract.

You have a large list L, containing instances of reference type S (for "set"). Such a list is of course logically a mapping from natural numbers N onto S.

L: N --> S

S has the property that two instances can be compared for both reference equality and value equality. That is, there can be two instances of S which are not reference equals, but logically represent the same value.

You have a function F which takes a value of type V (for "vector") and an instance of type S and produces another instance of type S.

F: (V, S) --> S

Furthermore, you know that if F is given an instance of S from L then the resulting instance of S will be value equals to something on the list, but not necessarily reference equals.

The problem you face is: given an instance s of S which is the result of a call to F, which member L(n) is value-equals to s?

Yes?

The naive method -- go down L(1), L(2), ... testing set equality along the way will be dead slow. It'll be at least linear in the size of L.

I can think of several different ways to proceed. The easiest is your initial thought: make L something other than a list. Can you make it a HashSet<S> instead of List<S>? If you implement a hashing algorithm and equality method on S then you can build a fast lookup table.

If that doesn't suit then we'll explore other options.

UPDATE:

OK, so I can see two basic ways to deal with your memory problem. (1) Keep everything in memory using data structures that are much smaller than your current implementation, or (2) change how you store stuff on disk so that you can store an "index" in memory and rapidly go to the right "page" of the disk file to extract the information you need.

You could be representing a point as a single short where the top byte is x and the bottom byte is y, instead of representing it as two ints; a savings of 75%.

A set of points could be implemented as a sorted array of shorts, which is pretty compact and easy to write a hash algorithm for.

That's probably the approach I'd go for since your data are so compressible.

like image 74
Eric Lippert Avatar answered Oct 28 '22 05:10

Eric Lippert