Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare two numbers for "likeness"

This is part of a search function on a website. So im trying to find a way to get to the end result as fast as possible.

Have a binary number where digit order matters.

Input Number = 01001

Have a database of other binary numbers all the same length.

01000, 10110, 00000, 11111

I dont know how to write what im doing, so im going to do it more visually below.

// Zeros mean nothing & the location of a 1 matters, not the total number of 1's.    
input num > 0 1 0 0 1 = 2 possible matches
number[1] > 0 1 0 0 0 = 1 match = 50% match
number[2] > 1 0 1 1 0 = 0 match = 0% match
number[3] > 0 0 0 0 0 = 0 match = 0% match
number[4] > 1 1 1 1 1 = 2 match = 100% match

Now obviously, you could go digit by digit, number by number and compare it that way (using a loop and what not). But I was hoping there might be an algorithm or something that will help. Mostly because in the above example I only used 5 digit numbers. But im going to be routinely comparing around 100,000 numbers with 200 digits each, that's a lot of calculating.

I usually deal with php and MySQL. But if something spectacular comes up I could always learn.

like image 310
Adam Avatar asked Sep 05 '11 22:09

Adam


2 Answers

If it's possible to somehow chop up your bitstrings in integer-size chunks some elementary boolean arithmetic would do, and that kind of instructions is generally pretty fast

$matchmask = ~ ($inputval ^ $tomatch) & $inputval

What this does:

  • the xor determines the bits that are different in the inputval and tomatch
  • negation gives a value where all bits that are equal in inputval and tomatch are set
  • and that with inputval and only the bits that are 1 in both inputval and tomatch remain set.

Then count the number of bits set in the result, look at How to count the number of set bits in a 32-bit integer? for an optimal solution, easily translated into php

like image 188
fvu Avatar answered Oct 01 '22 04:10

fvu


Instead of checking each bit, you could pre-process the input and determine which bits need checking. In the worst case, this devolves into processing each bit, but for a normal distribution, you'll save some processing.

That is, for input

01001, iterate over the database and determine if number1[0] & input is non-zero, and (number1[3] >> 8) & input is non-zero, assuming 0 as the index of the LSB. How you get fast bit-shifting and anding with the large numbers is on you, however. If you detect 1s than 0s in the input, you could always invert the input and test for zero to detect coverage.

This will give you modest improvement, but it's at best a constant-time reduction of the problem. If most of your inputs are balanced between 0s and 1s, you'll halve the number of required operations. If it's more biased, you'll get better results.

like image 23
Stefan Kendall Avatar answered Oct 01 '22 06:10

Stefan Kendall