Creating a comparable and flexible fingerprint of an object

My situation

Say I have thousands of objects, which in this example could be movies.

I parse these movies in a lot of different ways, collecting parameters, keywords and statistics about each of them. Let's call them keys. I also assign a weight to each key, ranging from 0 to 1, depending on frequency, relevance, strength, score and so on.

As an example, here are a few keys and weights for the movie Armageddon:

"Armageddon"
------------------
disaster       0.8
bruce willis   1.0
metascore      0.2
imdb score     0.4
asteroid       1.0
action         0.8
adventure      0.9
...            ...

There could be a couple of thousand of these keys and weights, and for clarity, here's another movie:

"The Fast and the Furious"
------------------
disaster       0.1
bruce willis   0.0
metascore      0.5
imdb score     0.6
asteroid       0.0
action         0.9
adventure      0.6
...            ...

I call this a fingerprint of a movie, and I want to use them to find similar movies within my database.

I also imagine it will be possible to insert something other than a movie, like an article or a Facebook profile, and assign a fingerprint to it if I wanted to. But that shouldn't affect my question.

My problem

So I have come this far, but now comes the part I find tricky. I want to take the fingerprint above and turn it into something easily comparable and fast. I tried creating an array where index 0 = disaster, index 1 = bruce willis, index 2 = metascore, and so on, with the weight as the value.

It comes out something like this for my two movies above:

[ 0.8 , 1.0 , 0.2 , ... ]
[ 0.1 , 0.0 , 0.5 , ... ]

Which I have tried comparing in different ways, by just multiplying:

public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;

    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // dot product: keys that are strong in both movies contribute the most
            result += f1[i] * f2[i];
        }
    }

    return result;
}

or comparing:

public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;

    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // average per-key similarity: 1.0 means identical weights for that key
            result += (1 - Math.Abs(f1[i] - f2[i])) / f1.Length;
        }
    }

    return result;
}

and so on.

These have returned very satisfying results, but they all share one problem: they work great for comparing two movies, but in reality it's quite time consuming and feels like very bad practice when I want to compare a single movie's fingerprint against thousands of fingerprints stored in my MSSQL database. Especially if it's supposed to work with things like autocomplete, where I want to return the results in fractions of a second.

My question

Do I have the right approach here, or am I reinventing the wheel in a really inefficient way? I hope my question isn't too broad for Stack Overflow, but I have narrowed it down with a few thoughts below.

A couple of thoughts

  • Should my fingerprint really be an array of weights?
  • Should I look into hashing my fingerprint? It might help with fingerprint storage, but it would complicate comparison. I have found some hints that this might be a valid approach, using locality-sensitive hashing, but the math is a bit out of my reach (a rough sketch of that idea follows this list).
  • Should I fetch all thousands of movies from SQL and work with the result, or is there a way to implement my comparison into an SQL query and only return the top 100 hits?
  • Is sparse data representation something to look into? (Thanks Speed8ump)
  • Could I apply methods used when comparing actual fingerprints or for OCR?
  • I have heard that there is software that detects exam cheating by finding similarities in thousands of published papers and previous tests. What method do they use?
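
For the locality-sensitive hashing thought above, here is a minimal random-hyperplane (SimHash-style) sketch, assuming the dense weight-array representation from earlier; the 64-bit signature size, the fixed seed and the uniform random planes are arbitrary illustration choices (the textbook version uses Gaussian vectors):

// Sketch only: fingerprints with a small angle between them tend to get
// signatures with a small Hamming distance, so candidates can be
// pre-filtered by comparing 64-bit signatures instead of full weight arrays.
public class LshSketch
{
    private readonly double[][] hyperplanes; // one random direction per signature bit

    public LshSketch(int dimensions, int bits = 64, int seed = 42)
    {
        var rnd = new Random(seed);
        hyperplanes = new double[bits][];
        for (int b = 0; b < bits; b++)
        {
            hyperplanes[b] = new double[dimensions];
            for (int d = 0; d < dimensions; d++)
                hyperplanes[b][d] = rnd.NextDouble() * 2 - 1; // uniform here for simplicity
        }
    }

    public ulong Signature(double[] fingerprint)
    {
        ulong signature = 0;
        for (int b = 0; b < hyperplanes.Length; b++)
        {
            double dot = 0;
            for (int d = 0; d < fingerprint.Length; d++)
                dot += fingerprint[d] * hyperplanes[b][d];
            if (dot >= 0)
                signature |= 1UL << b; // which side of the hyperplane the fingerprint falls on
        }
        return signature;
    }

    // Number of differing bits; lower generally means more similar fingerprints.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0) { count++; x &= x - 1; }
        return count;
    }
}

Storing one such signature per movie would let a query rank rows by Hamming distance first and only run the full weight-by-weight comparison on the closest few hundred.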

Cheers!

asked Feb 07 '14 by Magnus Engdal


3 Answers

Alternative 1: Feature Vector

What you are describing is a classical feature vector. Each column in the feature vector describes a category. Your feature vector is a special kind: it holds fuzzy data, describing the degree of belonging to each category.

When processing such vectors, you should apply fuzzy logic for the calculations. With fuzzy logic you have to play around a little bit until you find the numerical operators that best match your fuzzy operations. E.g. fuzzy AND and OR could be computed with "min" and "max", or with "*" and "+", or even with more complex exponential operations. You have to find the right balance between good results and fast computation.
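
As a rough illustration of the kind of operators meant here (a sketch, not code from the answer), using min/max as fuzzy AND/OR, with the averaging similarity being just one possible way to fold them into a single score:

// Sketch of fuzzy operators applied per key; min/max are the classic fuzzy
// AND/OR, while product and probabilistic sum (a + b - a*b) are common
// alternatives worth benchmarking.
public static double FuzzyAnd(double a, double b) { return Math.Min(a, b); }
public static double FuzzyOr(double a, double b) { return Math.Max(a, b); }

// One possible similarity: average the fuzzy AND over all keys.
public static double FuzzySimilarity(double[] f1, double[] f2)
{
    double sum = 0;
    for (int i = 0; i < f1.Length; i++)
        sum += FuzzyAnd(f1[i], f2[i]);
    return sum / f1.Length;
}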

Unfortunately, fuzzy logic does not fit very well with SQL databases. If you go the fuzzy way, you should consider holding all your data in memory and using some kind of numerical processing acceleration (processor SIMD instructions, CUDA/OpenCL, FPGAs, etc.).

Alternative 2: Star / Snowflake Schema

A different approach is to build a classical data warehouse schema. This fits well with modern SQL databases. They have nice accelerations for retrieving data from a medium-sized data warehouse (up to a few billion records):

  1. Materialized views (for data reduction)
  2. (Compressed) bitmap indexes (for fast combining of multiple features)
  3. Compressed storage (for fast transfer of huge amounts of data)
  4. Partitioning (to physically separate data according to their features)

To use these optimizations, you must first prepare your data.

Hierarchical dimensions

You should order your features hierarchically, according to the snowflake schema. When the data is ordered this way (and you have the corresponding indexes), the database can use a new set of optimizations, e.g. bitmap filtering.

Data organized in this way should be mainly read only. The database will need data structures that are very fast for special kinds of queries but are also very expensive to update.

An example is a bitmap index. A bitmap index is a binary matrix. The rows of the matrix correspond to the rows of one table in your database, and the columns correspond to the possible values of one column in that table. A matrix entry is 1 when the table row holds the value belonging to that matrix column, and 0 otherwise.

A bitmap index is stored in a compressed binary format. This makes it very easy for the database to combine multiple bitmap indexes using fast binary processing (ANDing or ORing the binary values with in-processor SIMD instructions, or even OpenCL/CUDA, etc.).
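
As a toy illustration of that combining step (a sketch, not how any particular database engine implements it), with one bit per row packed 64 to a word:

// Toy bitmap combine: each ulong covers 64 rows, one bitmap per feature value.
// Rows matching both features are found with a plain bitwise AND.
public static ulong[] CombineAnd(ulong[] bitmapA, ulong[] bitmapB)
{
    var result = new ulong[bitmapA.Length];
    for (int i = 0; i < bitmapA.Length; i++)
        result[i] = bitmapA[i] & bitmapB[i];
    return result;
}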

There is a special kind of bitmap index that can span multiple tables, the so-called bitmap join index. It is built specifically for data organized in a snowflake schema.

Dimension reduction

You should also use dimension reduction to cut down the number of features that must be stored. For this you can use techniques like principal component analysis. With it, you can combine multiple highly coupled features into one artificial feature, and completely remove features whose value never changes.

Discrete dimension members

For fuzzy logic, using floating-point numbers is nice. But when storing data in a data warehouse, it is a good idea to reduce the number of possible values. Bitmap indexes and partitioning only work well with a limited number of distinct values. You can use classification algorithms to achieve this, e.g. self-organizing feature maps or particle swarm optimization.
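
As a trivial sketch of that reduction (fixed-width buckets, with the bucket count picked arbitrarily; the self-organizing maps mentioned above would choose better boundaries):

// Quantize a 0..1 weight into one of a few discrete members (here 5 buckets),
// so that bitmap indexes and partitioning stay effective.
public static int ToBucket(double weight, int buckets = 5)
{
    int bucket = (int)(weight * buckets);
    return Math.Min(bucket, buckets - 1); // weight == 1.0 falls into the top bucket
}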

Alternative 3: Hybrid approach

You can easily combine the two approaches described above. You store the data in your data warehouse using condensed descriptions (fewer dimensions, fewer members), while each data set still contains the original features. Once you have retrieved the data sets from the data warehouse, you can use the techniques from alternative 1 to operate on the full descriptions, e.g. to determine the top candidates according to the current context.

answered by stefan.schwetschke


The idea is cool; this way I can find all good (imdb > 5.5) movies with Bruce, where he plays a main role (bruce willis > 0.9), which are action movies (action > 0.5) and not horrors (horror < 0.1). I hate horrors.

Your thoughts:

  • array of weights is bad, because as you get more and more keys, a movie that doesn't have a given actor still has to hold a value (0) for that key, which is a waste of space (imagine millions of keys attached to each movie).
  • hashing doesn't make sense, as you are not going to access anything by exact value; you will always compare keys with user-entered values, and many of them will be optional (meaning you don't care whether they are 0 or 10).
  • depends, see below.

I think what you need here is a sort of tag system (like the SO one), where you can easily add new tags (for example, for new actors, or when there is something better than Blu-ray or HD, etc.). So a table with tag [id]-[name].

Then your movies need a field which stores a dictionary of [id]-[score] for anywhere from zero to a million tags. This could be a blob (or is there any way to hold a dictionary or array in an SQL database?), or an array (if your tag ids start from 0 and increment by 1, you don't need the key, just the index).

When you are searching for movies matching fingerprint conditions, you will have to read the fingerprint from the database for each movie. This will be slower than having the SQL query do it, but still okay (you will have maybe 100-1000 tags per movie, which makes it only a few KB to read), unless you have to transfer this data over the network; in that case consider using a server-side application. Perhaps stored procedures can help.
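
A rough sketch of what such a tag-based fingerprint could look like on the application side (the scoring just mirrors the "multiply" approach from the question; none of this is from the answer itself): only non-zero tags are stored, and missing tags count as 0.

// Sparse fingerprint: only tags with a non-zero score are stored, so a movie
// with ~1,000 tags costs ~1,000 entries instead of one slot per known key.
public static double CompareSparse(Dictionary<int, double> f1, Dictionary<int, double> f2)
{
    double result = 0;
    foreach (var pair in f1)
    {
        double other;
        if (f2.TryGetValue(pair.Key, out other))
            result += pair.Value * other; // tags missing from f2 contribute 0
    }
    return result;
}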

answered by Sinatr


Fingerprint Format
Regarding your 1st question, whether you should use an array of weights, that comes down to the level of detail you want. An array of weights will offer the highest fingerprint "resolution", for lack of a better term; it allows for a much more fine-grained measurement of how similar any two given movies are. Sinatr's suggestion of using tags in place of weights has a lot of optimization potential, but it essentially limits you to weights of 0 or 1, and thus has trouble representing existing weights in the 0.3-0.7 range. You'll have to decide for yourself whether the performance gains of going to a representation with less detail outweigh the reduced comparison accuracy those representations have.

Hashes
Regarding your 2nd question, I'm afraid I can't offer much guidance. I'm not familiar with the use of hashing in this sort of context, but I don't see how you could compare them easily; the whole point of hashes in most uses is that they can't easily be reversed to learn about the original input.

SQL Optimization
For your 3rd question, the SQL query you use to get comparison candidates is probably a rich source of performance optimization potential, especially if you know some characteristics of your fingerprints. In particular if high weights or low weights are relatively rare, then you can use those to weed out a lot of poor candidates. For example, if you were using movies you would expect a lot of the weights to be 0 (most movies do not contain Bruce Willis). You could look at any weights in your candidate movie that are higher than .8 or so (you'll need to do some fine-tuning to determine the exact values that work well for your data set) and then have your SQL query exclude results that have a 0 in at least some fraction of those keys (again, the fraction will need fine-tuning). This allows you to quickly discard results that are unlikely to be good matches in the SQL query stage rather than doing a full (expensive) comparison on them.
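
A rough in-memory sketch of that weeding-out idea (the 0.8 threshold and the "at most half zeros" rule are made-up tuning values, and in practice the filter would live in the SQL query rather than in application code):

// Pre-filter: only movies that are non-zero on enough of the query's strong
// keys are worth the full (expensive) CompareFingerprints call.
public static List<int> FindCandidates(List<double[]> movies, double[] query,
    double strongThreshold = 0.8, double maxZeroFraction = 0.5)
{
    var strongKeys = new List<int>();
    for (int i = 0; i < query.Length; i++)
        if (query[i] >= strongThreshold)
            strongKeys.Add(i);

    var survivors = new List<int>();
    for (int m = 0; m < movies.Count; m++)
    {
        int zeros = 0;
        foreach (int key in strongKeys)
            if (movies[m][key] == 0)
                zeros++;

        if (strongKeys.Count == 0 || (double)zeros / strongKeys.Count <= maxZeroFraction)
            survivors.Add(m); // candidate survives the cheap filter
    }
    return survivors;
}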

Other Options
Another approach that might work, depending on how often the fingerprints of your objects change, is to pre-compute the fingerprint comparison values. Then getting the best candidates is a single query from an indexed table: SELECT id1, id2, comparison FROM precomputed WHERE (id1 = foo OR id2 = foo) AND comparison > cutoff ORDER BY comparison DESC. Pre-computing the comparisons for a new object would be part of the process of adding it, so if being able to quickly add objects is a priority then this approach may not work well. Alternately, you could simply cache values once you've computed them, rather than pre-computing them. This doesn't do anything for the initial search, but later searches reap the benefits and adding objects stays cheap.

answered by Oblivious Sage