This is a more complex follow-up question to: Efficient way to look up sequential values
Each Product can have many Segment rows (thousands). Each segment has position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain any values such as (323.113, 5423.231, 873.42, 422.64, 763.1, etc.). The data is read-only.
It may help to think of the product as a song and the segments as a set of musical notes in the song.
Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.
How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?
Here are just some thoughts for how I was about to approach this problem. Please don't take these as exact requirements. I am open to any kind of algorithms to make this work as best as possible. I was thinking there needs to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.
For example, given these values:
Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.
If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.
If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.
Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.
From the information you posted this can be solved with the edmond's blossom v perfect match algorithm. Either you can minimize or maximize the function and it will always find the best match. Maybe you can use a brute force solution with 2 loops. The wikipedia about edmond's matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm
You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.
As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for the data, but it is very like not the correct medium for processing the data. Whether it's possible or not will depend on your final solution on what constitutes "most closely match".
As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.
If you take literally your song example, one approach is to boil down your input to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying e.g. all bit-vectors that are only 1 or bit-errors away from your fingerprint.
If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition service" at acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744">http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.
The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With