Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching an approximate string in a Core Data store

I have a small problem with the core data application i'm currently writing. I have two differents models, contexts and peristent stores. One is for my app data, the other one is for a website with relevant infos to me.

Most of the time, I match exactly one record from my app to another record from the other source. Sometimes however, I have to fallback to fuzzy string matching to link the two records. I'm trying to match song titles. My local title could be the (made up) "The French Idealist is in your pensée" and the remote song title could be "01 - 10 - French idealist in in you're pensee, The (dub remix, feat. DJ Objective-C)"

I search stack overflow, Google, the cocoa documentation, and I can't find any clear answer on how to do a fuzzy matching in these cases. My strings can start with anything, have a bunch of special characters, usually end with random or to be ignored characters.

Regexp won't do, nor NSPredicates, Soundex doesn't work well with foreign names, and maybe the Levenshtein won't be enough (or will it ?).

I'm looking for a title in a set of about a dozen potential matches, but I hava to do this operation quite a lot. 100% accuracy is not the goal.

I was thinking of removing the ignored words, extracting the keywords (in this example, "french, idealist, pensée"), concatenate them, and then use the Levenshtein distance (words in song title should be in the same order).

In my special case, would it work ? What is the industry standard regarding this problem (I can't be the only one in the world who want to match slightly different songs names) Can Core Data, Cocoa or Objective-C help me ?

Thanks a lot.

like image 275
damdamdam Avatar asked May 19 '09 10:05

damdamdam


3 Answers

You want your search to be diacritic insensitive to match the 'é' in pensée and 'e' in pensee. You get this by adding the [d] after the attribute. Like so:

    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle like[cd] %@)", yourSongSubstring];
The 'c' in [cd] is for case insensitivity.

Since your string could appear in any order in the string you are searching, you could tokenize your search string ([... componentsByString:@" "]) then create a predicate like

    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"(songTitle like[cd] %@) and (songTitle like[cd] %@)", songToken1, songToken2];
That syntax to combine predicates above may be off, going from memory.
like image 198
baalexander Avatar answered Nov 12 '22 16:11

baalexander


I believe the tool you want to use here is SearchKit. I say that as if I've just made your job easy.... I haven't, but it should have the tools you need to be successful here. LNC is still offering their SearchKit Podcast for free (very nice).

Each track would be a document in this case, and you'd need to come up with a good way to index them with an identifier that can be used to find them. You can then load them up with metadata, and search them. Perhaps putting the title "in" the document would be helpful here to facilitate the use of Similarity Searching (kSKSearchOptionFindSimilar). That may or may not work really well.

The question you've asked is a good one, but there is certainly no industry standard for it because anyone who solves this problem well (i.e. every major search engine) keeps their algorithms very secret. This is a hard problem; no one is quite ready to give away their answer.

like image 31
Rob Napier Avatar answered Nov 12 '22 15:11

Rob Napier


Consider q-grams, which are substrings of length q (Gravano et al., 2001).

You could, for two strings s1 and s2, determine for each q-gram of s1 the corresponding q-gram of s2 with smallest edit distance. Then add all those distances and you end up with a metric which is very robust to permutation of words and extra characters.

Generally, q should be adapted to your problem domain (experiment with q = 3, 4, 5...).

like image 1
AtoN Avatar answered Nov 12 '22 14:11

AtoN