I have to match an input string against a set of prefixes. The match should be the best possible, such that if there are both abcd*
and abcde*
, then abcdef
should match abcde*
. I am using Trie for this. The problem is the character in the input and in the set of prefixes can be any Unicode character. So, the children array that we have in a simple trie won't be possible(won't be sufficiently efficient, atleast, as the array size will be very large). Using map instead of array would still be inefficient. How should I go about tackling this?
To construct the trie, you could encode the Unicode strings as UTF-8, and then construct a trie with the encoded byte sequences. Or you could work with the code points, and use a hash map in your nodes. You'll have to benchmark your application to figure out which approach works best.
But the hard problem is how to decide when two string match.
Consider the word café
This can be represented as:
A = [U+0063 U+0061 U+0066 U+0065 U+0301]
(ends with e and a combining accent)
But also as
B = [U+0063 U+0061 U+0066 U+00E9]
(end with é, the combined form)
So:
Should the strings match the prefix cafe (no accent)? A starts with that prefix, B doesn't. However A and B should either both match the prefix, or not, as both represent the same word café.
And what if you have A in your trie, and you're trying to match B? It's the same word, so should it match?
→ You may have to convert your strings to the same normalized form when inserting in your trie and when matching.
There's other issues. In German a double s is normally written as ß. Should ß and ss match or not?
And it goes on. Deciding if two Unicode strings are equal is a non-trivial problem in itself. It's up to you to decide how complex you want the matching to be, it depends on your application.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With