I am looking for a way to fuzzy substring
function. What do I mean by that:
Example 1:
This is an exact match and should have score 1.0.
Example 2:
This is a fuzzy match, as "weed" and "destroyed" appear in the text, but without "will be". Still it should get a high-score (say 0.8).
Example 3:
If we set the "Short" to "destroyed will be weeds", although "destroyed" and "weeds" both appear in the original text, the score should be very low, since their order has changed.
Any suggested implementation on this?
A final point is that, there is no unique way of doing this scoring. But I am looking for AN algorithm. The parameters of this algorithm can be tuned based on the needs and requirements.
Here's one possible approach:
I'd split both strings in dependency tree (something like this). Then, recursively traversed smaller tree from root and checked whether token is present in bigger tree. If yes, then add score similarity_of_dependency_kind
. Optionally, that can be multiplied by similarity_of_destination_words
(in terms of synonymicity, something like wordnet).
This approach is less efficient, but more accurate.
Also, don't forget preliminary data cleaning, like typos correction.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With