 

Fuzzy string-matching that can "skip"? e.g. "i am (.*)." has 0 distance to "I am here."

I'm writing a Python chatbot. Whatever the technique (Levenshtein, LCS, regex, etc.), I want a pattern like My name is [ A ]. to be smart enough to match strings like:

My name is Tslmy.              #Distance should = 0, and groupdict()['a'] outputs "Tslmy"
My name is Tesla Tahomana.     #Distance should = 0(!), and groupdict()['a'] outputs "Tesla Tahomana"
my  naem ist tslmy .           #With a little typo, the distance = 5, and groupdict()['a'] outputs "tslmy "

Here I use groupdict()['a'] to refer to whatever the [ A ] placeholder (actually a named group, (?P<identifier>match)) has captured.

  • Put one way, I'm looking for a "Levenshtein" that allows omissions (skips, blanks) at no cost, and that also returns what was skipped.
  • Put another way, I'm looking for a fuzzy (a.k.a. approximate) regex that is less strict about the pattern but still provides the good old groupdict(), as well as a "fuzziness" value (or "edit distance", needed later to determine the best-matching pattern for a string).
    This is the preferred solution, since it provides a "sufficient" groupdict() if well managed.
    However, the TRE library and the REGEX library, which are the closest solutions I have found, don't seem to provide a "fuzziness" value. If this can be solved, so much the better!
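To make the first formulation concrete, here is a minimal pure-Python sketch of a "Levenshtein with one free gap". It is not taken from TRE or the regex module; the function names `prefix_distances` and `match_with_gap`, and the choice to pass the pattern as a separate prefix and suffix around a single placeholder, are assumptions made for illustration. The gap itself costs nothing, the literal parts are scored with ordinary edit distance, and the skipped text is returned alongside the distance.

```python
def prefix_distances(pattern, text):
    """Return row where row[i] is the Levenshtein distance
    between `pattern` and text[:i]."""
    row = list(range(len(text) + 1))  # empty pattern vs text[:i]
    for p_index, p_char in enumerate(pattern, start=1):
        new_row = [p_index]  # pattern[:p_index] vs empty text
        for t_index, t_char in enumerate(text, start=1):
            new_row.append(min(
                row[t_index] + 1,                        # pattern char missing from text
                new_row[t_index - 1] + 1,                # extra char in text
                row[t_index - 1] + (p_char != t_char),   # match or substitution
            ))
        row = new_row
    return row


def match_with_gap(prefix, suffix, text):
    """Align prefix + <free gap> + suffix against text.

    Returns (distance, captured), where captured is the text
    swallowed by the gap (hypothetical helper, one gap only).
    """
    n = len(text)
    head = prefix_distances(prefix, text)                # head[i]: prefix vs text[:i]
    tail = prefix_distances(suffix[::-1], text[::-1])    # tail[k]: suffix vs last k chars
    best = None
    for i in range(n + 1):            # gap starts at i
        for j in range(i, n + 1):     # gap ends at j
            cost = head[i] + tail[n - j]
            if best is None or cost < best[0]:
                best = (cost, text[i:j])
    return best


print(match_with_gap("My name is ", ".", "My name is Tslmy."))
print(match_with_gap("My name is ", ".", "My name is Tesla Tahomana."))
print(match_with_gap("My name is ", ".", "my  naem ist tslmy ."))
```

The first two calls give a distance of 0 with the name as the captured text. For the misspelled input several alignments can tie, so the captured text depends on the tie-breaking rule (here: the earliest gap start wins). To pick "the best matched pattern", one could run this for every pattern and keep the smallest distance.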

Is that possible? Thanks for paying attention.

Update:

I decided to use the powerful regex module in the end, but I am still unable to get the "fuzziness value".

Since the question on this page is theoretically solved, appending more to it would be improper. So I have posted another question about this new issue, and hope you can solve it!

asked Jun 10 '13 04:06 by tslmy



1 Answer

You could use a RegEx for the basic match:

r"My name is (?P<a>\w+(?: \w+)?)\."

And then use the TRE library to allow for variations.
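For the exact-match half of this suggestion, a minimal sketch with the standard `re` module might look like the following. It uses a variant of the pattern with a named group `a`, an optional second word, and an escaped final period; the helper name `capture_name` is an assumption for illustration. It also shows why a fuzzy layer is still needed: the misspelled input does not match at all.

```python
import re

# One or two words captured as group "a"; the final period is escaped
# so that it only matches a literal ".".
NAME_PATTERN = re.compile(r"My name is (?P<a>\w+(?: \w+)?)\.")


def capture_name(text):
    """Return the captured name, or None if the text does not match exactly."""
    match = NAME_PATTERN.fullmatch(text)
    return match.groupdict()["a"] if match else None


print(capture_name("My name is Tslmy."))            # Tslmy
print(capture_name("My name is Tesla Tahomana."))   # Tesla Tahomana
print(capture_name("my  naem ist tslmy ."))         # None: typos defeat an exact regex
```

If the third-party regex module is available, its fuzzy syntax (for example wrapping the pattern as `(?:...){e<=5}`) can stand in for the exact pattern, and its match objects report error counts through `fuzzy_counts`. That variant is left out here to keep the sketch free of dependencies.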

answered Oct 21 '22 20:10 by joel.d
