Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:

  1. Hi there, John.
  2. What a lovely day today!
  3. Lovely sunset today, John, isn't it?
  4. Will you be meeting Linda today, John?

Most (not all) of these strings are from pre-defined patterns as follows:

  1. "Hi there, %s."
  2. "What a lovely day today!"
  3. "Lovely sunset today, %s, isn't it?"
  4. "Will you be meeting %s today, %s?"

This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) is largely unpredictable. Though most of them will match one of the patterns, some of them will not.

So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.

I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.

Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.

Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.

like image 730
Rakesh Pai Avatar asked May 07 '13 06:05

Rakesh Pai


People also ask

How do you check if a string matches a pattern in Python?

Method : Using join regex + loop + re.match() In this, we create a new regex string by joining all the regex list and then match the string against it to check for match using match() with any of the element of regex list.

How do you check whether a string matches a regex in Java?

Variant 1: String matches() This method tells whether or not this string matches the given regular expression. An invocation of this method of the form str. matches(regex) yields exactly the same result as the expression Pattern. matches(regex, str).

How do I match a pattern in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).

What property can we use to check if a string matches a regular expression?

The method str. match(regexp) finds matches for regexp in the string str . If the regexp has flag g , then it returns an array of all matches as strings, without capturing groups and other details. If there are no matches, no matter if there's flag g or not, null is returned.


1 Answers

I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.

The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full text index. Also it must match according to the word position. eg. it should short circuit if the first word of the input is not found among the first word of the patterns. It should take care of the any match ie %s in the pattern.

One solution is to put the patterns in an in memory database (eg. redis) and do a full text index on it. (this will not match according to word position) but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in memory database. Also note that you are looking for the closest match. One or more words will not match. The highest number of matches is the pattern you want.

An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.

{
    "Hi": { "there": {"%s": null}},
    "What: {"a": {"lovely": {"day": {"today": null}}}},
    "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
    "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}

This index is recursive descending according to the word postion. So search for the first word, if found search for the next within the object returned by the first and so on. Same words at a given level will have only one key. You should also match the any case. This should be blinding fast in memory.

like image 131
Santosh Avatar answered Oct 13 '22 06:10

Santosh