Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:

Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?

Most (not all) of these strings are from pre-defined patterns as follows:

"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"

This library of patterns is ever-expanding (currently around 1,500 entries) but is manually maintained. The input strings (the first group), though, are largely unpredictable. Most of them will match one of the patterns, but some will not.

So, here's my question: given a string from the first group as input, I need to know which of the known patterns (the second group) it matches. If nothing matches, I need to be told that.

I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
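
A rough sketch of what that regex-building code might look like, assuming each %s should capture any non-empty run of characters and everything else in a template is literal text. The names escapeRegExp, templateToRegExp and findTemplate are made up for illustration, and the templates array stands in for the real library:

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn a printf-style template into an anchored RegExp.
// Assumption: each %s becomes a lazy capture group matching one or more characters.
function templateToRegExp(template) {
  const source = template.split("%s").map(escapeRegExp).join("(.+?)");
  return new RegExp("^" + source + "$");
}

// Stand-in for the real pattern library.
const templates = [
  "Hi there, %s.",
  "What a lovely day today!",
  "Lovely sunset today, %s, isn't it?",
  "Will you be meeting %s today, %s?",
];

// Compile once up front, then reuse for every input string.
const compiled = templates.map((template) => ({
  template,
  regex: templateToRegExp(template),
}));

// Return the first matching template and the captured values, or null if nothing matched.
function findTemplate(input) {
  for (const { template, regex } of compiled) {
    const match = regex.exec(input);
    if (match) {
      return { template, values: match.slice(1) };
    }
  }
  return null;
}

console.log(findTemplate("Will you be meeting Linda today, John?"));
console.log(findTemplate("Good morning."));
```

This checks the templates one by one, so the cost grows linearly with the size of the library; at around 1,500 templates that may well be acceptable, but it has not been measured here.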

Note: The strings I've given here are for illustration purposes. In reality, the strings aren't written by humans; they are computer-generated, human-friendly strings like the ones above, from systems I don't control. Since they aren't manually typed in, we don't need to worry about typos and other human errors. I just need to find which pattern a string matches.

Note 2: I could change the patterns library to some other format, if that makes it easier to construct the regexes. The current structure, with the printf-style %s, isn't set in stone.

730

asked May 07 '13 06:05

Rakesh Pai

1 Answer

I am looking at this as a parsing problem. The idea is that a parser function takes a string and determines whether it is valid.

The string is valid if you can find it among the given patterns. That means you need an index of all the patterns, and it must be a full-text index. It must also match by word position: e.g. it should short-circuit if the first word of the input is not among the first words of the patterns. And it should handle the match-anything case, i.e. the %s in a pattern.

One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on them. This will not match by word position, but you should be able to narrow down to the correct pattern by splitting the input into words and searching for each one. The searches will be very fast because the in-memory database is small. Note that you are looking for the closest match: one or more words (the values that fill a %s) will not match anything. The pattern with the highest number of matching words is the one you want.
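
To illustrate the word-counting idea without setting up a database, here is a minimal sketch that uses a plain Map as a stand-in for the in-memory full-text index. The function names and the tokenizing rules (lowercase, punctuation dropped, %s ignored) are assumptions made for this example, not anything Redis provides:

```javascript
// Split text into lowercase words, ignoring %s placeholders and punctuation.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/%s/g, " ")
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .split(/\s+/)
    .filter(Boolean);
}

// Build an inverted index: word -> set of ids of the patterns that contain it.
function buildWordIndex(patterns) {
  const index = new Map();
  patterns.forEach((pattern, id) => {
    for (const word of new Set(tokenize(pattern))) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(id);
    }
  });
  return index;
}

// Count, per pattern, how many distinct input words it shares,
// and return the pattern with the highest count (null if no word is shared).
function bestCandidate(input, patterns, index) {
  const scores = new Map();
  for (const word of new Set(tokenize(input))) {
    for (const id of index.get(word) || []) {
      scores.set(id, (scores.get(id) || 0) + 1);
    }
  }
  let best = null;
  for (const [id, score] of scores) {
    if (best === null || score > best.score) {
      best = { pattern: patterns[id], score };
    }
  }
  return best;
}

const patterns = [
  "Hi there, %s.",
  "What a lovely day today!",
  "Lovely sunset today, %s, isn't it?",
  "Will you be meeting %s today, %s?",
];
const wordIndex = buildWordIndex(patterns);
console.log(bestCandidate("Lovely sunset today, John, isn't it?", patterns, wordIndex));
```

Because word order is ignored, the result is only a candidate: an input that merely shares a few words with a pattern still gets a score, so a stricter check is needed before reporting a match.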

An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave, as a JavaScript object.

{
    "Hi": {"there": {"%s": null}},
    "What": {"a": {"lovely": {"day": {"today": null}}}},
    "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
    "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}

This index descends recursively by word position: search for the first word and, if it is found, search for the next word within the object returned for the first, and so on. Identical words at a given level share a single key. You also need to handle the match-anything case (%s). In memory, this should be blindingly fast.
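
Here is one way such an index could be built and searched, as a sketch under a few assumptions of mine: words are lowercased and stripped of punctuation (so "isn't" becomes "isnt" and a missing comma does not prevent a match), a %s may stand for one or more words, and a made-up "$end" key stores the original pattern instead of null so that one pattern can be a prefix of another. The function names are illustrative only:

```javascript
const WILDCARD = "%s";
const END = "$end"; // made-up marker key; cannot clash with a word because words keep only letters and digits

// Lowercase, split on whitespace, keep %s placeholders, strip punctuation from other words.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((word) => (word.includes(WILDCARD) ? WILDCARD : word.replace(/[^\p{L}\p{N}]/gu, "")))
    .filter(Boolean);
}

// Build the nested index: one level per word position, identical words share a key.
function buildIndex(patterns) {
  const root = Object.create(null);
  for (const pattern of patterns) {
    let node = root;
    for (const word of tokenize(pattern)) {
      if (!(word in node)) node[word] = Object.create(null);
      node = node[word];
    }
    node[END] = pattern; // remember which pattern ends here
  }
  return root;
}

// Descend word by word; try the literal word first, then fall back to %s.
function lookup(node, words, position = 0) {
  if (position === words.length) return node[END] ?? null;
  const word = words[position];
  if (word in node) {
    const found = lookup(node[word], words, position + 1);
    if (found !== null) return found;
  }
  if (WILDCARD in node) {
    // Assumption: a %s may swallow one or more consecutive words.
    for (let next = position + 1; next <= words.length; next++) {
      const found = lookup(node[WILDCARD], words, next);
      if (found !== null) return found;
    }
  }
  return null;
}

// Return the matching pattern, or null if the input matches none.
function findPattern(index, input) {
  return lookup(index, tokenize(input));
}

const index = buildIndex([
  "Hi there, %s.",
  "What a lovely day today!",
  "Lovely sunset today, %s, isn't it?",
  "Will you be meeting %s today, %s?",
]);
console.log(findPattern(index, "Will you be meeting Linda today, John?"));
console.log(findPattern(index, "What a lovely day tomorrow!"));
```

A lookup only walks the branches that agree with the input so far, so the first mismatching word usually ends the search, which is the short-circuit behaviour described above.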

131

answered Oct 13 '22 06:10

Santosh

Tags: regex, pattern-matching, node.js
