Case-insenstive ordered word search via regular expression

Tags:

I just started off with regular expression in perl. After playing around through various online tutorials, I wanted a write a regular expression that matches order specified case insensitive word match.

I'm trying to determine if string "A" consists of a word or a sequence of words of string "B", and I want to do this case-insensitively.

For example, if string "B" is "John Von Neumann", then "JOhn", "Von NeumaNn", "VoN", "john neuMann" would be a match, but strings like "Joh", "NeumaNn VoN", "Vonn" would not be a match.

I am not sure how to do this with regular expressions, any idea?

939

asked Feb 10 '13 17:02

Quixotic

2 Answers

Let's ignore case for a second.

John Von Neumann

can be matched by

John Von Neumann    1 1 1
John Von            1 1 0
John     Neumann    1 0 1
John                1 0 0
     Von Neumann    0 1 1
     Von            0 1 0
         Neumann    0 0 1

So the regex pattern for which you are looking is

/^(?:John Von Neumann|John Von|John Newmann|John|...)\z/i

Here's how you can build the list:

sub true_indexes {
   my ($n) = @_;
   my $i = 0;
   my @indexes;
   while ($n) {
      push @indexes, $i if $n & 1;
      ++$i;
      $n >>= 1;
   }
   return @indexes;
}

my @words = split(' ', 'John Von Neumann');

my @patterns;
unshift @patterns, join ' ', @words[ true_indexes($_) ]
   for 1 .. (2**@words)-1;

And finally, we can generate the pattern:

my $pat = join '|', map quotemeta, @patterns;
my $re = qr/$pat/i;

You'd use it like such:

if ($input =~ /^$re\z/) {
   print "match\n";
} else {
   print "no match\n";
}

157

answered Nov 05 '22 14:11

ikegami

ikegami's solution will take up exponential space for to store the string before it is turned into regex (each word will appear 2^{n - 1} times, where n is the number of words, so the total space is at least 2^{n - 1} * Sum(length of words)). This is not related to the regex engine - since the problem is before the string is turned into regex.

A regex construction equivalent (in term of the set of strings that it matches) to ikegami's solution would be:

^(?=[^ ])(?:(?: |^)John(?= |\z))?+(?:(?: |^)Von(?= |\z))?+(?:(?: |^)Neumann(?= |\z))?+\z

This only takes up linear space, in term of the number of words and the total length of all the words.

For clarity:

^
(?=[^ ])
(?:(?: |^)John(?= |\z))?+
(?:(?: |^)Von(?= |\z))?+
(?:(?: |^)Neumann(?= |\z))?+
\z

The look-ahead assertion (?=[^ ]) has 2 purposes: prevent the empty string from being matched, and make sure the first character is not a space character.

Note the ?+, which makes the quantifier possessive (or atomic), since we don't need backtracking here. Why? If we were to do this normally, we will loop through the list of words and compare it with the left most word in the input. After we find a match, we will continue the loop to compare it with the next word in the input, until all words in the input has been found or we have finished looping the list of words.

The possessive quantifier also prevent backtracking hell from happening. If a word is considered a match, it will never be reconsidered again.

For each word, they can be preceded by a space, or it is the start of the string. The look-ahead assertion (?= |\z) has the purpose of making sure words with same prefix doesn't get matched incorrectly at first try (e.g. "John Von Vone", try to match "John Vone").

Since there is no backtracking, the worst case performance is linear in term of length of all the words and the length of the input string (the same as how you would do it without regex).

We can change the regex a little bit to allow flexible spacing:

^(?= *+[^ ])(?: *+John(?= |\z))?+(?: *+Von(?= |\z))?+(?: *+Neumann(?= |\z))?+ *+\z

For clarity (leading space is significant):

^
(?= *+[^ ])
(?: *+John(?= |\z))?+
(?: *+Von(?= |\z))?+
(?: *+Neumann(?= |\z))?+
 *+
\z

The look-ahead (?= *+[^ ]) at the beginning makes sure the input string does not contain only spaces.

The regex is changed to allow any amount of spaces precede a word (backtracking disallowed by possessive quantifier). 0 or more quantifier * is used, for the case the word is right at the beginning of the string. There is no chance for 2 words to collide, due to the look-ahead assertion (?= |\z).

It still takes up linear space when constructing the string (before entering it to the regex engine). Worst case performance is also linear.

Extreme cases

The original words:

aaaaaaaaaaaaaaaaaaa0 aaaaaaaaaaaaaaaaaaa1 ... aaaaaaaaaaaaaaaaaaa9 aaaaaaaaaaaaaaaaaaaa ... aaaaaaaaaaaaaaaaaaaz aaaaaaaaaaaaaaaaaaaA ... aaaaaaaaaaaaaaaaaaaZ

(Each word is 20 characters, the last character changes from 0-9, then a-z, then A-Z)

Strings to search for (not matching):

aaaaaaaaaaaaaaaaaaaz aaaaaaaaaaaaaaaaaaay

(y can only come before z)

The original word:

patterns used in Perl pattern matching evolved from those supplied

(Some normal words)

Strings to search for (not matching):

patterns used in Perl pattern matching evolved from those suppliedd

(Extra d at the end)

The original word:

aaaaaaaaaaaa aaaaaaaaaaa aaaaaaaaaa aaaaaaaaa aaaaaaaa aaaaaaa aaaaaa aaaaa aaaa

(Word contains only a, with different length.)

Strings to search for (not matching):

aaaaaaaaaaaa aaaaaaaaaaa aaaaaaaaaa aaaaaaaaa aaaaaaaa aaaaaaa aaaaaa aaaaa aaaaa

(Extra a at the end)

The original word:

performance abc xyz performance 456 !@# performance

(Same word appearing multiple times)

Strings to search for (not matching):

performance performance performance performance

answered Nov 05 '22 14:11

nhahtdh

Related questions
                            
                                Explain this UTF-8 detection regex
                            
                                python match only captures first and last group - am I misunderstanding something?
                            
                                Regular Expression to Match a Single CSS Property
                            
                                java regular expression returning false
                            
                                regex that matches any positive or negative numeric value but no characters or mixed strings [closed]
                            
                                regular expression \Z(?ms)
                            
                                Regex for matching season and episode
                            
                                Android Java Regex Match
                            
                                Python regex uppercase unicode word
                            
                                Sort a list by digits appearing after trash digits in VIM
                            
                                Algorithm to generate context free grammar from any regex
                            
                                Regular expression which allow only characters or space
                            
                                Replace a character with different characters depending on which character it is
                            
                                Regular expression to get page title
                            
                                regex get a match
                            
                                How to replace every number in a string with a different number? [duplicate]
                            
                                phpstorm regex search and replace
                            
                                python regex: replacing <number>st, <number>nd, <number>th etc in a adress with a single sub
                            
                                Getting substring between two separators in an arbitrary position
                            
                                Javascript regex to remove string inside bracket

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Case-insenstive ordered word search via regular expression

Tags:

string

regex

search

perl

Quixotic

People also ask

2 Answers

ikegami

nhahtdh

Recent Activity

Donate For Us