Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Case-insenstive ordered word search via regular expression

I just started off with regular expression in perl. After playing around through various online tutorials, I wanted a write a regular expression that matches order specified case insensitive word match.

I'm trying to determine if string "A" consists of a word or a sequence of words of string "B", and I want to do this case-insensitively.

For example, if string "B" is "John Von Neumann", then "JOhn", "Von NeumaNn", "VoN", "john neuMann" would be a match, but strings like "Joh", "NeumaNn VoN", "Vonn" would not be a match.

I am not sure how to do this with regular expressions, any idea?

like image 939
Quixotic Avatar asked Feb 10 '13 17:02

Quixotic


People also ask

How do you make a word case insensitive in regular expressions?

Regular expression pattern matching may be case sensitive. To make a word or phrase case insensitive, use the regular expression /i . For example, /bad language/i will block all instances of "bad language", regardless of case.

How do I search for a word in a regular expression?

Choose View->Show Search Options to show the Search Options pane. Expand the Search Mode drop-down and choose Regular Expressions or MS Word Wildcards. You will notice that an icon will appear next to the Source Term and Target Term fields to indicate that you are in the selected mode.

How do you make a case sensitive in regex?

In Java, by default, the regular expression (regex) matching is case sensitive. To enable the regex case insensitive matching, add (?) prefix or enable the case insensitive flag directly in the Pattern. compile() .

Are regex expressions case sensitive?

By default, the comparison of an input string with any literal characters in a regular expression pattern is case-sensitive, white space in a regular expression pattern is interpreted as literal white-space characters, and capturing groups in a regular expression are named implicitly as well as explicitly.


2 Answers

Let's ignore case for a second.

John Von Neumann

can be matched by

John Von Neumann    1 1 1
John Von            1 1 0
John     Neumann    1 0 1
John                1 0 0
     Von Neumann    0 1 1
     Von            0 1 0
         Neumann    0 0 1

So the regex pattern for which you are looking is

/^(?:John Von Neumann|John Von|John Newmann|John|...)\z/i

Here's how you can build the list:

sub true_indexes {
   my ($n) = @_;
   my $i = 0;
   my @indexes;
   while ($n) {
      push @indexes, $i if $n & 1;
      ++$i;
      $n >>= 1;
   }
   return @indexes;
}

my @words = split(' ', 'John Von Neumann');

my @patterns;
unshift @patterns, join ' ', @words[ true_indexes($_) ]
   for 1 .. (2**@words)-1;

And finally, we can generate the pattern:

my $pat = join '|', map quotemeta, @patterns;
my $re = qr/$pat/i;

You'd use it like such:

if ($input =~ /^$re\z/) {
   print "match\n";
} else {
   print "no match\n";
}
like image 157
ikegami Avatar answered Nov 05 '22 14:11

ikegami


ikegami's solution will take up exponential space for to store the string before it is turned into regex (each word will appear 2n - 1 times, where n is the number of words, so the total space is at least 2n - 1 * Sum(length of words)). This is not related to the regex engine - since the problem is before the string is turned into regex.


A regex construction equivalent (in term of the set of strings that it matches) to ikegami's solution would be:

^(?=[^ ])(?:(?: |^)John(?= |\z))?+(?:(?: |^)Von(?= |\z))?+(?:(?: |^)Neumann(?= |\z))?+\z

This only takes up linear space, in term of the number of words and the total length of all the words.

For clarity:

^
(?=[^ ])
(?:(?: |^)John(?= |\z))?+
(?:(?: |^)Von(?= |\z))?+
(?:(?: |^)Neumann(?= |\z))?+
\z

The look-ahead assertion (?=[^ ]) has 2 purposes: prevent the empty string from being matched, and make sure the first character is not a space character.

Note the ?+, which makes the quantifier possessive (or atomic), since we don't need backtracking here. Why? If we were to do this normally, we will loop through the list of words and compare it with the left most word in the input. After we find a match, we will continue the loop to compare it with the next word in the input, until all words in the input has been found or we have finished looping the list of words.

The possessive quantifier also prevent backtracking hell from happening. If a word is considered a match, it will never be reconsidered again.

For each word, they can be preceded by a space, or it is the start of the string. The look-ahead assertion (?= |\z) has the purpose of making sure words with same prefix doesn't get matched incorrectly at first try (e.g. "John Von Vone", try to match "John Vone").

Since there is no backtracking, the worst case performance is linear in term of length of all the words and the length of the input string (the same as how you would do it without regex).


We can change the regex a little bit to allow flexible spacing:

^(?= *+[^ ])(?: *+John(?= |\z))?+(?: *+Von(?= |\z))?+(?: *+Neumann(?= |\z))?+ *+\z

For clarity (leading space is significant):

^
(?= *+[^ ])
(?: *+John(?= |\z))?+
(?: *+Von(?= |\z))?+
(?: *+Neumann(?= |\z))?+
 *+
\z

The look-ahead (?= *+[^ ]) at the beginning makes sure the input string does not contain only spaces.

The regex is changed to allow any amount of spaces precede a word (backtracking disallowed by possessive quantifier). 0 or more quantifier * is used, for the case the word is right at the beginning of the string. There is no chance for 2 words to collide, due to the look-ahead assertion (?= |\z).

It still takes up linear space when constructing the string (before entering it to the regex engine). Worst case performance is also linear.


Extreme cases

  1. The original words:

    aaaaaaaaaaaaaaaaaaa0 aaaaaaaaaaaaaaaaaaa1 ... aaaaaaaaaaaaaaaaaaa9 aaaaaaaaaaaaaaaaaaaa ... aaaaaaaaaaaaaaaaaaaz aaaaaaaaaaaaaaaaaaaA ... aaaaaaaaaaaaaaaaaaaZ
    

    (Each word is 20 characters, the last character changes from 0-9, then a-z, then A-Z)

    Strings to search for (not matching):

    aaaaaaaaaaaaaaaaaaaz aaaaaaaaaaaaaaaaaaay
    

    (y can only come before z)

  2. The original word:

    patterns used in Perl pattern matching evolved from those supplied
    

    (Some normal words)

    Strings to search for (not matching):

    patterns used in Perl pattern matching evolved from those suppliedd
    

    (Extra d at the end)

  3. The original word:

    aaaaaaaaaaaa aaaaaaaaaaa aaaaaaaaaa aaaaaaaaa aaaaaaaa aaaaaaa aaaaaa aaaaa aaaa
    

    (Word contains only a, with different length.)

    Strings to search for (not matching):

    aaaaaaaaaaaa aaaaaaaaaaa aaaaaaaaaa aaaaaaaaa aaaaaaaa aaaaaaa aaaaaa aaaaa aaaaa
    

    (Extra a at the end)

  4. The original word:

    performance abc xyz performance 456 !@# performance
    

    (Same word appearing multiple times)

    Strings to search for (not matching):

    performance performance performance performance
    
like image 34
nhahtdh Avatar answered Nov 05 '22 14:11

nhahtdh