Regular Expression Wildcard Matching

Tags: java, regex

I have a list of about 120 thousand English words (basically every word in the language).

I need a regular expression that would allow searching through these words using wildcard characters, a.k.a. * and ?.

A few examples:

  • if the user searches for m?st*, it would match, for example, master, mister or mystery.
  • if the user searches for *ind (any word ending in ind), it would match wind or bind or blind or grind.

Now, most users (especially the ones who are not familiar with regular expressions) know that ? stands for exactly one character, while * stands for zero or more characters. I absolutely want to build my search feature based on this.

My question is: how do I convert what the user types (m?st* for example) to a regular expression?

I searched the web (obviously including this website) and all I could find were tutorials that tried to teach me too much or questions that were somewhat similar, but not enough as to provide an answer to my own problem.

All I could figure out was that I have to replace ? with . (a dot), so m?st* becomes m.st*. However, I have no idea what to replace * with.

Any help would be greatly appreciated. Thank you.

PS: I'm totally new to regular expressions. I know how powerful they can be, but I also know they can be very hard to learn. So I just never took the time to do it...

Radu Murzea asked May 09 '12 16:05


2 Answers

Unless you want some surprising behaviour, I would recommend you use \w instead of . (dot).

. matches any character, including whitespace and other non-word symbols, which you might not want it to do. \w matches only letters, digits and the underscore.

So I would replace ? with \w and replace * with \w*

Also, if you want * to match at least one character, replace it with \w+ instead. This would mean that ben* would match bend and bending but not ben. Which one you pick depends on your requirements.
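For what it's worth, here is a minimal sketch of that replacement in Java. The class name WordWildcard, the method names and the starNeedsOneChar flag are all made up for illustration, and I'm assuming the whole word has to match the pattern (hence matches() rather than find()). Every character other than ? and * is quoted, so it is always treated literally.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: turns a user wildcard into a regex built from \w.
final class WordWildcard {

    // ? becomes \w, * becomes \w* (or \w+ when starNeedsOneChar is true).
    // Any other character is quoted so it cannot act as a regex metacharacter.
    static Pattern compile(String wildcard, boolean starNeedsOneChar) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append("\\w");
            } else if (c == '*') {
                regex.append(starNeedsOneChar ? "\\w+" : "\\w*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Keeps only the words that match the wildcard from start to end.
    static List<String> search(List<String> words, String wildcard, boolean starNeedsOneChar) {
        Pattern pattern = compile(wildcard, starNeedsOneChar);
        return words.stream()
                .filter(word -> pattern.matcher(word).matches())
                .collect(Collectors.toList());
    }
}
```

Calling WordWildcard.search(words, "ben*", false) keeps both ben and bend, while passing true for the last argument drops ben, which is the \w* versus \w+ choice described above.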

gnomed answered Sep 23 '22 05:09


Take a look at this library: https://github.com/alenon/JWildcard

It wraps every non-wildcard part of the pattern in regex quotes (\Q...\E), so no special-character processing is needed. This wildcard:

"mywil?card*" 

will be converted to this regex string:

"\Qmywil\E.\Qcard\E.*" 
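If you would rather not add a dependency, the same quoting idea can be sketched with nothing but java.util.regex. To be clear, this is not JWildcard's actual source, just my guess at an equivalent; the class name QuotedWildcard and its methods are hypothetical. It collects runs of literal characters and passes each run through Pattern.quote, which is what produces the \Q...\E wrappers.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the quoting approach; not part of JWildcard.
final class QuotedWildcard {

    // Converts ? to . and * to .* and quotes every literal run in between.
    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?' || c == '*') {
                flush(literal, regex);
                regex.append(c == '?' ? "." : ".*");
            } else {
                literal.append(c);
            }
        }
        flush(literal, regex);
        return regex.toString();
    }

    // Appends the pending literal run as a quoted block, then clears it.
    private static void flush(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    // Convenience check: does the whole text match the wildcard?
    static boolean matches(String wildcard, String text) {
        return Pattern.matches(toRegex(wildcard), text);
    }
}
```

Because the literal parts are quoted, a pattern such as a.b? treats the dot as a plain dot and not as "any character".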

If you wish to convert a wildcard to a regex string, use:

JWildcard.wildcardToRegex("mywil?card*"); 

If you wish to check the matching directly you can use this:

JWildcard.matches("mywild*", "mywildcard"); 

Default wildcard rules are "?" -> "." and "*" -> ".*", but you can change the default behaviour if you wish, by simply defining new rules.

JWildcard.wildcardToRegex(wildcard, rules, strict); 

You can use the sources, or pull it in with Maven or Gradle from Bintray JCenter: https://bintray.com/yevdo/jwildcard/jwildcard

Gradle way:

compile 'com.yevdo:jwildcard:1.4' 

Maven way:

<dependency>
  <groupId>com.yevdo</groupId>
  <artifactId>jwildcard</artifactId>
  <version>1.4</version>
</dependency>

lenon answered Sep 21 '22 05:09
