Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Whole word matching with dot characters in MySQL

Tags:

regex

mysql

In MySQL, when searching for a keyword in a text field where only "whole word match" is desired, one could use REGEXP and the [[:<:]] and [[:>:]] word-boundary markers:

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]word[[:>:]]"

For example, when we want to find all text fields containing "europe", using

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]europe[[:>:]]"

would return "europe map", but not "european union".

However, when the target matching words contains "dot characters", like "u.s.", how should I submit a proper query? I tried the following queries but none of them look correct.

1.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u.s.[[:>:]]"

2.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"

3.

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\.s\.[[:>:]]"

When using double backslash to escape special characters, as suggested by d'alar'cop, it returns empty, even though there are something like "u.s. congress" in the table

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.[[:>:]]"

Any suggestion is appreciated!

like image 423
user1036719 Avatar asked Aug 09 '13 19:08

user1036719


3 Answers

This regex does what you want:

SELECT name
FROM tbl_name
WHERE name REGEXP '([[:blank:][:punct:]]|^)u[.]s[.]([[:punct:][:blank:]]|$)'

This matches u.s. when preceeded by:

  • a blank (space, tab etc)
  • punctuation (comma, bracket etc)
  • nothing (ie at start of line)

and followed by:

  • a blank (space, tab etc)
  • punctuation (comma, bracket etc)
  • nothing (ie at end of line)

See an SQLFiddle with edge cases covering above points.

like image 196
Bohemian Avatar answered Nov 04 '22 04:11

Bohemian


The fundamental issue with your predicates is that . is a non-word character, and any non-word character will cause the word boundary test to fail if they follow a start test or precede an end test. You can see the behavior here.

To further complicate the issue, the flavor of regular expressions used by MySQL is very limited. According to Regular-Expressions.info, MySQL uses POSIX-ERE which if you read the chart at the bottom Regular Expression Flavor Comparisons has very few capabilities where compared to other flavors.

To solve your problem you must create a new regular expression that will replace the functionality of the word boundary so that it will allow non-word characters to be part of the boundary. I came up with the follow Regular Expression:

(^|[^[:alnum:]_])YOUR_TEXT_HERE($|[^[:alnum:]_])

This is equivalent to the standard regular expression below:

(^|[^a-zA-Z0-9_])YOUR_TEXT_HERE($|[^a-zA-Z0-9_])

The regex searches for non-words characters or string boundaries at the start and end of the text. (^|[^[:alnum:]_]) matches either start of string, an alpha-numeric character, or an underscore. The ending pattern is similar except it matches the end of a string instead of the start.

The pattern was designed to best match the definition of word boundaries from Regular Expressions in the MySQL manual:

[Boundaries] match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore.

Test Results

Using the regex above, I came up with a scenario where I test a string that contains non-word characters at the start and end - .u.s.. I tried to come up with a reasonable set of test items. You can see the results at SQLFiddle.

Test Data

test string not present:                 'no match' 
missing .'s:                             'no us match' 
missing last .:                          'no u.s match' 
missing first .:                         'no us. match' 
test start boundary word character:      'no.u.s.match'   
test end boundary word character:        'no .u.s.match'   
test boundaries word character:          'no.u.s.match'   
test basic success case:                 'yes .u.s. match' 
test start boundary non-word character:  'yes !.u.s. match'   
test end boundary non-word character:    'yes .u.s.! match'   
test boundaries non-word character:      'yes !.u.s.! match' 
test start of line:                      '.u.s.! yes match'   
test end of line:                        'yes match .u.s.' 

Query

SELECT *
FROM TestRegex
WHERE name REGEXP '(^|[^[:alnum:]_])[.]u[.]s[.]($|[^[:alnum:]_])'; 

SQLFiddle

Conclusion

All the positive cases were returned and none of the negative ones => All test cases succeeded.

  • You can use [.] for the period character instead of \\. which I find to be somewhat more readable in the context of a SQL expression.
  • You can adjust the sets used to define the boundary to be more or less restrictive depending on your desires. For example you can restrict some non-word characters as well: [^a-zA-Z_0-9.!?#$].
like image 22
Daniel Gimenez Avatar answered Nov 04 '22 05:11

Daniel Gimenez


Working example here: http://www.sqlfiddle.com/#!2/5aa90d/9/0

SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.([^[:alnum:]]|$)"

Basically saying that u.s. must be followed by anything that isn't an alphanumeric character, or the end of the string.

You could change [:alnum:] to [:alpha:] to include results like This is u.s.5 if that's desirable.

like image 22
asontu Avatar answered Nov 04 '22 04:11

asontu