In MySQL, when searching for a keyword in a text field where only "whole word match" is desired, one could use REGEXP and the [[:<:]] and [[:>:]] word-boundary markers:
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]word[[:>:]]"
For example, when we want to find all text fields containing "europe", using
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]europe[[:>:]]"
would return "europe map", but not "european union".
However, when the target matching words contains "dot characters", like "u.s.", how should I submit a proper query? I tried the following queries but none of them look correct.
1.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u.s.[[:>:]]"
2.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"
3.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\.s\.[[:>:]]"
When using double backslash to escape special characters, as suggested by d'alar'cop, it returns empty, even though there are something like "u.s. congress" in the table
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.[[:>:]]"
Any suggestion is appreciated!
This regex does what you want:
SELECT name
FROM tbl_name
WHERE name REGEXP '([[:blank:][:punct:]]|^)u[.]s[.]([[:punct:][:blank:]]|$)'
This matches u.s.
when preceeded by:
and followed by:
See an SQLFiddle with edge cases covering above points.
The fundamental issue with your predicates is that .
is a non-word character, and any non-word character will cause the word boundary test to fail if they follow a start test or precede an end test. You can see the behavior here.
To further complicate the issue, the flavor of regular expressions used by MySQL is very limited. According to Regular-Expressions.info, MySQL uses POSIX-ERE which if you read the chart at the bottom Regular Expression Flavor Comparisons has very few capabilities where compared to other flavors.
To solve your problem you must create a new regular expression that will replace the functionality of the word boundary so that it will allow non-word characters to be part of the boundary. I came up with the follow Regular Expression:
(^|[^[:alnum:]_])YOUR_TEXT_HERE($|[^[:alnum:]_])
This is equivalent to the standard regular expression below:
(^|[^a-zA-Z0-9_])YOUR_TEXT_HERE($|[^a-zA-Z0-9_])
The regex searches for non-words characters or string boundaries at the start and end of the text. (^|[^[:alnum:]_])
matches either start of string, an alpha-numeric character, or an underscore. The ending pattern is similar except it matches the end of a string instead of the start.
The pattern was designed to best match the definition of word boundaries from Regular Expressions in the MySQL manual:
[Boundaries] match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore.
Using the regex above, I came up with a scenario where I test a string that contains non-word characters at the start and end - .u.s.
. I tried to come up with a reasonable set of test items. You can see the results at
SQLFiddle.
Test Data
test string not present: 'no match'
missing .'s: 'no us match'
missing last .: 'no u.s match'
missing first .: 'no us. match'
test start boundary word character: 'no.u.s.match'
test end boundary word character: 'no .u.s.match'
test boundaries word character: 'no.u.s.match'
test basic success case: 'yes .u.s. match'
test start boundary non-word character: 'yes !.u.s. match'
test end boundary non-word character: 'yes .u.s.! match'
test boundaries non-word character: 'yes !.u.s.! match'
test start of line: '.u.s.! yes match'
test end of line: 'yes match .u.s.'
Query
SELECT *
FROM TestRegex
WHERE name REGEXP '(^|[^[:alnum:]_])[.]u[.]s[.]($|[^[:alnum:]_])';
Conclusion
All the positive cases were returned and none of the negative ones => All test cases succeeded.
[.]
for the period character instead of \\.
which I find to be somewhat more readable in the context of a SQL expression.[^a-zA-Z_0-9.!?#$]
.Working example here: http://www.sqlfiddle.com/#!2/5aa90d/9/0
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.([^[:alnum:]]|$)"
Basically saying that u.s. must be followed by anything that isn't an alphanumeric character, or the end of the string.
You could change [:alnum:]
to [:alpha:]
to include results like This is u.s.5 if that's desirable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With