Here's a problem I've repeatedly encountered while playing with the Stack Exchange Data Explorer, which is based on T-SQL:
How to search for a string except when it occurs as a substring of some other string?
For example, how can I select all records in a table MyTable where the column MyCol contains the string foo, but ignoring any foos that are part of the string foobar?
A quick and dirty attempt would be something like:
SELECT *
FROM MyTable
WHERE MyCol LIKE '%foo%'
AND MyCol NOT LIKE '%foobar%'
but obviously this will fail to match e.g. MyCol = 'not all foos are foobars', which I do want to match.
One solution I've come up with is to replace all occurrences of foobar with some dummy marker (that is not a substring of foo) and then check for any remaining foos, as in:
SELECT *
FROM MyTable
WHERE REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'
This works, but I suspect it's not very efficient, since it has to run the REPLACE() on every record in the table. (For SEDE, this would typically be the Posts table, which currently has about 30 million rows.) Are there any better ways to do this?
(FWIW, the real use case that prompted this question was searching for SO posts with image URLs that use the http:// scheme prefix but do not point to the host i.stack.imgur.com.)
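To make the difference between the two queries concrete, here is a small sketch that runs both against a throwaway table. It uses Python's sqlite3 module as a stand-in for T-SQL (an assumption: SQLite's LIKE and REPLACE behave the same way for this lowercase ASCII data), and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyCol TEXT)")

# Hypothetical sample rows; only the third one comes from the question.
sample_rows = [
    "just foo",
    "only foobar here",
    "not all foos are foobars",
    "nothing relevant",
]
conn.executemany("INSERT INTO MyTable VALUES (?)", [(r,) for r in sample_rows])

# Quick and dirty attempt: rejects every row that mentions foobar at all.
naive = [r[0] for r in conn.execute(
    "SELECT MyCol FROM MyTable "
    "WHERE MyCol LIKE '%foo%' AND MyCol NOT LIKE '%foobar%' "
    "ORDER BY MyCol"
)]

# Marker approach: hide every foobar, then look for a remaining foo.
marker = [r[0] for r in conn.execute(
    "SELECT MyCol FROM MyTable "
    "WHERE REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%' "
    "ORDER BY MyCol"
)]

print(naive)
print(marker)
```

The naive query drops the row that contains both foo and foobar, while the marker query keeps it. This says nothing about performance on SQL Server; it only checks which rows each predicate selects.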
SQL pattern matching with LIKE lets you use _ to match any single character and % to match an arbitrary number of characters (including zero characters). In MySQL, SQL patterns are case-insensitive by default; in SQL Server, case sensitivity depends on the collation.
LIKE is the standard pattern matching operator in SQL. Some database systems also support regular expression matching for more expressive patterns, for example through a REGEXP_LIKE(column_name, 'regex') function in Oracle and MySQL 8; availability varies by product and version.
LIKE is commonly used in a WHERE clause to search for a specified pattern in a column. It is useful when a comparison needs pattern matching rather than a test for equality or inequality: a row is returned if the character string in the column matches the specified pattern.
Neither of the other approaches given here is guaranteed to work as advertised and perform the REPLACE on only a subset of rows.
SQL Server does not guarantee short-circuiting of predicates, and it can move compute scalars up into the underlying query for derived tables and CTEs.
The only thing that is (mostly) guaranteed to work is the CASE expression. Below I use IIF, which is syntactic sugar that expands out to CASE:
SELECT *
FROM MyTable
WHERE 1 = IIF(MyCol LIKE '%foo%',
              IIF(REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%', 1, 0),
              0);
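For reference, the nested IIF above corresponds to a nested CASE expression. The sketch below spells that form out and runs it through Python's sqlite3 module as a stand-in for SQL Server (an assumption). It only checks which rows come back on some made-up sample data; it does not demonstrate anything about SQL Server's evaluation order.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyCol TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?)",
    [
        ("just foo",),
        ("only foobar here",),
        ("not all foos are foobars",),
        ("nothing relevant",),
    ],  # hypothetical sample rows
)

# Same logic as the nested IIF, written as explicit CASE expressions.
case_query = """
SELECT MyCol
FROM MyTable
WHERE 1 = CASE
            WHEN MyCol LIKE '%foo%' THEN
              CASE
                WHEN REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%' THEN 1
                ELSE 0
              END
            ELSE 0
          END
ORDER BY MyCol
"""
case_rows = [r[0] for r in conn.execute(case_query)]
print(case_rows)
```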
A three-stage filter should work:
1. Collect all rows matching '%foo%'.
2. Replace all instances of 'foobar' with a non-occurring string (such as '' perhaps).
3. Check again for matching '%foo%'.
Here you only perform the REPLACE on potentially matching rows, not all rows. If you are expecting only a small percentage of matches, this should be much more efficient.
SQL would look like this:
;with data as (
select *
from MyTable
where MyCol like '%foo%'
)
select *
from data
where replace(MyCol, 'foobar', 'X') like '%foo%'
Note that a sub-query is required, as there is no expression short-circuiting in SQL; the engine is free to reorder Boolean terms as desired for efficient processing within a single query level.
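One caveat on the replacement step: the marker must not be able to combine with the surrounding text to form a new foo. An empty string does not satisfy that, whereas a marker such as 'X' does. The sketch below illustrates this using Python's str.replace as a stand-in for T-SQL REPLACE; the function name and the sample string are made up for illustration.

```python
def has_foo_outside_foobar(text: str, marker: str) -> bool:
    """Hide every 'foobar' behind `marker`, then look for a remaining 'foo'."""
    return "foo" in text.replace("foobar", marker)


# Hypothetical input: the only 'foo' here is the one inside 'foobar'.
tricky = "fofoobaro"

# With an empty marker, the leftover pieces "fo" and "o" join into "foo".
empty_marker_result = has_foo_outside_foobar(tricky, "")

# With a marker that shares no characters with 'foo', nothing new is formed.
x_marker_result = has_foo_outside_foobar(tricky, "X")

print(empty_marker_result, x_marker_result)
```

With the empty marker the check reports a match that is not really there, so a marker character that cannot take part in foo is the safer choice.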
This will be faster than your current query:
SELECT *
FROM MyTable
WHERE
MyCol like '%foo%' AND
REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'
The REPLACE is calculated after the MyCol LIKE '%foo%' condition has been applied, so this is faster than using just:
REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'