T-SQL pattern matching with exceptions

Tags:

sql-server

tsql

dataexplorer

Here's a problem I've repeatedly encountered while playing with the Stack Exchange Data Explorer, which is based on T-SQL:

How to search for a string except when it occurs as a substring of some other string?

For example, how can I select all records in a table MyTable where the column MyCol contains the string foo, but ignoring any foos that are part of the string foobar?

A quick and dirty attempt would be something like:

SELECT * 
FROM MyTable 
WHERE MyCol LIKE '%foo%' 
  AND MyCol NOT LIKE '%foobar%'

but obviously this will fail to match e.g. MyCol = 'not all foos are foobars', which I do want to match.
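
To make the failure concrete, here is a minimal sketch that runs the naive filter against a few made-up rows. The table variable @MyTable and its contents are assumptions for illustration only; they stand in for the real MyTable.

```sql
-- Hypothetical sample rows; @MyTable stands in for MyTable.
DECLARE @MyTable TABLE (MyCol varchar(100));

INSERT INTO @MyTable (MyCol)
VALUES ('just foo'),                   -- wanted: a bare foo
       ('only foobar'),                -- not wanted: foo occurs only inside foobar
       ('not all foos are foobars'),   -- wanted: a bare foo next to a foobar
       ('nothing relevant');           -- not wanted: no foo at all

-- Naive filter: returns only 'just foo'.
-- 'not all foos are foobars' is rejected by the NOT LIKE term,
-- even though it contains a foo that is not part of foobar.
SELECT MyCol
FROM   @MyTable
WHERE  MyCol LIKE '%foo%'
  AND  MyCol NOT LIKE '%foobar%';
```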

One solution I've come up with is to replace all occurrences of foobar with some dummy marker (that is not a substring of foo) and then check for any remaining foos, as in:

SELECT * 
FROM MyTable 
WHERE REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

This works, but I suspect it's not very efficient, since it has to run REPLACE() on every record in the table. (For SEDE, this would typically be the Posts table, which currently has about 30 million rows.) Are there any better ways to do this?

(FWIW, the real use case that prompted this question was searching for SO posts with image URLs that use the http:// scheme prefix but do not point to the host i.stack.imgur.com.)
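
Applied to that use case, the REPLACE approach might look like the sketch below. The table variable @Posts, its Id and Body columns, and the sample HTML are assumptions made for illustration; the exact strings to search for would depend on how the image tags are actually written in the post bodies.

```sql
-- Hypothetical sample; @Posts stands in for the SEDE Posts table.
DECLARE @Posts TABLE (Id int, Body nvarchar(max));

INSERT INTO @Posts (Id, Body)
VALUES (1, N'<img src="http://i.stack.imgur.com/a.png">'),
       (2, N'<img src="http://example.com/b.png">'),
       (3, N'<img src="http://i.stack.imgur.com/a.png"> <img src="http://example.com/b.png">'),
       (4, N'<img src="https://example.com/c.png">');

-- Mask http:// image URLs on i.stack.imgur.com with a marker,
-- then look for any http:// image URL that is left over.
-- Returns Ids 2 and 3.
SELECT Id
FROM   @Posts
WHERE  REPLACE(Body, N'src="http://i.stack.imgur.com/', N'X') LIKE N'%src="http://%';
```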

asked Feb 01 '16 11:02

Ilmari Karonen

3 Answers

Neither of the approaches given so far is guaranteed to work as advertised, that is, to perform the REPLACE on only a subset of rows.

SQL Server does not guarantee short circuiting of predicates and can move compute scalars up into the underlying query for derived tables and CTEs.

The only thing that is (mostly) guaranteed to work is the CASE expression. Below I use IIF, which is syntactic sugar that expands to CASE:

SELECT *
FROM   MyTable
WHERE  1 = IIF(MyCol LIKE '%foo%', 
               IIF(REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%', 1, 0), 
               0);
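
For reference, the same filter written with explicit CASE expressions might look like the sketch below. The table variable @MyTable and its rows are made up for illustration; only the WHERE clause corresponds to the query above.

```sql
-- Hypothetical sample rows; @MyTable stands in for MyTable.
DECLARE @MyTable TABLE (MyCol varchar(100));

INSERT INTO @MyTable (MyCol)
VALUES ('just foo'),
       ('only foobar'),
       ('not all foos are foobars'),
       ('nothing relevant');

-- The inner CASE (and its REPLACE) sits behind the outer WHEN condition.
-- Returns 'just foo' and 'not all foos are foobars'.
SELECT MyCol
FROM   @MyTable
WHERE  1 = CASE
               WHEN MyCol LIKE '%foo%'
               THEN CASE
                        WHEN REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%' THEN 1
                        ELSE 0
                    END
               ELSE 0
           END;
```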

answered Nov 03 '22 04:11

Martin Smith

A three-stage filter should work:

1. Collect all rows matching '%foo%'.
2. Replace all instances of 'foobar' with a marker string that cannot form part of 'foo' (such as 'X').
3. Check again for rows matching '%foo%'.

Here you only perform the REPLACE on potentially matching rows, not all rows. If you are expecting only a small percentage of matches, this should be much more efficient.

SQL would look like this:

;with data as (
    select * 
    from MyTable 
    where MyCol like '%foo%'      
)
select *
from data
where replace(MyCol, 'foobar', 'X') like '%foo%'

Note that a sub-query is required, as SQL does not short-circuit expressions; the engine is free to reorder Boolean terms as desired for efficient processing within a single query level.

answered Nov 03 '22 04:11

Pieter Geerkens

This will be faster than your current query:

SELECT * 
FROM MyTable 
WHERE 
  MyCol like '%foo%' AND
  REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'

The REPLACE is calculated only after the first condition on MyCol has been applied, so this is faster than just:

REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'
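
One way to check that the extra LIKE term does not change which rows come back is to compare the two filters with EXCEPT, as in the sketch below. This says nothing about speed. The table variable @MyTable and its rows are made up for illustration. Because the marker 'X' cannot be part of a match for foo, any foo that survives the REPLACE was already present in MyCol, so the first term is logically redundant and the result should be empty.

```sql
-- Hypothetical sample rows; @MyTable stands in for MyTable.
DECLARE @MyTable TABLE (MyCol varchar(100));

INSERT INTO @MyTable (MyCol)
VALUES ('just foo'),
       ('only foobar'),
       ('not all foos are foobars'),
       ('nothing relevant');

-- Rows matched by the plain REPLACE filter but not by the two-term filter.
-- Returns no rows, so both filters select the same values here.
SELECT MyCol
FROM   @MyTable
WHERE  REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%'
EXCEPT
SELECT MyCol
FROM   @MyTable
WHERE  MyCol LIKE '%foo%'
  AND  REPLACE(MyCol, 'foobar', 'X') LIKE '%foo%';
```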

answered Nov 03 '22 03:11

t-clausen.dk
