Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you combine similar strings for counting purposes in SQL Server

I built a query that finds the longest common substrings of a column and orders them by frequency. The problem I'm having is removing/grouping similar results.

Here's the TOP 5 output from the code below - note how "I love mittens the cat" is the longest, most frequent string, but the code also finds all subsidiaries of that string such as "I love mittens the ca" or "I love mittens the c".

    I love Mittens the cat  3
    I love Mittens the ca   3
     love Mittens the cat   3
     love Mittens the ca    3
    I love Mittens the c    3

If possible, I want to remove any substrings that are similar to others that have partial words. Row 3 would be fine since it's all full words, but rows 4 and 5 should be removed since they're similar to row 1.

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE     @MaxLength INT          = 50    --Maximum Substring Length
DECLARE     @Delimeter VARCHAR(5)   = ' '
DECLARE     @T TABLE
            (
                  ID INT IDENTITY
                , chvStrings VARCHAR(64)  
            )
INSERT INTO @T VALUES
            ('I like cats'),
            ('I like dogs'),
            ('cats are great'),
            ('look at that cat'),
            ('I love Mittens the cat'),
            ('I love Mittens the cat a lot'),
            ('I love Mittens the cat so much'),
            ('Dogs are okay, I guess...')

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
    COUNT(M.number) AS [Count]
FROM        
    @T as T
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE       
    N.number <= LEN(t.chvStrings) - M.number + 1
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR  N.number = 1)
GROUP BY  
    SUBSTRING(T.chvStrings, N.Number, M.Number)                      
ORDER BY    
    COUNT(T.chvStrings) DESC,
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 
like image 299
Fubudis Avatar asked Jan 21 '26 21:01

Fubudis


1 Answers

I have added a couple of extra filters to say that the substring N.Number-1 must not contain letters [a-z0-9], and similarly substring M.Number+1 must not be [a-z0-9].

Is this what you need. Modified code below:

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE     @MaxLength INT          = 50    --Maximum Substring Length
DECLARE     @Delimeter VARCHAR(5)   = ' '
DECLARE     @T TABLE
            (
                  ID INT IDENTITY
                , chvStrings VARCHAR(64)  
            )
INSERT INTO @T VALUES
            ('I like cats'),
            ('I like dogs'),
            ('cats are great'),
            ('look at that cat'),
            ('I love Mittens the cat'),
            ('I love Mittens the cat a lot'),
            ('I love Mittens the cat so much'),
            ('Dogs are okay, I guess...')

SELECT TOP 10000 
    SUBSTRING(T.chvStrings, N.Number,  M.Number) AS Word,
    COUNT(M.number) AS [Count]
    --SUBSTRING(T.chvStrings,M.Number+1,1)
FROM        
    @T as T
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY 
    (SELECT N.Number
     FROM [master]..spt_values as N
     WHERE N.type = 'P' 
       AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE       
    N.number <= LEN(t.chvStrings) - M.number + 1
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
    AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
    AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR  N.number = 1) 
    AND SUBSTRING(T.chvStrings,M.Number+1,1) NOT LIKE '%[a-z0-9]%'
    AND SUBSTRING(T.chvStrings,N.Number-1,1) NOT LIKE '%[a-z0-9]%'
GROUP BY  
    SUBSTRING(T.chvStrings, N.Number, M.Number)                      
ORDER BY    
    COUNT(T.chvStrings) DESC,
    LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC 
like image 191
Thomas Steven Avatar answered Jan 23 '26 09:01

Thomas Steven



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!