Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating an inline table function to strip HTML

I'm trying to create an inline table function to strip out HTML (so that I can CROSS APPLY it with another table). We have a scalar function in place, but the performance is just not good enough.

I'm really new to using T-SQL, so I don't know that many good practices of it, so if anyone has any suggestions, please let me know!

This is what I have so far:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION StripHTML
(   
    @text VARCHAR(MAX)
)
RETURNS TABLE 
AS
RETURN
    WITH BracketIndexes (startIndex, stopIndex) AS (SELECT
            CHARINDEX('<', @text),
            CHARINDEX('>', @text) UNION ALL SELECT
            CHARINDEX('<', @text, stopIndex + 1),
            CHARINDEX('>', @text, stopIndex + 1)
        FROM BracketIndexes
        WHERE startIndex > 0 AND stopIndex > 0
    )
    SELECT * FROM BracketIndexes

Now that I have the indices of where the <s and > are, I'm not sure how to:

  • remove the contents between those indices
  • return the string so I can cross apply it

I'm thinking I can just concat the left side and right side of each bracket using LEFT/RIGHT, but I can't seem to get the indices correct.

Edit:

This kinda worked:

DECLARE @text VARCHAR(100) = 'HELLO <B>WORLD</B> asd';

WITH BracketIndexes (startIndex, stopIndex) AS (
    SELECT
        CHARINDEX('<', @text),
        CHARINDEX('>', @text)
    UNION ALL
    SELECT
        CHARINDEX('<', @text, stopIndex + 1) - (stopIndex - startIndex + 1),
        CHARINDEX('>', @text, stopIndex + 1) - (stopIndex - startIndex + 1)
    FROM BracketIndexes
    WHERE startIndex > 0 AND stopIndex > 0
)
SELECT @text = LEFT(@text, startIndex - 1) + RIGHT(@text, LEN(@text) - stopIndex)FROM BracketIndexes WHERE startIndex > 0 AND stopIndex > 0
SELECT @text

But when when I added some more HTML statements into @text, the RIGHT had an incorrect length parameter.


1 Answers

You were on the right path. With each recursion of the CTE, you want to basically replace one HTML element with a blank using a combination of REPLACE() and SUBSTRING():

DECLARE @text varchar(max) = '<a href="">Cool!</a><b>oh yeah!</b>'  

;WITH BracketIndexes (rowNumber, original, startIndex, stopIndex, replaced) AS (SELECT
        1,
        @text,
        CHARINDEX('<', @text),
        CHARINDEX('>', @text),
        REPLACE(@text,SUBSTRING(@text,CHARINDEX('<', @text),CHARINDEX('>', @text)),'') UNION ALL SELECT
        rowNumber+1,
        replaced,
        CHARINDEX('<', replaced),
        CHARINDEX('>', replaced),
        REPLACE(replaced,SUBSTRING(replaced,CHARINDEX('<', replaced),CHARINDEX('>', replaced)),'')
    FROM BracketIndexes
    WHERE startIndex > 0 AND stopIndex > 0
)
SELECT TOP 1
    replaced
FROM 
    BracketIndexes
ORDER BY RowNumber DESC
like image 52
Bert Wagner Avatar answered Feb 21 '26 08:02

Bert Wagner