
SQL Server 2012: extract Regex groups

I have text in my database in Markdown format. I'd like to extract links and count the number of matching links I have. I can get a listing of text blocks that contain links using a query similar to this:

SELECT post_text
FROM posts p
WHERE p.body like '%\[%](http%)%' ESCAPE '\'

How do I take the next step, though, and extract just the link portion of the text (the part in the parentheses)? If I can get this, I can count the number of times each specific link appears in my dataset.

Some sample data:

"Visit [Google](http://google.com)"    -> Should return "http://google.com"
"Get an [iPhone](http://www.apple.com) (I like it better than Android)"   -> Should return "http://www.apple.com"
"[Example](http://example.com)"    -> Should return "http://example.com"
"This is a message"    -> Nothing to return on this one, no link
"I like cookies (chocolate chip)"  -> Nothing to return on this one, no link
"[Frank] says 'Hello'" -> Nothing to return on this one, no link

I am using SQL Server 2012 (in case there are differences between versions in this regard).

Andy asked Sep 30 '14 19:09



1 Answer

Assuming the actual data is no more complex than the stated examples, this should work without resorting to RegEx:

DECLARE @posts TABLE
(
   post_id INT NOT NULL IDENTITY(1, 1),
   post_text NVARCHAR(4000) NOT NULL,
   body NVARCHAR(2048) NULL
);
INSERT INTO @posts (post_text, body) VALUES (N'first',
                                           N'Visit [Google](http://google.com)');
INSERT INTO @posts (post_text, body) VALUES (N'second',
                                           N'Get an [iPhone](http://www.apple.com)');
INSERT INTO @posts (post_text, body) VALUES (N'third',
                                           N'[Example](http://example.com)');
INSERT INTO @posts (post_text, body) VALUES (N'fourth',
                                           N'This is a message');
INSERT INTO @posts (post_text, body) VALUES (N'fifth',
                                           N'I like cookies (chocolate chip)');
INSERT INTO @posts (post_text, body) VALUES (N'sixth',
                                           N'[Frankie] says ''Relax''');
INSERT INTO @posts (post_text, body) VALUES (N'seventh',
                                           NULL);


SELECT p.post_text,
       SUBSTRING(
                  p.body,
                  CHARINDEX(N'](', p.body) + 2,
                  CHARINDEX(N')', p.body) - (CHARINDEX(N'](', p.body) + 2)
                ) AS [URL]
FROM   @posts p
WHERE  p.body like '%\[%](http%)%' ESCAPE '\';

Output:

post_text  URL
first      http://google.com
second     http://www.apple.com
third      http://example.com
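To get from the extracted URL to the count the question asks for, one option is to group on the extracted value. The following is only a sketch under a few assumptions: the table variable and its rows are made-up sample data, only the first http link in each body is considered, and the closing parenthesis is searched for starting at the URL's first character (the optional third argument of CHARINDEX), so that an earlier ")" in the text does not produce a negative length.

```sql
-- Hypothetical sample data; the column name mirrors the example above.
DECLARE @posts TABLE
(
   post_id INT NOT NULL IDENTITY(1, 1),
   body    NVARCHAR(2048) NULL
);
INSERT INTO @posts (body) VALUES
   (N'Visit [Google](http://google.com)'),
   (N'(Again) see [Google](http://google.com) (my favorite)'),
   (N'[Example](http://example.com)'),
   (N'I like cookies (chocolate chip)');

SELECT   u.[URL],
         COUNT(*) AS [LinkCount]
FROM     @posts p
-- Position of the first character of the URL (just past "](").
CROSS APPLY (SELECT CHARINDEX(N'](http', p.body) + 2 AS [UrlStart]) s
-- Look for the closing ")" only from the start of the URL onward.
CROSS APPLY (SELECT SUBSTRING(p.body,
                              s.[UrlStart],
                              CHARINDEX(N')', p.body, s.[UrlStart]) - s.[UrlStart]
                             ) AS [URL]) u
WHERE    p.body LIKE N'%\[%](http%)%' ESCAPE N'\'
GROUP BY u.[URL]
ORDER BY [LinkCount] DESC;
```

With this sample data, http://google.com should come back with a count of 2 and http://example.com with a count of 1. Posts containing several links would need a different approach, since CHARINDEX only finds the first match.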

PS:
If you really want to use Regular Expressions, in SQL Server 2012 they are only available via SQLCLR. You can write your own functions or download a prebuilt library. I wrote one such library, SQL#, which has a Free version that includes the RegEx functions. But those should only be used if a T-SQL solution cannot be found, which so far is not the case here.

Solomon Rutzky answered Oct 27 '22 00:10