I have text in my database in Markdown format. I'd like to extract links and count the number of matching links I have. I can get a listing of text blocks that contain links using a query similar to this:
SELECT post_text
FROM posts p
WHERE p.body like '%\[%](http%)%' ESCAPE '\'
How do I take the next step and extract just the link portion of the text (the part inside the parentheses)? If I can get this, I can count the number of times each specific link appears in my dataset.
Some sample data:
"Visit [Google](http://google.com)" -> Should return "http://google.com"
"Get an [iPhone](http://www.apple.com) (I like it better than Android)" -> Should return "http://www.apple.com"
"[Example](http://example.com)" -> Should return "http://example.com"
"This is a message" -> Nothing to return on this one, no link
"I like cookies (chocolate chip)" -> Nothing to return on this one, no link
"[Frank] says 'Hello'" -> Nothing to return on this one, no link
I am using SQL Server 2012 (if there are differences between versions in this regard).
Assuming the actual data is no more complex than the stated examples, this should work without resorting to RegEx:
DECLARE @posts TABLE
(
    post_id INT NOT NULL IDENTITY(1, 1),
    post_text NVARCHAR(4000) NOT NULL,
    body NVARCHAR(2048) NULL
);
INSERT INTO @posts (post_text, body) VALUES (N'first',
N'Visit [Google](http://google.com)');
INSERT INTO @posts (post_text, body) VALUES (N'second',
N'Get an [iPhone](http://www.apple.com)');
INSERT INTO @posts (post_text, body) VALUES (N'third',
N'[Example](http://example.com)');
INSERT INTO @posts (post_text, body) VALUES (N'fourth',
N'This is a message');
INSERT INTO @posts (post_text, body) VALUES (N'fifth',
N'I like cookies (chocolate chip)');
INSERT INTO @posts (post_text, body) VALUES (N'sixth',
N'[Frankie] says ''Relax''');
INSERT INTO @posts (post_text, body) VALUES (N'seventh',
NULL);
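SELECT p.post_text,
       -- Start two characters past "](" (at the "h" of "http") and stop at
       -- the first ")" found after that point, not the first ")" in the text.
       SUBSTRING(
           p.body,
           CHARINDEX(N'](http', p.body) + 2,
           CHARINDEX(N')', p.body, CHARINDEX(N'](http', p.body))
               - (CHARINDEX(N'](http', p.body) + 2)
       ) AS [URL]
FROM @posts p
WHERE p.body LIKE N'%\[%](http%)%' ESCAPE N'\';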
Output:
post_text  URL
first      http://google.com
second     http://www.apple.com
third      http://example.com
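Since the end goal is to count how often each link appears, the same extraction can be grouped on. What follows is only a minimal sketch under a few assumptions: the table variable and its sample rows are made up for illustration, the aliases x, url and LinkCount are hypothetical names, and only the first link in each row is considered (a post containing several links would need a different approach). The CASE wrapper is there so SUBSTRING is never evaluated for rows that contain no link.

```sql
-- Hypothetical sample data; one link appears in two different rows.
DECLARE @posts TABLE
(
    post_id INT NOT NULL IDENTITY(1, 1),
    body NVARCHAR(2048) NULL
);

INSERT INTO @posts (body) VALUES
    (N'Visit [Google](http://google.com)'),
    (N'Search with [Google](http://google.com) today'),
    (N'[Example](http://example.com)'),
    (N'This is a message');

SELECT x.url AS [URL],
       COUNT(*) AS [LinkCount]
FROM @posts p
CROSS APPLY (
    -- Extract the first link only when the row actually contains one;
    -- otherwise the CASE yields NULL and the row is filtered out below.
    SELECT CASE
               WHEN p.body LIKE N'%\[%](http%)%' ESCAPE N'\'
               THEN SUBSTRING(
                        p.body,
                        CHARINDEX(N'](http', p.body) + 2,
                        CHARINDEX(N')', p.body, CHARINDEX(N'](http', p.body))
                            - (CHARINDEX(N'](http', p.body) + 2)
                    )
           END AS url
) x
WHERE x.url IS NOT NULL
GROUP BY x.url
ORDER BY [LinkCount] DESC, x.url;
```

With the sample rows above, this should return http://google.com with a count of 2 and http://example.com with a count of 1.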
PS:
If you really want to use Regular Expressions, note that SQL Server 2012 has no built-in support for them in T-SQL; they are only available via SQLCLR. You can write your own functions or download a prebuilt library. I wrote one such library, SQL#, that has a Free version that includes the RegEx functions. But those should only be used if a T-SQL solution cannot be found, which so far is not the case here.