Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract file names from a field that contains html content in sql server?

We have a cms system that write html content blocks into sql server database. I know the table name and field name where these html content blocks reside. Some html contains links () to pdf files. Here is a fragment:

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract pdf file names from all such html content blocks. At the end I need to get a list:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

of all pdf file names from that field.

Any help is appreciated. Thanks.

UPDATE

I have received many replies, thank you so much, but I forgot to mention that we are still using SQL Server 2000 here. So, this had to be done using SQL 2000 SQL.

like image 469
monstro Avatar asked Apr 25 '13 21:04

monstro


2 Answers

Create this function:

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
  select @tmp = left(@s, @i+3);
  select @j = charindex('/', reverse(@tmp)); -- directory delimiter
  select @k = charindex('"', reverse(@tmp)); -- start of href
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*)
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
  select @s = stuff(@s, 1, @i+4, ''); -- remove up to ".pdf"
  set @i = charindex(N'.pdf', @s);
end
return
end
GO

A demo on using that function:

declare @t table (html varchar(max));
insert @t values
  ('
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
  ('
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
   <a href=''/look/path/The second file.pdf''
target="_blank">list</a>');

select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;

Results:

|HTML                  |                                       PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... |        Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |                       The second file.pdf |

SQL Fiddle Demo

like image 198
RichardTheKiwi Avatar answered Sep 22 '22 18:09

RichardTheKiwi


Well it's not pretty but this works using standard Transact-SQL:

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable

Could expand the list of delimiting characters before the filename from ["/] (which matches either a quotation mark or slash) if you like.

See SQL Fiddle demo

like image 38
Steve Chambers Avatar answered Sep 22 '22 18:09

Steve Chambers