We have a cms system that write html content blocks into sql server database. I know the table name and field name where these html content blocks reside. Some html contains links () to pdf files. Here is a fragment:
<p>A deferred tuition payment plan,
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>
I need to extract pdf file names from all such html content blocks. At the end I need to get a list:
Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf
of all pdf file names from that field.
Any help is appreciated. Thanks.
UPDATE
I have received many replies, thank you so much, but I forgot to mention that we are still using SQL Server 2000 here. So, this had to be done using SQL 2000 SQL.
Create this function:
create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
select @tmp = left(@s, @i+3);
select @j = charindex('/', reverse(@tmp)); -- directory delimiter
select @k = charindex('"', reverse(@tmp)); -- start of href
if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*)
if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
select @s = stuff(@s, 1, @i+4, ''); -- remove up to ".pdf"
set @i = charindex(N'.pdf', @s);
end
return
end
GO
A demo on using that function:
declare @t table (html varchar(max));
insert @t values
('
<p>A deferred tuition payment plan,
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
('
<p>A deferred tuition payment plan,
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
<a href=''/look/path/The second file.pdf''
target="_blank">list</a>');
select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;
Results:
|HTML | PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... | Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | The second file.pdf |
SQL Fiddle Demo
Well it's not pretty but this works using standard Transact-SQL:
SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
THEN SUBSTRING(
html,
CHARINDEX('.pdf', html) -
PATINDEX(
'%["/]%',
REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
PATINDEX(
'%["/]%',
REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
ELSE NULL
END AS filename
FROM mytable
Could expand the list of delimiting characters before the filename from ["/]
(which matches either a quotation mark or slash) if you like.
See SQL Fiddle demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With