
Regex to extract attribute value

Tags: html, c#, regex

What would be a quick way to extract the values of the title attributes from HTML like this:

...
<li><a href="/wiki/Proclo" title="Proclo">Proclo</a></li>
<li><a href="/wiki/Proclus" title="Proclus">Proclus</a></li>
<li><a href="/wiki/Ptolemy" title="Ptolemy">Ptolemy</a></li>
<li><a href="/wiki/Pythagoras" title="Pythagoras">Pythagoras</a></li></ul><h3>S</h3>
...

It should return Proclo, Proclus, Ptolemy, Pythagoras, ... as one string per line. I'm reading the file with a StreamReader in C#.

Thank you.

al1 asked Apr 02 '11 21:04



2 Answers

This C# regex will find all title values:

(?<=\btitle=")[^"]*

The C# code is like this:

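// requires: using System.Text.RegularExpressions;
Regex regex = new Regex(@"(?<=\btitle="")[^""]*");
foreach (Match match in regex.Matches(input))
{
    string title = match.Value; // one title attribute value per match
}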

The regex uses a positive lookbehind, (?<=\btitle="), to find the position just after title=" where the value starts. It then matches everything up to, but not including, the closing double quote, so the match itself is the bare title value.
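Since the question reads the file with a StreamReader, here is a minimal sketch of how the lookbehind pattern could be applied line by line. The class and method names (TitleExtractor, ExtractTitles) are made up for illustration. The sample input is copied from the question and fed through a StringReader so the example runs without a file. A StreamReader is also a TextReader, so it could be passed in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class TitleExtractor
{
    // Same pattern as above: lookbehind for title=", then everything up to the next quote.
    private static readonly Regex TitleRegex = new Regex(@"(?<=\btitle="")[^""]*");

    // Reads the input line by line and collects every title value found.
    public static List<string> ExtractTitles(TextReader reader)
    {
        var titles = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Matches (not Match) so several titles on one line are all found.
            foreach (Match match in TitleRegex.Matches(line))
            {
                titles.Add(match.Value);
            }
        }
        return titles;
    }

    static void Main()
    {
        // Sample lines from the question; a StreamReader over the real file
        // could be passed to ExtractTitles instead of this StringReader.
        string sample =
            "<li><a href=\"/wiki/Proclo\" title=\"Proclo\">Proclo</a></li>\n" +
            "<li><a href=\"/wiki/Proclus\" title=\"Proclus\">Proclus</a></li>";

        using (var reader = new StringReader(sample))
        {
            foreach (string title in ExtractTitles(reader))
            {
                Console.WriteLine(title); // prints Proclo, then Proclus
            }
        }
    }
}
```

Note that [^"]* also matches an empty title="" and returns an empty string for it. Filter those out if they are not wanted.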

Staffan Nöteberg answered Nov 15 '22 15:11



Use the regexp below:

title="([^"]+)"

and then read Groups[1] of each match to get the captured title value.

EDIT: I have modified the regexp to cover the examples provided in a comment by @Staffan Nöteberg.
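A rough sketch of the capture-group approach in C#, assuming the whole HTML is already in a string. The names TitleGroups and ExtractTitles are invented for this example. Group 0 is the full match including title="...", and group 1 is just the part inside the parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class TitleGroups
{
    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        // The verbatim string below is the pattern title="([^"]+)" with quotes doubled.
        foreach (Match match in Regex.Matches(html, @"title=""([^""]+)"""))
        {
            // Groups[1] holds only the text captured between the quotes.
            titles.Add(match.Groups[1].Value);
        }
        return titles;
    }

    static void Main()
    {
        // Sample lines from the question.
        string html =
            "<li><a href=\"/wiki/Ptolemy\" title=\"Ptolemy\">Ptolemy</a></li>\n" +
            "<li><a href=\"/wiki/Pythagoras\" title=\"Pythagoras\">Pythagoras</a></li>";

        foreach (string title in ExtractTitles(html))
        {
            Console.WriteLine(title); // prints Ptolemy, then Pythagoras
        }
    }
}
```

Because this pattern uses + rather than *, an empty title="" does not match at all.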

MPękalski answered Nov 15 '22 16:11
