Having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.
I'm trying to match this:
<div class="entry">
<span class="title">Some company</span>
<span class="description">
<strong>Address: </strong>Some address
<br /><strong>Telephone: </strong> 01908 12345
</span>
</div>
The simple VB code is as follows:
Dim myMatches As MatchCollection
Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
Dim wc As New WebClient
Dim html As String = wc.DownloadString("http://somewebaddress.com")
RichTextBox1.Text = html
myMatches = myRegex.Matches(html)
MsgBox(html)
'Search for all the words in a string
Dim successfulMatch As Match
For Each successfulMatch In myMatches
MsgBox(successfulMatch.Groups(1).ToString)
Next
Any help would be greatly appreciated.
Your regex works for your example. There are some improvements that should be made, though:
<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>
[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.
.*? (note the ?) means "match any number of characters, but only as few as possible". This avoids matching from the first <div class="entry"> tag all the way to the last </div> in your page.
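To make the greedy/lazy difference concrete, here is a minimal sketch. The two-entry HTML string is a made-up sample, not the asker's real page, and the module and variable names are only for illustration. RegexOptions.Singleline is passed so that . also matches the line break between the entries.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVsLazyDemo
    Sub Main()
        ' Hypothetical sample page with two entries (not from the original post).
        Dim html As String = "<div class=""entry"">first</div>" & vbLf &
                             "<div class=""entry"">second</div>"

        ' Greedy: .* runs on to the LAST </div>, so both entries become one match.
        Dim greedyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>.*</div>",
                                     RegexOptions.Singleline)

        ' Lazy: .*? stops at the FIRST </div>, so each entry is matched on its own.
        Dim lazyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>",
                                   RegexOptions.Singleline)

        Console.WriteLine(greedyRegex.Matches(html).Count) ' 1
        Console.WriteLine(lazyRegex.Matches(html).Count)   ' 2

        ' Print the captured contents of each entry.
        For Each m As Match In lazyRegex.Matches(html)
            Console.WriteLine(m.Groups("content").Value)
        Next
    End Sub
End Module
```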
But your regex itself should still have matched something. Perhaps you're not using it correctly?
I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:
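' Singleline lets "." match newlines, which is needed because the div spans several lines
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
' SubjectString is the downloaded HTML; ResultList collects the contents of every entry
Dim ResultList As New List(Of String)
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While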
I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:
<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>
or (behold the joy of multiline strings in VB.NET):
Dim RegexObj As New Regex( _
    "<div[^<>]*class=""entry""[^<>]*>\s*" & Chr(10) & _
    "<span[^<>]*class=""title""[^<>]*>\s*" & Chr(10) & _
    "(?<title>.*?)" & Chr(10) & _
    "\s*</span>\s*" & Chr(10) & _
    "<span[^<>]*class=""description""[^<>]*>\s*" & Chr(10) & _
    "<strong>\s*Address:\s*</strong>\s*" & Chr(10) & _
    "(?<address>.*?)" & Chr(10) & _
    "\s*<strong>\s*Telephone:\s*</strong>\s*" & Chr(10) & _
    "(?<phone>.*?)" & Chr(10) & _
    "\s*</span>\s*</div>", _
    RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)
(Of course, now you need to store the results for MatchResult.Groups("title") etc.)
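As a rough sketch of what reading the named groups could look like, the program below runs the long pattern against the sample markup from the question and prints each group. The module name and the way the sample string is assembled are illustrative choices, not part of the original answer.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupsDemo
    Sub Main()
        ' Sample markup copied from the question, joined with line feeds.
        Dim html As String = String.Join(vbLf, {
            "<div class=""entry"">",
            "<span class=""title"">Some company</span>",
            "<span class=""description"">",
            "<strong>Address: </strong>Some address",
            "<br /><strong>Telephone: </strong> 01908 12345",
            "</span>",
            "</div>"})

        ' Same pattern as above, written as one concatenated string.
        Dim pattern As String =
            "<div[^<>]*class=""entry""[^<>]*>\s*" &
            "<span[^<>]*class=""title""[^<>]*>\s*(?<title>.*?)\s*</span>\s*" &
            "<span[^<>]*class=""description""[^<>]*>\s*" &
            "<strong>\s*Address:\s*</strong>\s*(?<address>.*?)" &
            "\s*<strong>\s*Telephone:\s*</strong>\s*(?<phone>.*?)" &
            "\s*</span>\s*</div>"

        Dim entryRegex As New Regex(pattern, RegexOptions.Singleline)

        ' Walk every entry on the page and read the three named groups.
        For Each m As Match In entryRegex.Matches(html)
            Console.WriteLine("Title:   " & m.Groups("title").Value)
            Console.WriteLine("Address: " & m.Groups("address").Value)
            Console.WriteLine("Phone:   " & m.Groups("phone").Value)
        Next
    End Sub
End Module
```

One thing to watch for: with the markup exactly as posted, the address group should also pick up the line break and the <br /> that sit before the Telephone label, because the pattern has nothing that consumes them. That is the kind of fragility that makes this approach hard to recommend.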
Try using RegexOptions.Multiline instead of RegexOptions.Singleline.
Thanks to @Tim for pointing out that the above doesn't work (Multiline only changes where ^ and $ match; Singleline is the option that lets . match newlines)... my bad.
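For anyone who mixes the two options up, here is a small sketch of how they differ. The two-line input string and the names used are made up for the illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineVsMultilineDemo
    Sub Main()
        ' Hypothetical input: an opening and closing tag separated by a line feed.
        Dim sample As String = "<div>" & vbLf & "</div>"
        Dim pattern As String = "<div>.*</div>"

        ' Multiline only affects ^ and $; "." still refuses to cross the line feed.
        Console.WriteLine(Regex.IsMatch(sample, pattern, RegexOptions.Multiline))  ' False

        ' Singleline makes "." match every character, including the line feed.
        Console.WriteLine(Regex.IsMatch(sample, pattern, RegexOptions.Singleline)) ' True
    End Sub
End Module
```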
@Tim's answer is a good one, and should be the accepted answer, but an extra thing that is stopping your code from working is that your pattern has no capturing group, so there is nothing for Groups(1) to return. Groups(0) always holds the entire match.
Change...
MsgBox(successfulMatch.Groups(1).ToString)
To...
MsgBox(successfulMatch.Groups(0).ToString)
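A quick way to see the difference is the sketch below. The one-line sample string and the simplified patterns are assumptions made for the demo, not the asker's real data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    Sub Main()
        ' Hypothetical one-entry sample.
        Dim sample As String = "<div class=""entry"">Some company</div>"

        ' No parentheses in the pattern, so the only group is Groups(0), the whole match.
        Dim noGroup As New Regex("<div[^<>]*>.*?</div>", RegexOptions.Singleline)
        Dim m As Match = noGroup.Match(sample)
        Console.WriteLine(m.Groups(0).Value)   ' the full <div ...>...</div> text
        Console.WriteLine(m.Groups(1).Success) ' False: group 1 does not exist

        ' Adding parentheses around .*? creates group 1, which holds just the contents.
        Dim withGroup As New Regex("<div[^<>]*>(.*?)</div>", RegexOptions.Singleline)
        Console.WriteLine(withGroup.Match(sample).Groups(1).Value) ' Some company
    End Sub
End Module
```

So if you only want the text inside the div rather than the whole tag, the other option is to keep Groups(1) and add a capturing group to the pattern.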