
How do you parse HTML in VB.NET?

I would like to know if there is a simple way to parse HTML in VB.NET. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way in VB.NET?

tooleb asked Feb 05 '09 16:02



4 Answers

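' Add a project reference to Microsoft.mshtml, then import it at the top of the file:

Imports mshtml

' Copies each link's visible text into its title attribute
' and returns the modified body markup.
Function parseMyHtml(ByVal htmlToParse As String) As String
    ' Note: if "Embed Interop Types" is enabled for the reference,
    ' HTMLDocumentClass will not compile; New HTMLDocument() with a cast is the usual workaround.
    Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
    htmlDocument.write(htmlToParse)
    htmlDocument.close()

    ' body.all and tags() are typed as Object, so cast explicitly (required under Option Strict On).
    Dim allElements As IHTMLElementCollection = DirectCast(htmlDocument.body.all, IHTMLElementCollection)
    Dim allLinks As IHTMLElementCollection = DirectCast(allElements.tags("a"), IHTMLElementCollection)

    For Each element As IHTMLElement In allLinks
        element.title = element.innerText
    Next

    Return htmlDocument.body.innerHTML
End Function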

As found here:

TripleHelix Tech answered Sep 19 '22 01:09


I like Html Agility Pack: it's very developer-friendly, free, and its source code is available.
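For illustration, here is a minimal sketch of pulling the links out of loosely formed markup with Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the module name `HapSketch` and the helper `ExtractLinks` are made up for this example and are not part of the library.

```vbnet
' Assumption: the HtmlAgilityPack NuGet package is referenced by the project.
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module HapSketch
    ' Hypothetical helper: returns the href value of every link in the markup.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html) ' accepts markup that is not well-formed XML

        Dim links As New List(Of String)
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors IsNot Nothing Then
            For Each anchor As HtmlNode In anchors
                links.Add(anchor.GetAttributeValue("href", String.Empty))
            Next
        End If
        Return links
    End Function
End Module
```

The XPath query is what gives it the XML-like feel the question asks for, even though the input never has to be valid XML.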

TcKs answered Sep 21 '22 01:09


Don't use Agility Pack; just use the mshtml library to access the DOM. It is what IE uses, and it is great for going through HTML elements.

Agility Pack is nasty and unnecessarily hacky if you ask me; mshtml is the way to go. Look it up on MSDN.

Erx_VB.NExT.Coder answered Sep 22 '22 01:09


If your HTML follows XHTML standards, you can do a lot of the parsing and processing using the System.Xml namespace classes.

If, on the other hand, what you're parsing is what web developers refer to as "tag soup," you'll need a third-party parser like HTML Agility Pack.

This may be only a partial solution to your problem if you're trying to figure out how a browser will interpret your HTML, as each browser parses tag soup slightly differently.
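As a rough sketch of the XHTML case, the snippet below loads well-formed markup with `XmlDocument` and queries it with XPath. It assumes the input declares the standard XHTML namespace and contains no DOCTYPE or named entities such as `&nbsp;`, which would need DTD handling that is left out here. The names `XhtmlSketch` and `ExtractLinkTexts` are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Xml

Module XhtmlSketch
    ' Hypothetical helper: returns the text of every <a> element in an XHTML string.
    Function ExtractLinkTexts(ByVal xhtml As String) As List(Of String)
        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml) ' throws XmlException if the markup is not well-formed

        ' XHTML elements live in a namespace, so XPath needs a prefix mapped to it.
        Dim ns As New XmlNamespaceManager(doc.NameTable)
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml")

        Dim texts As New List(Of String)
        For Each anchor As XmlNode In doc.SelectNodes("//x:a", ns)
            texts.Add(anchor.InnerText)
        Next
        Return texts
    End Function
End Module
```

Feeding tag soup to `LoadXml` raises an `XmlException`, which is a quick way to tell which of the two cases a given page falls into.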

Yes - that Jake. answered Sep 21 '22 01:09