
Parse local HTML file

I can use PowerShell to parse an HTML page:

PS > $foo = Invoke-WebRequest http://example.com

PS > $foo.Links.Count
1

However, if I download the page

PS > Invoke-WebRequest -OutFile example.htm http://example.com

and then try to parse the downloaded page, it gives an unexpected result:

PS > $foo = Invoke-WebRequest file://$pwd/example.htm

PS > $foo.Links.Count
0

How can I parse the local downloaded page?

Zombo asked Jul 27 '14 01:07


2 Answers

It appears that Invoke-WebRequest loads file protocol URIs just fine, but fails to parse them even in PowerShell 4.0 (where it is officially supported).

An alternative that does not require setting up a web server is to load the HTML directly into MSHTML and parse it there.

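# Create an MSHTML document object through COM (Windows only)
$html = New-Object -ComObject "HTMLFile"
# -Raw reads the whole file as a single string
$source = Get-Content -Path "file.html" -Raw
$html.IHTMLDocument2_write($source)

# Number of links in the parsed document
$html.links.length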
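
The snippet above can be wrapped into a small function that also lists the link targets. This is only a sketch: `Get-LocalHtmlLinks` is a hypothetical name, and the fallback in the `catch` block assumes that on machines where `IHTMLDocument2_write` is not exposed, the document's plain `write` method accepts the markup as UTF-16 bytes. That fallback is a commonly reported workaround, not something tested here.

```powershell
# Hypothetical helper (not part of the original answer):
# parses a local HTML file with MSHTML and returns the href of each link.
function Get-LocalHtmlLinks {
    param([Parameter(Mandatory)][string]$Path)

    # -Raw reads the whole file as one string (PowerShell 3.0 or later)
    $source = Get-Content -LiteralPath $Path -Raw
    $html = New-Object -ComObject 'HTMLFile'
    try {
        # Same call as in the answer above
        $html.IHTMLDocument2_write($source)
    } catch {
        # Assumed fallback when that method is not exposed:
        # hand the markup over as UTF-16 bytes
        $html.write([System.Text.Encoding]::Unicode.GetBytes($source))
    }
    $html.Close()

    # Emit one href per link found in the document
    foreach ($link in $html.links) { $link.href }
}
```

Calling `@(Get-LocalHtmlLinks -Path .\example.htm).Count` should then give the link count. Relative links may come back resolved against the empty in-memory document rather than the original URL.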

Note that when I tested this, a single

<meta http-equiv="X-UA-Compatible" content="IE=edge" />

header prevented my HTML from parsing, and I have no idea why: the document had similar XHTML-style headers and MSHTML had no issues with those.
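
One possible workaround, offered as an assumption rather than something verified against MSHTML, is to strip that tag from the markup before writing it into the document. `Remove-UACompatibleMeta` is a hypothetical helper name, and the pattern only matches tags where `http-equiv` is the first attribute.

```powershell
# Hypothetical helper: removes <meta http-equiv="X-UA-Compatible" ...> tags.
function Remove-UACompatibleMeta {
    param([string]$Html)

    # -replace is case-insensitive by default; [^>]* consumes the rest of the tag.
    # Inside a single-quoted string, '' is an escaped single quote.
    $pattern = '<meta\s+http-equiv\s*=\s*["'']?X-UA-Compatible["'']?[^>]*>'
    $Html -replace $pattern, ''
}

$sample = '<head><meta http-equiv="X-UA-Compatible" content="IE=edge" /><title>t</title></head>'
Remove-UACompatibleMeta -Html $sample
# Output: <head><title>t</title></head>
```

The cleaned string can then be passed to `IHTMLDocument2_write` in place of `$source`.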

PeterK answered Sep 25 '22 20:09


You can serve the file from a local web server to get around the dumb limitation of Invoke-WebRequest:

PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm

PS > $foo.Links.Count
1
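
The answer does not say which web server was listening on port 8080. If none is at hand, a throwaway one can be sketched in PowerShell with .NET's `System.Net.HttpListener`. The function name `Start-LocalFileServer` and its parameters are made up for this illustration. It serves files from a single folder for a fixed number of requests and then stops.

```powershell
# Hypothetical helper: serves files from $Root at http://localhost:$Port/
# and exits after $Count requests.
function Start-LocalFileServer {
    param(
        [string]$Root = (Get-Location).Path,
        [int]$Port = 8080,
        [int]$Count = 1
    )

    # Resolve to an absolute file system path for the .NET file APIs
    $rootPath = (Resolve-Path -LiteralPath $Root).ProviderPath

    $listener = New-Object System.Net.HttpListener
    $listener.Prefixes.Add("http://localhost:$Port/")
    $listener.Start()
    try {
        for ($i = 0; $i -lt $Count; $i++) {
            $context = $listener.GetContext()   # blocks until a request arrives
            # Keep only the file name, so requests cannot leave $rootPath
            $name = [System.IO.Path]::GetFileName($context.Request.Url.LocalPath)
            $path = Join-Path $rootPath $name
            if ($name -and (Test-Path -LiteralPath $path -PathType Leaf)) {
                $bytes = [System.IO.File]::ReadAllBytes($path)
                $context.Response.ContentType = 'text/html'
                $context.Response.ContentLength64 = $bytes.Length
                $context.Response.OutputStream.Write($bytes, 0, $bytes.Length)
            } else {
                $context.Response.StatusCode = 404
            }
            $context.Response.Close()
        }
    } finally {
        $listener.Stop()
        $listener.Close()
    }
}

# Usage (run in a separate console, from the folder containing example.htm):
# Start-LocalFileServer -Port 8080 -Count 1
```

Because `GetContext` blocks, the server has to run in a different console (or a background job) from the one issuing `Invoke-WebRequest`.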

Note that this works even with no network connection. For example, with the network down, a request to the remote host fails:

PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'

Zombo answered Sep 22 '22 20:09