
How to save a web page into an HTML file with PowerShell or C#?

I have the following link. When I open it in Chrome, right-click the page, and choose "Save as", I can save the page as an HTML file (c:\temp\cu2.html).


After it is saved, I can open cu2.html in an HTML editor (say, VS2015) and see that the file contains a <table> tag.

However, if I open the link in IE11 (instead of Chrome) and save the same page as an HTML file, I cannot find this tag at all. In fact, the HTML file saved from IE11 has the same content as what I get with the PowerShell script below.

#Requires -version 4.0
$url = 'https://support.microsoft.com/en-us/help/4052574/cumulative-update-2-for-sql-server-2017';

$wr = Invoke-WebRequest $url;
$wr.RawContent.contains('<table') # returns false

$wr.RawContent | out-file -FilePath c:\temp\cu2_ps.html -Force; #same as the file saved from the webpage to html file in IE

So my question is:

Why is a web page saved as an HTML file in Chrome different from the same page saved in IE?

How can I use PowerShell (or C#) to save such a web page into an HTML file with the same content as the file saved in Chrome?

Thanks in advance for your help.

jyao asked Dec 01 '17 06:12



1 Answer

The page uses AngularJS and jQuery, which means some of its content is loaded by script after the document is ready. When you send the request with Invoke-WebRequest, you receive only the original HTML of the page; the rest of the content is loaded a little later by the page's scripts, which Invoke-WebRequest does not run.

To solve the problem, you can automate IE to get the expected result. It is enough to wait for the page to become ready, wait a little longer for the AngularJS logic to run and download the required content, and then read the content of the document element:

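# Start an Internet Explorer instance through COM automation
$ie = New-Object -ComObject "InternetExplorer.Application"
$url = "https://support.microsoft.com/en-us/help/4052574/cumulative-update-2-for-sql-server-2017"
$ie.Silent = $true
$ie.Navigate($url)
# Wait until the initial document has finished loading
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
# Give the page's scripts time to download and render the remaining content
Start-Sleep -Seconds 10
# Save the rendered DOM rather than the raw HTTP response
$ie.Document.documentElement.innerHTML > "C:\Tempfiles\output.html"
$ie.Stop()
$ie.Quit()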
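
The fixed 10-second wait above may be longer than needed on a fast connection and too short on a slow one. A possible refinement, sketched here under assumptions, is to poll until the expected markup shows up in the rendered document. `Wait-Until` and `Save-RenderedPage` are hypothetical helper names, not part of the original answer, and the default `'<table'` marker is borrowed from the check in the question; adjust it to whatever content you expect.

```powershell
# Hypothetical helper: polls a condition until it returns $true
# or the timeout elapses. Returns $true on success, $false on timeout.
function Wait-Until {
    param(
        [Parameter(Mandatory)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 30,
        [int] $PollMilliseconds = 250
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        if (& $Condition) { return $true }
        Start-Sleep -Milliseconds $PollMilliseconds
    }
    return $false
}

# Hypothetical wrapper around the IE automation shown above.
# Assumes Internet Explorer is installed and usable through COM.
function Save-RenderedPage {
    param(
        [Parameter(Mandatory)] [string] $Url,
        [Parameter(Mandatory)] [string] $Path,
        [string] $Marker = '<table'   # assumption: text expected in the rendered page
    )
    $ie = New-Object -ComObject 'InternetExplorer.Application'
    try {
        $ie.Silent = $true
        $ie.Navigate($Url)

        # ReadyState 4 is READYSTATE_COMPLETE
        $loaded = Wait-Until { -not $ie.Busy -and $ie.ReadyState -eq 4 }
        if (-not $loaded) { throw "Timed out loading $Url" }

        # Poll the rendered DOM until the marker appears (case-insensitive match)
        $found = Wait-Until { $ie.Document.documentElement.innerHTML -match [regex]::Escape($Marker) }

        # Save whatever has been rendered, even if the marker never appeared
        $ie.Document.documentElement.innerHTML | Out-File -FilePath $Path -Force
        return $found
    }
    finally {
        $ie.Quit()
    }
}
```

Calling `Save-RenderedPage -Url $url -Path 'C:\Tempfiles\output.html'` would then return `$true` if the marker was seen before the timeout and `$false` otherwise, so the caller can decide whether the saved file is complete.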
Reza Aghaei answered Oct 21 '22 21:10