In this HTML code:
<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre>
Creating a new ZIP of IP Phone files from HTTP/PhoneBackup
and HTTPS/PhoneBackup
</pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
Download the new ZIP of IP Phone files
</a>
</div>
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.
How can I do that?
What makes this question interesting is that HTML looks and smells just like XML, and XML is much easier to work with programmatically thanks to its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions (the unquoted href value and the unmatched </div>, for example). That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from a file or from the web, depending on your needs.
Set-StrictMode -Version Latest

# Assumes HtmlAgilityPack.dll sits in a "bin" folder next to your PowerShell profile.
$HtmlAgilityPackPath = Join-Path (Split-Path $PROFILE) "bin\HtmlAgilityPack.dll"
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = if ($null -ne $resultNode) { $resultNode.Attributes[$attribute].Value } else { $default }
    }
    else {
        # Otherwise retrieve the element's text.
        $resultNode = $node.SelectSingleNode($xpath)
        $text = if ($null -ne $resultNode) { $resultNode.InnerText } else { $default }
    }

    # If a regex is given, use its first capture group to extract a substring from the text.
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
# Resolve-Path turns the relative path into an absolute one, because .NET methods
# resolve relative paths against the process directory, not PowerShell's location.
$doc.Load((Resolve-Path "tmp\temp.html").Path)   # Use this to load a file
#$doc.LoadHtml((Get-HttpResource $url))          # Use this PSCX cmdlet to load a live web page
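
The regex that Select-NodeContent receives can also be tried on its own, without HtmlAgilityPack. The sketch below hard-codes the href value from the question (an assumption; in practice it would come from the first one-liner) and applies the same pattern with PowerShell's -match operator. The DateTime parsing step and the variable names $href, $stamp, and $when are illustrative additions, not part of the code above.

```powershell
# The href value from the question's HTML, hard-coded here for illustration.
$href = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

$stamp = $null
$when  = $null

# Same pattern as in the second one-liner: capture group 1 is the timestamp.
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $stamp = $matches[1]

    # Optional extra step (an assumption, not in the answer above):
    # turn the captured text into a DateTime for further processing.
    $when = [datetime]::ParseExact(
        $stamp,
        'yyyy-MMM-dd_HH:mm:ss',
        [System.Globalization.CultureInfo]::InvariantCulture)
}

$stamp                 # 2012-Jul-25_15:47:47
$when.ToString('s')    # 2012-07-25T15:47:47
```

Using the invariant culture keeps the month abbreviation "Jul" parseable regardless of the machine's regional settings.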