I make use of the PAL tool (https://pal.codeplex.com/) to generate HTML reports from perfmon logs within Windows. After PAL processes .blg files from perfmon it dumps the information into an HTML document that contains tables with various data points about how the system performed. I am currently writing a script that looks at the contents of a directory for all HTML files, and does a get-content on all the HTML files.
What I would like to do is scrape the output of this get-content for specific tables that have a varying number of rows. Is it possible, using native PowerShell cmdlets, to look for specific tables, count how many rows are in each table, and dump just the desired tables and table rows?
Here is an example of the table format I'm trying to scrape:
<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>\LogicalDisk(*)\Disk Transfers/sec</B></TH><TH><B>Min</B></TH><TH><B>Avg</B></TH><TH><B>Max</B></TH><TH><B>Hourly Trend</B></TH><TH><B>Std Deviation</B></TH><TH><B>10% of Outliers Removed</B></TH><TH><B>20% of Outliers Removed</B></TH><TH><B>30% of Outliers Removed</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>1</TD><TD>7</TD><TD>310</TD><TD>0</TD><TD>11</TD><TD>5</TD><TD>5</TD><TD>5</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/D:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>0</TD><TD>24</TD><TD>164</TD><TD>-1</TD><TD>11</TD><TD>22</TD><TD>21</TD><TD>20</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/HarddiskVolume5</TD><TD>0</TD><TD>0</TD><TD>2</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/L:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/T:</TD><TD>0</TD><TD>7</TD><TD>430</TD><TD>0</TD><TD>21</TD><TD>3</TD><TD>2</TD><TD>2</TD></TR>
</TABLE>
The Table ID is constant among all the output files, but the number of table rows is not. Any help is appreciated!
To extract a table from HTML, you first need to open your developer tools to see how the HTML looks and verify if it really is a table and not some other element. You open developer tools with the F12 key, see the “Elements” tab, and highlight the element you're interested in.
The Get-Content cmdlet reads the content of an HTML file. By default it returns the file as a collection of strings, one per line.
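How the file is read matters for any pattern that spans several lines, such as a whole table. The following is a minimal sketch, not taken from the answers: the scratch file name pal-sample.html is made up for illustration. It contrasts the default line-by-line read with reading the file as one string. Get-Content gained a -Raw switch for the latter in PowerShell 3.0; [IO.File]::ReadAllText is used here because it also works on 2.0.

```powershell
# Hypothetical scratch file so the sketch is self-contained.
$samplePath = Join-Path ([IO.Path]::GetTempPath()) 'pal-sample.html'
Set-Content -Path $samplePath -Value '<TABLE ID="table6">', '<TR><TD>1</TD></TR>', '</TABLE>'

# Default behaviour: one string per line of the file.
$lines = Get-Content -Path $samplePath

# The whole file as a single string, which a multi-line regex needs.
$text = [IO.File]::ReadAllText($samplePath)

"Lines read: $($lines.Count)"
"Single string: $($text -is [string])"

# Cleaning up the scratch file.
Remove-Item -Path $samplePath
```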
PowerShell also provides a built-in cmdlet called ConvertTo-Html, which works in the opposite direction: it converts .NET objects into HTML that can be displayed in a web browser. Pipe the output of a command to ConvertTo-Html and it returns the HTML as strings, by default a complete page with the objects rendered as a table.
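As a small illustration of that direction (objects in, HTML out), here is a sketch using two made-up rows rather than real PAL output. The property names Drive and Avg are assumptions chosen for the example.

```powershell
# Made-up sample rows, not real PAL output.
$rows = @(
    New-Object PSObject -Property @{ Drive = 'C:'; Avg = 7 }
    New-Object PSObject -Property @{ Drive = 'E:'; Avg = 24 }
)

# Select-Object fixes the column order; -Fragment emits only the <table>
# element instead of a complete HTML page.
$html = $rows | Select-Object Drive, Avg | ConvertTo-Html -Fragment

$html
```

This produces HTML from objects, so it does not help with reading an existing report; it is only useful if the scraped rows need to be written back out as HTML.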
OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:
# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
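# Navigate expects a full path or URL; "file.html" is a placeholder.
$oIE.Navigate("file.html")
# Waiting until the page has loaded (ReadyState 4 means complete).
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document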
# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")
# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }
# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += `
        ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}
# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    $cTds = $oTr.childNodes | Where-Object { $_.tagName -eq "td" }
    # Skipping the first row (headers).
    if ([String]::IsNullOrEmpty($cTds)) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
            -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}
# Closing IE.
$oIE.Quit()
# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation
Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
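I see you accepted an answer but I thought I'd add a RegEx solution in here too. No COM objects needed for this one. One caveat: the -raw switch on Get-Content and the .groups.value shorthand both need PowerShell 3.0 or later, so this will not run on PSv2 as written.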
$Path = 'C:\Path\To\File.html'
[regex]$regex = "(?s)<TABLE ID=.*?</TABLE>"
$tables = $regex.matches((GC $Path -raw)).groups.value
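ForEach($String in $tables){
    # Splitting on LF or CRLF so no stray carriage returns end up in the CSV.
    $table = $String -split "\r?\n"
    $CurTable = @()
    $CurTableName = ([regex]'TABLE ID="([^"]*)"').matches($table[0]).groups[1].value
    # Header row: turning the <TH> cells into a comma-separated line.
    $CurTable += ($table[1] -replace "</B></TH><TH><B>",",") -replace "</?(TR|TH|B)>"
    # Data rows: everything between the header row and the closing </TABLE>.
    $CurTable += $table[2..($table.count-2)]|ForEach{$_ -replace "</TD><TD>","," -replace "</?T(D|R)>"}
    $CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$CurTableName.csv" -notype
}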
That should output a CSV file for each table found, such as table6.csv, table9.csv, etc. If you wanted to output CSVs per HTML file, you could wrap the entire thing in a ForEach loop like:
# $Path has to be the folder holding the reports here, not a single file.
ForEach($File in (Get-ChildItem "$Path\*.html")){
    # Insert above code here
}
You would need to modify the $tables = line so that it reads GC $File.FullName -raw instead of a single fixed path, so that it loads each file as it iterates through.
Then just modify the Export-Csv to something like:
$CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$($File.BaseName)\$CurTableName.csv" -notype
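So if you had Server01.html with 3 tables in it you would get a folder named Server01 with 3 CSV files in it, one for each table. Note that Export-Csv does not create missing folders, so the Server01 folder has to exist (for example via New-Item -ItemType Directory) before the export runs.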
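The question also asks for a row count per table and for picking out one table by its ID. The following is a minimal sketch of that, assuming well-formed, non-nested tables with plain <TR>, <TH> and <TD> tags like the sample in the question. The function name Get-HtmlTableRow and the shortened sample HTML (including the Counter and Avg column names) are made up for illustration and are not part of either answer.

```powershell
# Hypothetical helper: returns one object per data row of the table with the given ID.
function Get-HtmlTableRow {
    param(
        [string]$Html,
        [string]$TableId
    )

    # Grabbing only the table whose ID matches (assumes no nested tables).
    $pattern = '(?is)<TABLE ID="' + [regex]::Escape($TableId) + '".*?</TABLE>'
    $match = [regex]::Match($Html, $pattern)
    if (-not $match.Success) { return }

    # Every <TR>...</TR> is one row; the first one holds the headers.
    $rows = [regex]::Matches($match.Value, '(?is)<TR>(.*?)</TR>')
    if ($rows.Count -eq 0) { return }
    $cellPattern = '(?is)<T[HD]>(.*?)</T[HD]>'

    # Header names, with inner tags such as <B> stripped.
    $headerCells = [regex]::Matches($rows[0].Groups[1].Value, $cellPattern)
    $headers = @($headerCells | ForEach-Object { $_.Groups[1].Value -replace '<[^>]+>' })

    # One PSObject per data row, with one property per header.
    for ($r = 1; $r -lt $rows.Count; $r++) {
        $cells = [regex]::Matches($rows[$r].Groups[1].Value, $cellPattern)
        $obj = New-Object PSObject
        for ($i = 0; $i -lt $headers.Count; $i++) {
            $obj | Add-Member -MemberType NoteProperty -Name $headers[$i] `
                -Value $cells[$i].Groups[1].Value
        }
        $obj
    }
}

# Shortened, made-up sample in the same shape as the PAL report table.
$sampleHtml = @'
<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>Counter</B></TH><TH><B>Avg</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>7</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>24</TD></TR>
</TABLE>
'@

# @() keeps the result an array even when only one row comes back.
$result = @(Get-HtmlTableRow -Html $sampleHtml -TableId 'table6')
"table6 has $($result.Count) data rows"
$result | Format-Table -AutoSize
```

Because the helper returns objects rather than CSV text, the result can be piped straight to Export-Csv, and cell values that contain commas do not need special handling.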