
How to extract specific tables from html file using native powershell commands?

I use the PAL tool (https://pal.codeplex.com/) to generate HTML reports from perfmon logs on Windows. After PAL processes the .blg files from perfmon, it dumps the information into an HTML document containing tables with various data points about how the system performed. I am currently writing a script that finds all HTML files in a directory and runs Get-Content on each of them.

What I would like to do is scrape the Get-Content output for specific tables that have a varying number of rows. Is it possible, using native PowerShell cmdlets, to look for specific tables, count how many rows are in each table, and dump just the desired tables and their rows?

Here is an example of the table format I'm trying to scrape:

<H3>Overall Counter Instance Statistics</H3>
<TABLE ID="table6" BORDER=1 CELLPADDING=2>
<TR><TH><B>Condition</B></TH><TH><B>\LogicalDisk(*)\Disk Transfers/sec</B></TH><TH><B>Min</B></TH><TH><B>Avg</B></TH><TH><B>Max</B></TH><TH><B>Hourly Trend</B></TH><TH><B>Std Deviation</B></TH><TH><B>10% of Outliers Removed</B></TH><TH><B>20% of Outliers Removed</B></TH><TH><B>30% of Outliers Removed</B></TH></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/C:</TD><TD>1</TD><TD>7</TD><TD>310</TD><TD>0</TD><TD>11</TD><TD>5</TD><TD>5</TD><TD>5</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/D:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/E:</TD><TD>0</TD><TD>24</TD><TD>164</TD><TD>-1</TD><TD>11</TD><TD>22</TD><TD>21</TD><TD>20</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/HarddiskVolume5</TD><TD>0</TD><TD>0</TD><TD>2</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/L:</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD><TD>0</TD></TR>
<TR><TD>No Thresholds</TD><TD>MACHINENAME/T:</TD><TD>0</TD><TD>7</TD><TD>430</TD><TD>0</TD><TD>21</TD><TD>3</TD><TD>2</TD><TD>2</TD></TR>
</TABLE>

The table ID is constant across all the output files, but the number of table rows is not. Any help is appreciated!

Tom A. asked Sep 19 '14 18:09



2 Answers

OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:

# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
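$oIE.Navigate("file.html")
# Waiting for the page to finish loading (ReadyState 4 = complete).
while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document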

# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")

# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }

# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += `
        ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}

# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    $cTds = $oTr.childNodes | Where-Object { $_.tagName -eq "td" }
    # Skipping the first row (headers).
    if ([String]::IsNullOrEmpty($cTds)) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
            -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}

# Closing IE.
$oIE.Quit()

# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation

Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.

Alexander Obersht answered Sep 22 '22 12:09


I see you accepted an answer, but I thought I'd add a RegEx solution here too. No COM objects are needed for this one. As written it needs PowerShell 3.0 or later, because Get-Content -Raw and the .groups.value member enumeration are not available in PSv2.

$Path = 'C:\Path\To\File.html'
[regex]$regex = "(?s)<TABLE ID=.*?</TABLE>"
$tables = $regex.Matches((Get-Content $Path -Raw)).Groups.Value
ForEach($String in $tables){
    # Splitting on \r?\n so that Windows line endings leave no stray carriage returns.
    $table = $String -split "\r?\n"
    $CurTable = @()
    $CurTableName = ([regex]'TABLE ID="([^"]*)"').matches($table[0]).groups[1].value
    $CurTable += ($table[1] -replace "</B></TH><TH><B>",",") -replace "</?(TR|TH|B)>"
    $CurTable += $table[2..($table.count-2)]|ForEach{$_ -replace "</TD><TD>","," -replace "</?T(D|R)>"}
    $CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$CurTableName.csv" -notype
}

That should output a CSV file for each table found, such as table6.csv, table9.csv, etc. If you wanted to output CSVs per HTML file, you could wrap the entire thing in a ForEach loop like:

# $Path must now point to the folder that contains the HTML files.
ForEach($File in (Get-ChildItem "$Path\*.html")){
    # Insert the code above here.
}

You would need to modify the $tables = line so that it calls GC $File.FullName -Raw instead of reading a fixed path; that way it loads each file as the loop iterates.

Then just modify the Export-Csv to something like:

$CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$($File.BaseName)\$CurTableName.csv" -notype

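So if you had Server01.html with 3 tables in it, you would get a folder named Server01 with 3 CSV files in it, one for each table. Note that Export-Csv does not create missing folders, so the Server01 folder has to exist first (for example, create it with New-Item -ItemType Directory).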
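
For reference, here is a rough sketch of how the pieces above might fit together in one function. It is not tested against real PAL output. The function name Export-PalTables and the parameters -Path and -OutputRoot are made up for illustration, and it assumes each table row sits on its own line, as in the sample from the question. It also returns one object per table with the number of data rows, since the question asked for a row count.

```powershell
# Sketch only: Export-PalTables, -Path and -OutputRoot are hypothetical names.
function Export-PalTables {
    param(
        [string]$Path,        # folder containing the HTML reports
        [string]$OutputRoot   # existing folder that receives one subfolder per report
    )

    # Group 1 captures the table ID; (?s) lets .*? span line breaks.
    $tableRegex = [regex]'(?s)<TABLE ID="([^"]*)".*?</TABLE>'

    foreach ($File in (Get-ChildItem -Path $Path -Filter *.html)) {
        # Read the whole file as one string so the regex can span lines.
        $html = [IO.File]::ReadAllText($File.FullName)

        # Export-Csv does not create missing folders, so create it up front.
        $outDir = Join-Path $OutputRoot $File.BaseName
        if (-not (Test-Path $outDir)) {
            New-Item -ItemType Directory -Path $outDir | Out-Null
        }

        foreach ($match in $tableRegex.Matches($html)) {
            $tableId = $match.Groups[1].Value

            # Keep only the <TR> lines (assumes one table row per line).
            $rows = @($match.Value -split "\r?\n" | Where-Object { $_ -match '^\s*<TR>' })
            if ($rows.Count -eq 0) { continue }

            # First row holds the headers, the remaining rows hold the data.
            $csv = @(($rows[0] -replace '</B></TH><TH><B>', ',') -replace '</?(TR|TH|B)>')
            if ($rows.Count -gt 1) {
                $csv += $rows[1..($rows.Count - 1)] |
                    ForEach-Object { ($_ -replace '</TD><TD>', ',') -replace '</?T(D|R)>' }
            }

            $csv | ConvertFrom-Csv |
                Export-Csv (Join-Path $outDir "$tableId.csv") -NoTypeInformation

            # Emit the file name, table ID and number of data rows.
            New-Object PSObject -Property @{
                File     = $File.Name
                TableId  = $tableId
                DataRows = $rows.Count - 1
            }
        }
    }
}
```

You would call it with something like Export-PalTables -Path 'C:\Path\To' -OutputRoot 'C:\Path\To\Output' and pipe the result to Format-Table to see the row counts. Like the code above, it will break on cell values that contain commas, so treat it as a starting point.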

TheMadTechnician answered Sep 19 '22 12:09