
Windows PowerShell parse HTML local file

I would like to build an array from an HTML file using PowerShell.

I am using a script which downloads the Mozilla Firefox Developer Edition HTML file (the index file) locally, and I would like to parse it to get the values of the option elements inside the select element whose id is id_country.

I have been advised to use XPath for that, but I can't figure out how to parse the file and build an array from the result. Maybe a regex could be a workaround.

The HTML file is here:

http://pastebin.com/b8cShFLA

And I would like to get all the values of the option elements here:

<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
   <option value="as">American Samoa</option>
   <option value="ad">Andorra</option>

...

I am quite new to PowerShell, which is why I am not really aware of the different solutions I might be able to use. I need something quite fast, as it is part of a package installer.

Basically, the script will check whether there is an installer that matches the locale of the user's computer and, if not, default to English. That is why I need the values from that list: to check the locales available for Firefox Developer Edition.

Regards, O

anchnk asked May 13 '26 17:05


2 Answers

I don't see a code sample to fix, so I'll make one.

If it were a remote HTML page I would use Invoke-WebRequest, but that doesn't work too well with local files.

For local files I would recommend parsing the HTML with HTML Agility Pack and then using XPath to get the options you're looking for. For example:

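# Load the HTML Agility Pack assembly (path is relative to the current directory)
Add-Type -Path .\HTMLAgilityPack\HtmlAgilityPack.dll
$path = (Get-Item .\b8cShFLA.html).FullName

$doc = New-Object HtmlAgilityPack.HtmlDocument
# -Raw (PowerShell 3.0+) reads the file as a single string; without it Get-Content
# returns an array of lines, which gets joined with spaces when passed to LoadHtml
$doc.LoadHtml((Get-Content $path -Raw))

# Create hashtable to store data in
$langs = @{}

$doc.DocumentNode.SelectSingleNode("//select[@name='country']").SelectNodes("option") | ForEach-Object {
    # Read the value attribute by name rather than by position
    $short = $_.GetAttributeValue('value', '')
    # Some HTML Agility Pack versions treat <option> as an empty element, so the label
    # lands in the following text node; if NextSibling is empty with your version,
    # try $_.InnerText instead
    $long = $_.NextSibling.InnerText

    # Store data in hashtable
    $langs[$short] = $long
}

$langs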

Output:

Name                           Value
----                           -----
rw                             Rwanda
tv                             Tuvalu
to                             Tonga
pn                             Pitcairn
bh                             Bahrain
lc                             Saint Lucia   
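
For the regex workaround raised in the question, a dependency-free sketch might look like the following. It assumes PowerShell 3.0 or later and markup shaped like the snippet in the question (double-quoted attributes, no nested tags inside each option). The variable names and the inline sample are illustrative only, and regex matching on HTML is more brittle than a real parser.

```powershell
# Sample input: a trimmed copy of the <select> from the question.
# In a real script this string would come from Get-Content with -Raw.
$html = @'
<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
</select>
'@

# Step 1: isolate the body of the <select> whose id is id_country.
# (?s) lets . match newlines; .*? stops at the first closing tag.
$selectPattern = '(?s)<select[^>]*\bid="id_country"[^>]*>(?<body>.*?)</select>'
$selectMatch = [regex]::Match($html, $selectPattern)

# Step 2: collect the value attribute and the label of every option.
$optionPattern = '<option[^>]*\bvalue="(?<code>[^"]*)"[^>]*>(?<name>[^<]*)</option>'
$countries = foreach ($m in [regex]::Matches($selectMatch.Groups['body'].Value, $optionPattern)) {
    [pscustomobject]@{
        Code = $m.Groups['code'].Value
        Name = $m.Groups['name'].Value.Trim()
    }
}

# Plain array holding only the codes.
$codes = @($countries | ForEach-Object { $_.Code })
$codes
```

If the select element is not found, $codes ends up as an empty array, so the caller can fall back to a default without extra error handling.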
Frode F. answered May 15 '26 06:05


If you're running PowerShell 3.0 or above, you can take advantage of Invoke-WebRequest for pages that exist out on the web. If you're operating against a local file, it can be a bit finicky.

Invoke-WebRequest returns an HtmlWebResponseObject with a property called ParsedHtml (this is a Windows PowerShell feature; PowerShell 6 and later do not provide ParsedHtml). This object has a method named getElementById, which we can use since we know the id "id_country" on your select tag. From there, it is a simple matter to iterate over the option tags and filter down to the properties we want: "Text" and "value".

The example below outputs a custom object containing the country name and the country code:

Code:

# I'm using your raw pastebin endpoint for this example
$result = Invoke-WebRequest "http://pastebin.com/raw.php?i=b8cShFLA"

# Only return specific properties from the elements you're looking for
$countries = $result.ParsedHtml.getElementById("id_country") |
    Where-Object tagName -eq "option" |
    Select-Object -Property Text, Value

# Country name and code are stored to this variable
$countries

Output:

text                                                        value
----                                                        -----
Afghanistan                                                 af
Albania                                                     al
Algeria                                                     dz
American Samoa                                              as
Andorra                                                     ad
...                                                         ...

You can then use the country name and code as you would any other property on PowerShell objects.
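
As a rough illustration of that last point, the sketch below flattens objects shaped like the ones in the output above into a plain array of codes and then picks either a wanted code or a default. The function name Select-InstallerCode, the stand-in data, and the 'us' fallback are all hypothetical; the question only says the installer should default to English.

```powershell
# Stand-in for the $countries objects shown in the output above (text/value pairs).
$countries = @(
    [pscustomobject]@{ text = 'Afghanistan'; value = 'af' }
    [pscustomobject]@{ text = 'Albania';     value = 'al' }
    [pscustomobject]@{ text = 'Algeria';     value = 'dz' }
)

# Flatten to a plain array of codes.
$codes = @($countries | Select-Object -ExpandProperty value)

# Hypothetical helper: return $Wanted if it is in the list, otherwise $Fallback.
function Select-InstallerCode {
    param(
        [string[]] $Available,
        [string]   $Wanted,
        [string]   $Fallback
    )
    # -contains compares case-insensitively by default
    if ($Available -contains $Wanted) { return $Wanted }
    return $Fallback
}

Select-InstallerCode -Available $codes -Wanted 'dz' -Fallback 'us'   # dz
Select-InstallerCode -Available $codes -Wanted 'xx' -Fallback 'us'   # us
```

In an installer, the wanted code would presumably be derived from the machine's settings (for example from Get-Culture), which is left out here because the mapping from locale to list entry depends on the page being parsed.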

As for the web endpoint, it sounds like you could modify this script to point to the original Mozilla page you're extracting this HTML from?

Anthony Neace answered May 15 '26 06:05