I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.
Example String:
<p>This is an example</br>of various <span style="color: #445444">html content</span>
I have tried using:
$string -replace '\<([^\)]+)\>',''
It works with simple examples, but with ones such as the above it captures the whole string.
Any suggestions on what's the best way to achieve this?
Thanks in advance
For a pure regex, it should be as easy as <[^>]+>:
$string -replace '<[^>]+>',''
Note that this could fail with certain HTML comments or the contents of <pre> tags.
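To make the difference concrete, here is a small sketch using the example string from the question. The variable names and the comment test string are only illustrative assumptions, not part of the original post.

```powershell
$string = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'

# Pattern from the question: [^\)]+ matches anything except ')', including '>',
# so a single greedy match runs from the first '<' to the last '>'.
$greedy = $string -replace '\<([^\)]+)\>', ''

# Excluding '>' inside the tag stops each match at the nearest closing bracket.
$stripped = $string -replace '<[^>]+>', ''

# A comment that contains '>' is cut short by the same pattern.
$comment = 'before<!-- a > b -->after' -replace '<[^>]+>', ''

$greedy     # empty string
$stripped   # This is an exampleof various html content
$comment    # before b -->after
```

The last line shows the kind of failure mentioned above: the match ends at the first '>' inside the comment, so part of the comment is left behind.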
Instead, you could use the HTML Agility Pack, which is designed for use in .NET code; I've used it successfully in PowerShell before:
Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText
HTML Agility Pack works well with non-perfect HTML.
To resolve umlauts and special characters I used an HTML object. Here is my function:
Function ConvertFrom-Html
{
    <#
    .SYNOPSIS
        Converts an HTML string to plain text.
    .DESCRIPTION
        Creates an HTMLFile COM object and uses innerText to get the plain text.
        If that fails, it replaces several HTML character entities and removes all <> tags via regex.
    .INPUTS
        String. HTML as a string.
    .OUTPUTS
        String. The HTML text as plain text.
    .EXAMPLE
        $html = "<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
        ConvertFrom-Html -Html $html
        $html | ConvertFrom-Html
        Result:
        "Nutzen:
        Der Nutzen ist überaus groß.
        Test ob 3 < als 5 ist & "4" > '2'?"
    .NOTES
        Author: Ludwig Fichtinger FILU
        Initial Creation Date: 01.06.2021
        ChangeLog: v2 20.08.2021 try catch with replace for systems without Internet Explorer
    #>
    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )
    try
    {
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch
    {
        # Fallback for systems without Internet Explorer.
        $nl = [System.Environment]::NewLine
        $PlainText = $Html -replace '<br>',$nl
        $PlainText = $PlainText -replace '<br/>',$nl
        $PlainText = $PlainText -replace '<br />',$nl
        $PlainText = $PlainText -replace '</p>',$nl
        $PlainText = $PlainText -replace '&nbsp;',' '
        # -creplace is case-sensitive, so &Auml; and &auml; are not confused.
        $PlainText = $PlainText -creplace '&Auml;','Ä'
        $PlainText = $PlainText -creplace '&auml;','ä'
        $PlainText = $PlainText -creplace '&Ouml;','Ö'
        $PlainText = $PlainText -creplace '&ouml;','ö'
        $PlainText = $PlainText -creplace '&Uuml;','Ü'
        $PlainText = $PlainText -creplace '&uuml;','ü'
        $PlainText = $PlainText -replace '&szlig;','ß'
        $PlainText = $PlainText -replace '&quot;','"'
        $PlainText = $PlainText -replace '&apos;',"'"
        # Strip tags before decoding &gt; and &lt; so decoded brackets are not treated as tags.
        $PlainText = $PlainText -replace '<.*?>',''
        $PlainText = $PlainText -replace '&gt;','>'
        $PlainText = $PlainText -replace '&lt;','<'
        # Decode &amp; last so that text such as &amp;lt; is not decoded twice.
        $PlainText = $PlainText -replace '&amp;','&'
    }
    return $PlainText
}
Example:
"<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html
Result:
Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
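If the COM object is not available, a possible alternative to a hand-written entity table is the built-in .NET method [System.Net.WebUtility]::HtmlDecode, which decodes named and numeric character references in one call. The following is only a minimal sketch, not a replacement for the function above; the sample string and variable names are assumptions made for illustration.

```powershell
$html = '<p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />3 &lt; 5 &amp; 5 &gt; 3</p>'

# Turn <br>, <br/> and <br /> into line breaks before removing the other tags.
$withBreaks = $html -replace '<br\s*/?>', "`n"

# Strip tags first so that decoded '<' and '>' are not mistaken for tags.
$noTags = $withBreaks -replace '<[^>]+>', ''

# Decode the remaining character references (&uuml;, &szlig;, &lt;, &amp;, ...).
$plain = [System.Net.WebUtility]::HtmlDecode($noTags)

$plain
```

For this input the result is two lines: the German sentence with its umlauts restored, followed by "3 < 5 & 5 > 3". The tag removal still has the regex limitations discussed earlier in the thread.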
You can try this:
$string -replace '<.*?>',''
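A short sketch of why the `?` matters here, and of one limitation: by default `.` does not match a newline in .NET regular expressions, so a tag that spans two lines is skipped unless the `(?s)` option is added. The test strings are made up for illustration.

```powershell
$string = '<p>one</p><p>two</p>'

# Greedy: '.*' runs to the last '>' and removes the whole string.
$greedy = $string -replace '<.*>', ''

# Lazy: '.*?' stops at the nearest '>' and removes one tag at a time.
$lazy = $string -replace '<.*?>', ''

# A tag with a line break inside it.
$multi = "<span`nclass='x'>text</span>"

# Without (?s) the opening tag is not matched, only </span> is removed.
$missed = $multi -replace '<.*?>', ''

# With (?s), '.' also matches the newline, so both tags are removed.
$fixed = $multi -replace '(?s)<.*?>', ''

$greedy   # empty string
$lazy     # onetwo
$fixed    # text
```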