
PowerShell remove HTML tags in string content

I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.

Example String:

<p>This is an example</br>of various <span style="color: #445444">html content</span>

I have tried using:

$string -replace '\<([^\)]+)\>',''

It works with simple examples, but with strings such as the one above it matches the whole string.

Any suggestions on what's the best way to achieve this?

Thanks in advance

Arturski asked Apr 28 '15 21:04


3 Answers

For a pure regex, it should be as easy as <[^>]+>:

$string -replace '<[^>]+>',''
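
To see why the pattern from the question fails while this one works, here is a small comparison on the example string. It is only an illustrative sketch; the variable names $greedy and $stripped are made up for this demonstration.

```powershell
$string = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'

# Pattern from the question: [^\)] only excludes ')', so the greedy match
# runs from the first '<' to the last '>' and removes the whole string.
$greedy = $string -replace '\<([^\)]+)\>', ''

# [^>] cannot cross a '>', so every tag is matched on its own.
$stripped = $string -replace '<[^>]+>', ''

$greedy      # empty string
$stripped    # This is an exampleof various html content
```

Note that removing </br> without a replacement joins "example" and "of"; replace line-break tags with a space or newline first if that matters.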


Note that this could fail with certain HTML comments or the contents of <pre> tags.
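
As a rough illustration of that caveat, the two inputs below (invented for this sketch) each contain a '>' that is not the end of a tag, so the match stops too early and leaves debris behind:

```powershell
# A '>' inside an attribute value ends the match before the real end of the tag.
$attr = '<a title="a > b">link</a>' -replace '<[^>]+>', ''

# A '>' inside a comment has the same effect.
$comment = '<!-- 1 > 0 -->text' -replace '<[^>]+>', ''

$attr       # ' b">link'
$comment    # ' 0 -->text'
```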

Instead, you could use the HTML Agility Pack, which is designed for use in .NET code; I've used it successfully in PowerShell before:

Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText

HTML Agility Pack works well with non-perfect HTML.

briantist answered Nov 01 '22 09:11


To resolve umlauts and other special characters, I used an HTML COM object. Here is my function:

Function ConvertFrom-Html
{
    <#
        .SYNOPSIS
            Converts an HTML string to plain text.

        .DESCRIPTION
            Creates an HTMLFile COM object and uses innerText to get the plain text.
            If that throws an error, it replaces several HTML special-character placeholders and removes all <> tags via regex.

        .INPUTS
            String. The HTML as a string.

        .OUTPUTS
            String. The HTML text as plain text.

        .EXAMPLE
        $html = "<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
        ConvertFrom-Html -Html $html
        $html | ConvertFrom-Html

        Result:
        "Nutzen:
        Der Nutzen ist überaus groß.
        Test ob 3 < als 5 ist & "4" > '2'?"


        .Notes
            Author: Ludwig Fichtinger FILU
            Initial Creation Date: 01.06.2021
            ChangeLog: v2 20.08.2021 try catch with replace for systems without Internet Explorer

    #>

    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )

    try
    {
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch
    {
        # Fallback for systems where the HTMLFile COM object is unavailable.
        $nl = [System.Environment]::NewLine
        # Turn line-breaking tags into newlines, then strip all remaining tags.
        $PlainText = $Html -replace '<br\s*/?>|</p>', $nl
        $PlainText = $PlainText -replace '(?s)<.*?>', ''
        $PlainText = $PlainText -replace '&nbsp;', ' '
        # -creplace is case-sensitive; plain -replace would also turn &auml; into 'Ä'.
        $PlainText = $PlainText -creplace '&Auml;', 'Ä'
        $PlainText = $PlainText -creplace '&auml;', 'ä'
        $PlainText = $PlainText -creplace '&Ouml;', 'Ö'
        $PlainText = $PlainText -creplace '&ouml;', 'ö'
        $PlainText = $PlainText -creplace '&Uuml;', 'Ü'
        $PlainText = $PlainText -creplace '&uuml;', 'ü'
        $PlainText = $PlainText -replace '&szlig;', 'ß'
        $PlainText = $PlainText -replace '&quot;', '"'
        $PlainText = $PlainText -replace '&apos;', "'"
        $PlainText = $PlainText -replace '&gt;', '>'
        $PlainText = $PlainText -replace '&lt;', '<'
        # Decode &amp; last so that '&amp;lt;' yields the literal text '&lt;'.
        $PlainText = $PlainText -replace '&amp;', '&'
    }

    return $PlainText
}

Example:

"<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html

Result:

Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
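
If the COM object is not an option, the hand-written entity table in the fallback could probably be replaced by the .NET method [System.Net.WebUtility]::HtmlDecode. The following is only a sketch, assuming .NET Framework 4.0 or later (or PowerShell 7), with a shortened sample string made up for the demonstration:

```powershell
# Assumption: System.Net.WebUtility is available (.NET Framework 4.0+ / .NET Core).
$html = '<p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.</p><p>3 &lt; 5 &amp; &quot;4&quot; &gt; 2</p>'

# Strip the tags first, so that a decoded '<' or '>' is never mistaken for a tag.
$text = $html -replace '<[^>]+>', ''

# Decode all named and numeric entities in a single call.
$text = [System.Net.WebUtility]::HtmlDecode($text)
$text
```

Unlike the fallback above, HtmlDecode turns &nbsp; into a real non-breaking space (U+00A0) rather than an ordinary space.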
Ludwig Fichtinger answered Nov 01 '22 08:11


You can try this:

$string -replace '<.*?>',''
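
One thing to watch with this pattern: by default '.' does not match a newline, so a tag that spans several lines is skipped. A small sketch with a made-up two-line tag, where the inline option (?s) lets '.' match newlines as well:

```powershell
# Hypothetical input: the opening tag is split across two lines.
$multi = "<span`n  class='x'>text</span>"

$plain  = $multi -replace '<.*?>', ''        # only </span> is removed
$dotall = $multi -replace '(?s)<.*?>', ''    # both tags are removed

$plain
$dotall    # text
```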
Giedrius answered Nov 01 '22 09:11