
Multiline Regex in PowerShell

I have a PowerShell script whose main purpose is to search through the HTML files in a folder, find specific HTML markup, and replace it with what I specify.

I have been able to do 3/4 of my find and replaces perfectly. The one I am having trouble with involves a Regular Expression.

This is the markup that I am trying to make my regex find and replace:

<a href="programsactivities_skating.html"><br />
                                           </a>

Here is the regex I have so far, along with the function I am using it in:

automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)' -replace ''

And here is the automate function:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = Get-Content $file
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to account for multiline input, and that's what I'm having trouble with.

Any help anyone can provide is greatly appreciated.

Thanks in Advance.

Matt Bettiol asked Feb 20 '14 14:02



1 Answer

Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:

$text = Get-Content $file | Out-String

or

[String]$text = Get-Content $file

or

$text = [IO.File]::ReadAllText($file)

Note that the first and second methods don't preserve the line breaks from the input file. Method 2 replaces every line break with a space, as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
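To illustrate the difference, here is a minimal sketch that uses a hypothetical two-line sample in place of a real file. The pattern is a simplified version of the one in the question (the negative lookahead is left out, which is my assumption for brevity). When -replace is given an array it runs against each element separately, so the pattern never sees the line break; given a single string, \s can match across it.

```powershell
# Hypothetical sample standing in for what Get-Content returns: one string per line.
$lines = '<a href="programsactivities_skating.html"><br />', '                                           </a>'

# Simplified form of the question's pattern (lookahead omitted for this sketch).
$query = '(?is)<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>'

# Array input: -replace is applied to each line on its own, so nothing matches.
$perLine = $lines -replace $query, ''

# Single string: the line break is part of the input, so the pattern can span it.
$joined = $lines -join "`r`n"
$whole  = $joined -replace $query, ''

$perLine.Count   # 2, both lines are returned unchanged
$whole.Length    # 0, the whole empty anchor was removed
```

On PowerShell 3.0 and later, Get-Content $file -Raw should also return the file as one string with its original line breaks, which may be a convenient alternative to [IO.File]::ReadAllText.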

Ansgar Wiechers answered Sep 23 '22 17:09
