Why does removal of empty lines from multiline string in PowerShell fail using Replace function?

I am loading HTML emails. First I remove the HTML tags, replace each &nbsp; with a space, and collapse double spaces into a single space. That works.

But now I have a lot of empty lines which I cannot remove. I have seen the examples which remove empty lines while reading a file, but I don't have any empty lines before I remove the HTML tags and the spaces.

I do:

$m = [IO.File]::ReadAllText("$emailFolder\$fName")
$m = $m -replace "<((?!@).)*?>" # removes all HTML tags but not addresses like <[email protected]>
$m = $m -replace "&nbsp;"," "
$m = $m.Replace('  ',' ').Replace('  ',' ').Replace('  ',' ')
$m = $m.Replace('`r','').Replace('`n`n','`n').Replace('`n`n','`n') # does nothing :(

I tried various versions; none of them removed the empty lines. Any idea how I can achieve that?

Besides that, I tried to use a regex quantifier to match several spaces in a row, and that failed too.

What am I doing wrong?

$m = $m.Replace(' +',' ')  # does not work
$m = $m.Replace('\s+',' ') # does not work either
gooly asked Aug 03 '14 16:08


2 Answers

If I understand you correctly, you don't want to remove all line breaks, just "empty" lines (lines that consist of nothing but whitespace).

Consider this sample string:

$multiLine = "Line 1`r`nLine 2`nLine 3`r`n`r`n  `n `t `r`nLine 7`r`n"

When displayed, it will look like this on screen:

Line 1
Line 2
Line 3



Line 7

Line 4 is actually a blank line, with nothing but a CRLF. Line 5 is two spaces followed by a single LF; Line 6 is a space, a tab, a space, then a CRLF. I mixed line endings because HTML can be a mess; it's good to be prepared for anything!

To handle all of these, you can do a replace like this:

$multiLine -creplace '(?m)^\s*\r?\n',''

What Does This Do?

  1. -creplace is just the case-sensitive version of -replace (I like to be explicit).
  2. (?m) is an inline way to set regular expression modes. The m mode stands for multi-line, and it lets the ^ and $ anchors match the beginning/end of each line in a string (rather than the beginning and end of the string). This is the key to your issue, I think.
  3. We're using ^ to match the beginning of each line, then matching 0 or more whitespace using the \s class, which includes tab.
  4. We're matching an optional carriage return (for Windows line breaks), followed by a line feed. We don't need to match multiples of these explicitly: ^ matches at the start of every line, and \s also matches line breaks, so a run of consecutive blank lines is caught as well.
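
The pieces described in the list can be tried out together. This is a minimal sketch on the sample string from above; the variable names $cleaned and $visible are only for illustration, and the last step just makes the remaining line endings visible so the result is easy to check.

```powershell
# Sample string from above: mixed CRLF/LF endings and whitespace-only lines.
$multiLine = "Line 1`r`nLine 2`nLine 3`r`n`r`n  `n `t `r`nLine 7`r`n"

# (?m)   : ^ matches at the start of every line, not only at the start of the string
# ^\s*   : a run of whitespace beginning at a line start (can span several blank lines)
# \r?\n  : the line break that ends the last blank line of that run
$cleaned = $multiLine -creplace '(?m)^\s*\r?\n', ''

# Show CR and LF as the literal text \r and \n.
# The search strings are double-quoted so that `r and `n are real control characters.
$visible = $cleaned.Replace("`r", '\r').Replace("`n", '\n')
$visible   # Line 1\r\nLine 2\nLine 3\r\nLine 7\r\n
```

Note that the original line endings of the non-empty lines are kept as they were (CRLF or LF); only the whitespace-only lines are dropped.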

The Resulting Output

Line 1
Line 2
Line 3
Line 7
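
As for why the attempts in the question do nothing, here is a small sketch. The input string is a made-up stand-in for the question's $m, and the variable names are only for illustration. Two things are at play: backtick escapes such as `n are only expanded inside double quotes, and .Replace() is the .NET String.Replace method, which looks for literal text rather than a regex.

```powershell
# Hypothetical stand-in for the question's $m: a double space and one empty line.
$m = "a  b`r`n`r`nc"

# 1) In single quotes, `n is not a line feed but two literal characters.
$singleQuotedLength = '`n'.Length   # 2 (backtick + n)
$doubleQuotedLength = "`n".Length   # 1 (a real line feed)

# 2) String.Replace searches for the literal text ' +' (space, plus sign),
#    which does not occur in $m, so the string comes back unchanged.
$literal = $m.Replace(' +', ' ')

# 3) -replace and -creplace take a regex, so quantifiers and \r, \n work.
#    The regex engine interprets \r and \n itself, so single quotes are fine here.
$regex = $m -replace ' +', ' ' -creplace '(?m)^\s*\r?\n', ''
```

Here $literal is still identical to $m, while $regex holds "a b", a CRLF, and "c".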
briantist answered Oct 05 '22 23:10


This seems to work:

$m -replace '(?ms)(?:\r|\n)^\s*$'
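
A quick way to try this pattern is to run it on the sample string from the other answer and make the line endings visible. This is only a sketch: Show-LineEndings is a hypothetical helper, not a built-in cmdlet, and the variable names are made up. The two patterns anchor differently (this one starts at the line break before a blank line, the other at the start of the blank line), so it is worth comparing what each leaves at the very end of your own input.

```powershell
# Sample string with mixed CRLF/LF endings and whitespace-only lines.
$multiLine = "Line 1`r`nLine 2`nLine 3`r`n`r`n  `n `t `r`nLine 7`r`n"

# Pattern from this answer: a line break followed by an empty or whitespace-only line.
# With no replacement operand, -replace substitutes an empty string.
$fromThisAnswer = $multiLine -replace '(?ms)(?:\r|\n)^\s*$'

# Pattern from the other answer, for comparison.
$fromOtherAnswer = $multiLine -creplace '(?m)^\s*\r?\n', ''

# Hypothetical helper: show CR and LF as the literal text \r and \n.
function Show-LineEndings([string]$Text) {
    $Text.Replace("`r", '\r').Replace("`n", '\n')
}

Show-LineEndings $fromThisAnswer
Show-LineEndings $fromOtherAnswer
```

In both results, "Line 3" is followed directly by "Line 7" with a single CRLF in between.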
mjolinor answered Oct 05 '22 23:10