
DotAll and multiline RegEx

I'm having a little trouble using regex in PowerShell. It seems like there is an implementation error or something.

The text I want to work with is an HTML file, which looks like this (Example1):

<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: %faxNumber%]</span>
<Span>

The problem is that, because of HTML editors, I may also get something like this (Example2):

<span>[Mobile: 

%mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>

So as you can see, we get line breaks and HTML-escaped non-breaking spaces (&nbsp;).

My PowerShell regex looks like this:

$x = $x -ireplace '(?ms)\[(.?){7}Fax(.*?)\]', 'MyReplacement1'

and this

$x = $x -ireplace '(?ms)\[(.?){7}Mobile(.*?)\]', 'MyReplacement2'

Basically, [ marks the beginning of a variable and ] the end of it. Two problems arise from this:

  1. Since we have two variables, mobile and fax, I'm using (.?){7} to allow SOME characters (here up to 7) and avoid matching the whole part between the first [ near Mobile and the last ] near Fax (which would happen if I used (.*?) instead of (.?){7}). I'm not sure whether there is an alternative that allows ANY number of characters (not just 7) between the opening [ and the variable keyword, "Fax" for example. This would be useful to avoid mismatches when stuff like &nbsp;&nbsp; gets added (where 7 characters would not be enough and, as I said, (.*?) fails). I hope I was able to explain it (it's kind of hard); if not, please feel free to ask!
  2. PowerShell's -replace operator doesn't offer a way to set regex options, so I have to use (?ms) inside the pattern to set DotAll and multiline modes, as you can see above. However, when a newline is added, as in Example2 between the words Mobile: and %mobile%, the regex fails and nothing gets replaced!

I'm grateful for any help, and also for regex recommendations from the pros to avoid further problems I'm not thinking about right now...

EDIT: (Example3):

<span>[Mobile: 

%mobile% |] Phone: %telephone% [| Fax: 
%faxNumber%]</span>
asked Dec 27 '22 13:12 by omni


1 Answer

The trick to get around DotAll mode is to use [\s\S] instead of the dot (.). This character class matches any character, because it matches both whitespace and non-whitespace characters. ([\w\W] or [\d\D] work the same way, but [\s\S] seems to be the convention.)
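As a quick illustration of that difference, here is a minimal sketch. The sample string and the variable names are made up for this example; only the three pattern variants matter.

```powershell
# Hypothetical sample input with a line break inside the brackets.
$text = "[Mobile: `n%mobile% |]"

# Without DotAll, the dot stops at the newline, so the closing ] is never reached.
$dotDefault = $text -match '\[.*?\]'

# [\s\S] matches any character, newlines included.
$anyChar = $text -match '\[[\s\S]*?\]'

# The inline (?s) option gives the dot the same behaviour.
$dotAll = $text -match '(?s)\[.*?\]'

# Prints False, True, True
$dotDefault, $anyChar, $dotAll
```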

To get around the 7, you can simply disallow a closing ] before the one you actually want to match (which, by the way, also makes DotAll unnecessary, since a negated character class matches newlines too). So something like this should work fine for you:

\[([^\]:]*)Fax([^\]]*)\]

It looks a bit ugly, but it simply means this:

\[        # literal [
(         # capturing group 1
  [^\]:]* # match as many non-:, non-] characters as possible
)         # end of group 1
Fax       # literal Fax
(         # capturing group 2
  [^\]]*  # match as many non-] characters as possible
)         # end of group 2
\]        # literal ]
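If you want to keep those comments in a script, the .NET regex engine accepts the annotated form once the inline (?x) option (ignore pattern whitespace) is switched on. This is only a sketch: $pattern, $sample and $result are names picked for the example, and the sample text is Example1 from the question.

```powershell
# (?x) makes the engine ignore unescaped whitespace and treat # as a comment start.
$pattern = @'
(?x)
\[        # literal [
(         # capturing group 1
  [^\]:]* # match as many non-:, non-] characters as possible
)         # end of group 1
Fax       # literal Fax
(         # capturing group 2
  [^\]]*  # match as many non-] characters as possible
)         # end of group 2
\]        # literal ]
'@

$sample = '<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>'
$result = $sample -ireplace $pattern, 'MyReplacement'
$result
```

Only the [| Fax: ...] block should be replaced; the [Mobile: ...] block stays untouched because the first group is not allowed to cross the colon.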

Further reading on character classes.

Note that none of these patterns needs multiline mode m (neither yours nor mine), because all it does is make ^ and $ match at the beginning and end of each line, respectively. None of the patterns contains these meta-characters, so the modifier does not do anything.
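To see what the m modifier actually changes, here is a small sketch with a made-up two-line string (the variable names are hypothetical). It counts how often an anchored pattern matches with and without (?m), then checks that adding (?m) to the bracket pattern makes no difference.

```powershell
# Two lines separated by a newline.
$lines = "first`nsecond"

# Without (?m), ^ only matches at the very start of the string.
$withoutM = [regex]::Matches($lines, '^\w+').Count

# With (?m), ^ also matches right after each newline.
$withM = [regex]::Matches($lines, '(?m)^\w+').Count

# The bracket pattern contains no ^ or $, so (?m) changes nothing.
$sample = "[| Fax: `n%faxNumber%]"
$plain  = $sample -ireplace '\[([^\]:]*)Fax([^\]]*)\]', 'X'
$multi  = $sample -ireplace '(?m)\[([^\]:]*)Fax([^\]]*)\]', 'X'

# Prints 1, 2, True
$withoutM, $withM, ($plain -eq $multi)
```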

My console output:

PS> $x = "<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>"
PS> $x -ireplace '\[([^\]:]*)Mobile([^\]]*)\]', 'MyReplacement1'
<span>MyReplacement1 Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>
PS> $x -ireplace '\[([^\]:]*)Fax([^\]]*)\]', 'MyReplacement2'
<span>[Mobile: %mobile% |] Phone: %telephone% MyReplacement2</span>
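The same two patterns should also cope with the line breaks from Example3, since a negated character class matches newlines. A minimal sketch, assuming the whole text is held in one string (here built by hand with `n for the line breaks):

```powershell
# Example3 from the question as a single string; `n stands for each line break.
$x = "<span>[Mobile: `n`n%mobile% |] Phone: %telephone% [| Fax: `n%faxNumber%]</span>"

# No (?s) or (?m) needed: [^\]]* happily runs across the newlines.
$x = $x -ireplace '\[([^\]:]*)Mobile([^\]]*)\]', 'MyReplacement1'
$x = $x -ireplace '\[([^\]:]*)Fax([^\]]*)\]', 'MyReplacement2'

# Prints <span>MyReplacement1 Phone: %telephone% MyReplacement2</span>
$x
```

The question does not show how $x is loaded, so this part is an assumption: if it comes from Get-Content, keep in mind that Get-Content returns one string per line by default and -replace then runs on each line separately, so no pattern can match across a line break. Reading the file as one string (for example with Get-Content -Raw in PowerShell 3.0 and later) avoids that.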
answered Jan 07 '23 14:01 by Martin Ender