
Matching strings in PowerShell

The gist of this question is:

So it seems that if ($C -match $b.Name) considers a partial match of a string a match? Is there a better way to force a complete [match] of a string?

I've got a directory that gets populated with a ton of .7z files. I need to clean this directory constantly. There is another script, that predates my employment here, that's currently working, but it's made up of 3000 lines and generates incorrect matches constantly and doesn't log what moved or what was deleted. Part of what makes it so big is that it has a ton of paths for where these files need to be moved to hard coded in it. Occasionally, these paths change and it's a pain in the butt to update.

So I've started making a much smaller script that has all these paths referenced in a CSV file. In addition to these paths, the CSV file also has known filenames recorded in it.

I'm trying to match the file names against the recorded names in my CSV file. It generally works, but sometimes I get incorrect matches.

Let's say I have two files that start similarly, Apple and Apple_Pie. Apple will match to Apple and move to the right directory, but Apple_Pie will first match to Apple and move to the wrong directory. Before the $C variable is cleared out it will match Apple_Pie to the right directory, but by that point Apple_Pie no longer exists in the original directory it needs to be moved from.

So it seems that if ($C -match $b.Name) considers a partial match of a string a match? Is there a better way to force a complete match of a string?

I assume that I'm a bit off with my expectations of how -match should work.

The regex stuff I have here is to strip each file name of the date-time that gets added to the file name by another automated process. I use this to isolate the file name I want to match.

$Wild = "C:\Some\Folder\With\Files\"

$CSV = "C:\Another\Folder\Paths.csv"

$Content = gci $wild

$Reg1 = [regex] '_[0-9]{4}-[0-9]{2}-[0-9]{2}[A-Z]{1}[0-9]{2}_[0-9]{2}_[0-9]{2}'

$Reg2 = [regex] '[0-9]{4}-[0-9]{2}-[0-9]{2}[A-Z]{1}[0-9]{2}_[0-9]{2}_[0-9]{2}'

$Paths = import-csv -path $CSV -header Name, Path

foreach ($a in $content) {
    $c = $a.BaseName

    if ($c -match $reg1) {
        $c = $c -replace $reg1
    }
    elseif ($c -match $reg2) {
        $c = $c -replace $reg2
    }

    foreach ($b in $Paths) {

        if ($c -match $b.Name) {
            # Do something (e.g., move the file to $b.Path)
        }
    }
}
asked Jul 18 '18 at 15:07 by Ochuse


2 Answers

tl;dr

  • -match indeed makes the regex on the RHS (right-hand side) match substrings by default:
    • 'foo' -match 'o' # true
  • However, you can anchor the regex with ^ to match the start of the input string, and/or $ to match the end:
    • 'foo' -match '^foo$' # true - full match
    • 'foot' -match '^foo$' # false

Read on for details and information about other string-matching operators.


Preface:

  • PowerShell string-comparison operators are case-insensitive by default and use the invariant culture rather than the current culture.

    • You can opt into case-sensitive matching by using prefix c;
      e.g., -cmatch instead of -match.
  • All comparison operators can be negated with prefix not; e.g., -notmatch negates -match.

  • With a single string as the LHS, the comparison operators return $True or $False, but with an array of strings they act as filters; that is, they return the subarray of elements for which the comparison is true.
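
As a quick illustration of the three points above, here is a minimal sketch; the sample names are made up for the example and the variable names are placeholders:

```powershell
# Case-insensitive by default; the c prefix opts into case-sensitivity.
$insensitive = 'Apple' -match 'apple'      # True
$sensitive   = 'Apple' -cmatch 'apple'     # False
$negated     = 'Apple' -notmatch 'apple'   # False

# With an array as the LHS, the operator acts as a filter.
$names  = 'Apple', 'Apple_Pie', 'Cherry'
$prefix = @($names -match '^Apple')        # Apple, Apple_Pie
$exact  = @($names -match '^Apple$')       # Apple
```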


EBGreen's comment on the question provides the best explanation (lightly edited and emphasis added):

[...] by default, -match will return $True if the [RHS] pattern (regex) can be found anywhere in the string. If you want to find the string at certain positions within the string, use ^ to indicate the beginning of the string and $ to indicate the end of the string. To match the entire string, use both.

Applied to part of your code:

$Reg2 = '^[0-9]{4}-[0-9]{2}-[0-9]{2}[A-Z]{1}[0-9]{2}_[0-9]{2}_[0-9]{2}$'

# ...

$c -match $Reg2

Note the ^ at the start and the $ at the end to ensure that the entire input string must match.

Also note that I've omitted the [regex] cast, as it isn't necessary, given that -match can accept strings directly.
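
The same anchoring can be applied to the name lookup from the question. The following is only a sketch: the rows and the file name are made-up stand-ins for Paths.csv and the real directory, and wrapping the name in [regex]::Escape() assumes that the CSV holds literal names rather than regexes.

```powershell
# Hypothetical stand-in for the rows imported from Paths.csv (Name, Path).
$Paths = @(
  [pscustomobject]@{ Name = 'Apple';     Path = 'C:\Dest\Apple' }
  [pscustomobject]@{ Name = 'Apple_Pie'; Path = 'C:\Dest\Apple_Pie' }
)

# Hypothetical base name, including the date-time added by the other process.
$c = 'Apple_Pie_2018-07-18T15_07_00'

# Strip the date-time suffix, as in the question.
$c = $c -replace '_[0-9]{4}-[0-9]{2}-[0-9]{2}[A-Z][0-9]{2}_[0-9]{2}_[0-9]{2}$'

$target = foreach ($b in $Paths) {
  # Anchored at both ends, so 'Apple' can no longer match 'Apple_Pie'.
  if ($c -match ('^' + [regex]::Escape($b.Name) + '$')) { $b.Path }
}

$target  # C:\Dest\Apple_Pie
```

Because the names are literal in this sketch, $c -eq $b.Name would express the same whole-string comparison without a regex.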

On a related note, you can use assertion \b to modify substring matching so that matching only succeeds at word boundaries (where a word is defined as any nonempty run of letters, digits, and underscores); e.g. 'a10' -match 'a1' is true, but 'a10' -match 'a1\b' is not, because the 1 in the input string is not at the end of a word.
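
A small sketch of the word-boundary behavior, using made-up inputs. Note that because the underscore counts as a word character, \b alone would not tell Apple and Apple_Pie apart by accident:

```powershell
$plain      = 'a10' -match 'a1'               # True: plain substring match
$bounded    = 'a10' -match 'a1\b'             # False: '1' is followed by another word character
$space      = 'Apple Pie' -match 'Apple\b'    # True: a space ends the word
$underscore = 'Apple_Pie' -match 'Apple\b'    # False: '_' is a word character
```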

Note that using -match with a single string as the LHS (as opposed to an array) records the details of the most recent successful match in the automatic $Matches variable, which is a hash table whose 0 entry contains the entire match (the part of the input string that matched). If capture groups (subexpressions enclosed in (...)) were used in the regex, entry 1 contains what the 1st capture group captured, 2 what the 2nd one captured, and so on; named capture groups (e.g., (?<foo>...)) get entries by their name (e.g., foo).
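
To make the $Matches entries concrete, here is a minimal sketch with a made-up file name; the group name base is an arbitrary choice for the example:

```powershell
# Hypothetical file name with the date-time suffix described in the question.
$name = 'Apple_Pie_2018-07-18T15_07_00'

if ($name -match '^(?<base>.+)_(\d{4})-\d{2}-\d{2}T') {
  $whole = $Matches[0]     # Apple_Pie_2018-07-18T (the entire match)
  $base  = $Matches.base   # Apple_Pie (named capture group)
  $year  = $Matches[1]     # 2018 (first unnamed capture group)
}
```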

Also, instead of a wordy if / elseif construct for matching multiple regexes in sequence, you can use the switch statement with the -regex option:

Instead of:

if ($c -match $reg1) {
  $c = $c -replace $reg1
}
elseif ($c -match $reg2) {
  $c = $c -replace $reg2 
}

you could write more cleanly:

switch -regex ($c) {
  $reg1 { $c = $c -replace $reg1;    break }
  $reg2 { $c = $c -replace $reg2;    break }
  default { <# handles the case where nothing above matched #> }
}

break ensures that no further matching is performed.

  • switch's default matching (or with option -exact) works like the -eq operator (see below).

  • You can also make it perform wildcard-expression matching - like the -like operator (see below) - with the
    -wildcard option.

  • The -casesensitive option makes matching case-sensitive for any of the matching modes.

  • If the input is an array, matching is performed on each element; note that break then stops processing of further elements, whereas continue instantly proceeds to the next element.
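
The difference between break and continue with array input can be sketched as follows; the names are made-up sample data:

```powershell
$names = 'Apple', 'Apple_Pie', 'Cherry'

# continue: stop matching the current element and move on to the next one.
$withContinue = @(switch -regex ($names) {
  '^Apple$' { "exact: $_";  continue }
  '^Apple'  { "prefix: $_"; continue }
  default   { "other: $_" }
})
# exact: Apple, prefix: Apple_Pie, other: Cherry

# break: stop the whole switch statement, so later elements are never examined.
$withBreak = @(switch -regex ($names) {
  '^Apple$' { "exact: $_";  break }
  '^Apple'  { "prefix: $_"; break }
  default   { "other: $_" }
})
# exact: Apple
```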


Other methods of string matching in PowerShell:

-like allows you to match strings based on wildcard expressions.

Simply put, * matches any run of characters, including none, ? matches exactly 1 character, and [...] matches any one character in a specified set or range of characters.

Unlike -match, -like always matches the entire string, but note that wildcard expressions have fundamentally different syntax from regular expressions and are far less powerful - you cannot use -like and -match interchangeably.

Thus, to get substring matching, place a * on both ends of your expression; e.g.:

'ingot' -like '*go*'  # true
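
A few more wildcard cases, as a sketch with made-up inputs, showing the whole-string behavior and the three wildcard constructs mentioned above:

```powershell
$whole  = 'Apple_Pie' -like 'Apple'     # False: the whole string must match
$star   = 'Apple_Pie' -like 'Apple*'    # True: * matches the rest
$one    = 'file1'  -like 'file?'        # True: ? matches exactly one character
$two    = 'file10' -like 'file?'        # False: two characters remain
$range  = 'file7'  -like 'file[0-9]'    # True: [...] matches one character from the range
```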

-eq compares entire strings, literally (except for case variations).

Note that PowerShell has no literal substring-matching operator, but you can (somewhat clumsily) emulate one with -match and [regex]::Escape():

 'Cost: 7$.' -match [regex]::Escape('7$') # true

[regex]::Escape() escapes its argument so that its content is treated literally when interpreted as a regex (which the RHS of -match invariably is).
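
Escaping also matters for ordinary-looking file names, since characters such as . are regex metacharacters. A minimal sketch with a made-up name:

```powershell
$name = 'Report.v1'   # hypothetical literal name; '.' is a regex metacharacter

$unescaped = 'ReportXv1' -match $name                    # True: '.' matches any character
$escaped   = 'ReportXv1' -match [regex]::Escape($name)   # False: '.' is now literal
$pattern   = [regex]::Escape($name)                      # Report\.v1
```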

This is somewhat inefficient, given that a literal comparison doesn't need a regex to begin with.

Direct use of the .NET [string] type's .IndexOf() method is an option, but is also nontrivial; the following is the equivalent of the previous command:

 'Cost: 7$.'.IndexOf('7$', [StringComparison]::InvariantCultureIgnoreCase) -ne -1 # true

Note the need to use InvariantCultureIgnoreCase to match PowerShell's default behavior, and the need to compare to -1, given that the character index of where the substring starts is returned.

On the flip side, this method gives you more control over how matching is performed, via the other members of the [System.StringComparison] enumeration.

If you're looking for case-sensitive substring matching based on the current culture, you can simply rely on the default behavior of .IndexOf(); e.g.:

 'I am here.'.IndexOf('am') -ne -1 # true
 'I am here.'.IndexOf('AM') -ne -1 # false, because matching is case-sensitive
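
If the .IndexOf() approach is needed repeatedly, it could be wrapped in a small helper. The function name Test-LiteralSubstring and its parameter names are hypothetical, chosen only for this sketch; the default comparison mirrors PowerShell's operators as described above:

```powershell
# Hypothetical helper: literal substring test with a configurable comparison mode.
function Test-LiteralSubstring {
  param(
    [string] $InputString,
    [string] $Substring,
    [StringComparison] $Comparison = [StringComparison]::InvariantCultureIgnoreCase
  )
  # IndexOf() returns -1 when the substring is not found.
  $InputString.IndexOf($Substring, $Comparison) -ne -1
}

$r1 = Test-LiteralSubstring 'Cost: 7$.' '7$'            # True
$r2 = Test-LiteralSubstring 'I am here.' 'AM'           # True: case-insensitive by default
$r3 = Test-LiteralSubstring 'I am here.' 'AM' Ordinal   # False: ordinal, case-sensitive
```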


Finally, note that the Select-String cmdlet performs string matching in the pipeline, and it supports both regexes (by default) and literal substring matching (with the -SimpleMatch switch).

Unlike the comparison operators, Select-String outputs, for each matching input line, a match-information object of type [Microsoft.PowerShell.Commands.MatchInfo] that contains the original line plus metadata about the match.
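
A minimal sketch of both Select-String modes, using made-up input lines:

```powershell
$lines = 'Total: 7', 'Cost: 7$.', 'Free'

# Literal substring matching: '7$' is taken verbatim.
$literal = $lines | Select-String -Pattern '7$' -SimpleMatch
$literal.Line              # Cost: 7$.

# Regex matching (the default): '7$' means a 7 at the end of the line.
$regex = $lines | Select-String -Pattern '7$'
$regex.Line                # Total: 7
$regex.Matches[0].Value    # 7
```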

answered Sep 28 '22 at 15:09 by mklement0


I think your main problem is that you are using "match".

It checks if the right string is ... part of the left one, not whether it's an actual match as you would expect.

$a = "Test"
$b = "Test_me"

$a -match $b
False

$b -match $a
True

I would replace -match with -like.
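
A short sketch of what that swap does, with made-up inputs. Without wildcard characters, -like compares the whole string (still case-insensitively). One caveat, on the assumption that names could contain characters such as [ or ]: these are wildcard syntax and would need escaping, for which the WildcardPattern class offers an Escape() method.

```powershell
$partial = 'Test_me' -like 'Test'   # False: the whole string must match
$case    = 'Test' -like 'test'      # True: case-insensitive by default

# Square brackets are wildcard syntax, so a literal name containing them needs escaping.
$set     = 'a1'   -like 'a[1]'      # True: [1] matches the single character 1
$raw     = 'a[1]' -like 'a[1]'      # False: the pattern does not match the brackets literally
$escaped = 'a[1]' -like [System.Management.Automation.WildcardPattern]::Escape('a[1]')  # True
```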

answered Sep 28 '22 at 15:09 by BlueBunny