
PowerShell: Is `$matches` guaranteed to carry down the pipeline in sync with the pipeline variable?

Tags:

powershell

First, make some example files:

2010..2015 | % { "" | Set-Content "example $_.txt" }

#example 2010.txt                                                                          
#example 2011.txt                                                                          
#example 2012.txt                                                                          
#example 2013.txt                                                                          
#example 2014.txt                                                                          
#example 2015.txt

What I want to do is match the year with a regex capture group, then reference the match with $matches[1] and use it. I can write this to do both in one scriptblock, in one cmdlet, and it works fine:

gci *.txt | foreach { 

    if ($_ -match '(\d+)')       # regex match the year
    {                            # on the current loop variable
        $matches[1]              # and use the capture group immediately
    } 

}
#2010
#2011
#.. etc

I can also write this to do the match in one scriptblock, and then reference $matches in another cmdlet's scriptblock later on:

gci *.txt | where { 

    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | foreach {             # pipeline!

    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
}

This produces the same output and appears to work fine. But is it guaranteed to work, or is it undefined behaviour and a coincidence?

Could 'example 2012.txt' get matched and buffered, then 'example 2013.txt' get matched and buffered, so that by the time | foreach gets to work on 'example 2012.txt', $matches has already been updated with 2013 and the two are out of sync?

I can't make them fall out of sync - but I could still be relying on undefined behaviour.

(FWIW, I prefer the first approach for clarity and readability as well).

TessellatingHeckler asked Dec 01 '16 21:12


2 Answers

There is no synchronization going on, per se. The second example works because of the way the pipeline works. As each single object gets passed along by satisfying the condition in Where-Object, the -Process block in ForEach-Object immediately processes it, so $Matches hasn't yet been overwritten from any other -match operation.

If you were to do something that causes the pipeline to gather objects before passing them on, like sorting, you would be in trouble:

gci *.txt | where { 

    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | sort | foreach {             # pipeline!

    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
}

For example, the above should fail: it outputs n objects, but every one of them will be the capture from the very last match, because sort collects all of its input before foreach sees any of it.

So it's prudent not to rely on that, because it obscures the danger. Someone else (or you a few months later) may not think anything of inserting a sort and then be very confused by the result.

As TheMadTechnician pointed out in the comments, the placement changes things. Put the sort after the part where you reference $Matches (in the foreach), or before you filter with where, and it will still work as expected.

I think that drives home the point that it should be avoided, as it's fairly unclear. If the code changes in parts of the pipeline you don't control, then the behavior may end up being different, unexpectedly.
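One way to avoid depending on $Matches across pipeline stages is to copy the capture onto the object that travels down the pipeline. A minimal sketch, assuming plain strings in place of the files from the question; the Year property name is illustrative, not something from the original code:

```powershell
# Stand-ins for the file names from the question (assumption: strings, not FileInfo objects).
$names = 2010..2015 | ForEach-Object { "example $_.txt" }

# The capture is read in the same scriptblock that ran -match and is attached
# to the output object, so a collecting cmdlet such as Sort-Object placed
# later in the pipeline can no longer desynchronise it.
$years = $names |
    ForEach-Object {
        if ($_ -match '(\d+)') {
            # 'Year' is a hypothetical property name chosen for this sketch.
            [pscustomobject]@{ Name = $_; Year = [int]$Matches[1] }
        }
    } |
    Sort-Object -Property Year -Descending |
    ForEach-Object { $_.Year }

$years
```

Because each object carries its own value, the result no longer depends on when the downstream cmdlets happen to run.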


I like to throw in some verbose output to demonstrate this sometimes:

Original

gci *.txt | where { 
    "Where-Object: $_" | Write-Verbose -Verbose
    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | foreach {             # pipeline!
    "ForEach-Object: $_" | Write-Verbose -Verbose
    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
}

Sorted

gci *.txt | where { 
    "Where-Object: $_" | Write-Verbose -Verbose
    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | sort | foreach {             # pipeline!
    "ForEach-Object: $_" | Write-Verbose -Verbose
    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
}

The difference you'll see is that in the original, as soon as where "clears" an object, foreach gets it right away. In the sorted, you can see all of the wheres happening first, before foreach gets any of them.

sort doesn't have any verbose output so I didn't bother calling it that way, but essentially its Process {} block just collects all of the objects so it can compare (sort!) them, then spits them out in the End {} block.
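The same ordering can be captured without verbose output by logging each scriptblock call to a list. This is only a sketch: Get-StageOrder is a hypothetical helper name, and the integers 1..3 stand in for the files.

```powershell
# Hypothetical helper: records the order in which the where and foreach
# scriptblocks run, with or without a collecting Sort-Object in between.
function Get-StageOrder {
    param([switch]$WithSort)

    # List[string].Add returns nothing, so logging does not pollute the pipeline.
    $log = New-Object 'System.Collections.Generic.List[string]'

    if ($WithSort) {
        1..3 |
            Where-Object { $log.Add("where $_"); $true } |
            Sort-Object |
            ForEach-Object { $log.Add("foreach $_") }
    }
    else {
        1..3 |
            Where-Object { $log.Add("where $_"); $true } |
            ForEach-Object { $log.Add("foreach $_") }
    }

    $log -join ', '
}

Get-StageOrder            # where 1, foreach 1, where 2, foreach 2, where 3, foreach 3
Get-StageOrder -WithSort  # where 1, where 2, where 3, foreach 1, foreach 2, foreach 3
```

In the streaming case the two stages interleave object by object; with Sort-Object in the middle, every where call finishes before the first foreach call starts.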


More examples

First, here's a function that mocks Sort-Object's collection of objects (it doesn't actually sort them or do anything else with them):

function mocksort {
[CmdletBinding()]
param(
    [Parameter(
        ValueFromPipeline
    )]
    [Object]
    $O
)

    Begin {
        Write-Verbose "Begin (mocksort)"

        $objects = @()
    }

    Process {
        Write-Verbose "Process (mocksort): $O (nothing passed, collecting...)"

        $objects += $O
    }

    End {
        Write-Verbose "End (mocksort): returning objects"

        $objects
    }
}

Then, we can use that with the previous example and some sleep at the end:

gci *.txt | where { 
    "Where-Object: $_" | Write-Verbose -Verbose
    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | mocksort -Verbose | foreach {             # pipeline!
    "ForEach-Object: $_" | Write-Verbose -Verbose
    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
} | % { sleep -milli 500 ; $_ }

briantist answered Sep 28 '22 17:09


To complement briantist's great answer:

Aside from aggregating cmdlets such as Sort-Object (cmdlets that must collect all of their input before producing any output), the -OutBuffer common parameter can also break the command:

gci *.txt | where -OutBuffer 100 { 

    $_ -match '(\d+)'     # regex match here, in the Where scriptblock

} | foreach {             # pipeline!

    $matches[1]           # use $matches which was set in the previous 
                          # scriptblock, in a different cmdlet
}

This causes the where (Where-Object) cmdlet to buffer its first 100 output objects until the 101st object is generated, and only then send these 101 objects on. As a result, $matches[1] in the foreach (ForEach-Object) block will see only the 101st (matching) filename's capture-group value in each of the first 101 iterations.

Generally, with an -OutBuffer value of N, the first N + 1 foreach invocations would all see the same $matches value from the (N + 1)-th input object, and so forth for subsequent batches of N + 1 objects.

From Get-Help about_CommonParameters:

When you use this parameter, Windows PowerShell does not call the next cmdlet in the pipeline until the number of objects generated equals OutBuffer + 1. Thereafter, it sends all objects as they are generated.

Note that the last sentence suggests that only the first N + 1 objects are subject to buffering, which, however, is not true, as the following example (thanks, @briantist) demonstrates:

1..5 | % { Write-Verbose -vb $_; $_ } -OutBuffer 1 | % { "[$_]" }
VERBOSE: 1
VERBOSE: 2
[1]
[2]
VERBOSE: 3
VERBOSE: 4
[3]
[4]
VERBOSE: 5
[5]

That is, -OutBuffer 1 caused all objects output by % (ForEach-Object) to be batched in groups of 2, not just the first 2.
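To try this against $Matches without creating 101 files, a small buffer and a handful of in-memory strings should be enough. A sketch under those assumptions; the sample strings and the $samples and $seen variable names are illustrative and not from the question:

```powershell
# Four illustrative inputs, each with a distinct digit to capture.
$samples = 'a1', 'a2', 'a3', 'a4'

# -OutBuffer 1 makes Where-Object hold objects back and release them in batches,
# so ForEach-Object may read a $Matches that a later -match has already overwritten.
$seen = $samples |
    Where-Object -OutBuffer 1 { $_ -match '(\d)' } |
    ForEach-Object { "$_ -> $($Matches[1])" }

$seen
```

If batching behaves as described above, the lines for a1 and a2 should both report the digit captured from a2, and the lines for a3 and a4 the digit from a4. Removing -OutBuffer 1 should bring each name and digit back in step.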

mklement0 answered Sep 28 '22 18:09