
Design pattern for buffering pipeline input to PowerShell cmdlet

I occasionally encounter situations where it makes sense to support pipeline input to a cmdlet, but the operations I wish to perform (e.g. database access) are better batched over a sensible number of objects.

A typical way to achieve this appears to be something like the following:

function BufferExample {
<#
.SYNOPSIS
Example of filling and using an intermediate buffer.
#>
[CmdletBinding()]
param(
    [Parameter(ValueFromPipeline)]
    $InputObject
)

BEGIN {
    $Buffer = New-Object System.Collections.ArrayList(10)
    function _PROCESS {
        # Do something with a batch of items here.
        Write-Output "Element 1 of the batch is $($Buffer[1])"
        # This could be a high latency operation such as a DB call where we 
        # retrieve a set of rows by primary key rather than each one individually.

        # Then empty the buffer.
        $Buffer.Clear()
    }
}

PROCESS {
    # Accumulate into the buffer and process a full batch.
    [void]$Buffer.Add($InputObject)
    if ($Buffer.Count -eq $Buffer.Capacity) {
        _PROCESS
    }
}

END {
    # The buffer may be partially filled, so process the remainder.
    if ($Buffer.Count -gt 0) {
        _PROCESS
    }
}
}

Is there a less "boilerplate" way to do this?

One method may be to write the function I call "_PROCESS" here to accept array argument(s) but not pipeline input, and then expose to the user a proxy function, built as described in Proxy commands, that buffers the input and passes the buffer on.
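A minimal sketch of that split, using two invented names (Invoke-BatchOperation for the array-taking worker, Invoke-BufferedOperation for the buffering wrapper) and a hand-rolled wrapper rather than a generated proxy command:

# Hypothetical worker: takes a whole batch as an array parameter, no pipeline input.
function Invoke-BatchOperation {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [object[]]$Batch
    )
    # e.g. one DB round trip for the whole batch instead of one call per item.
    "Processing $($Batch.Count) items in one call"
}

# Hypothetical user-facing wrapper: buffers pipeline input and forwards full batches.
function Invoke-BufferedOperation {
    [CmdletBinding()]
    param(
        [int]$BatchSize = 10,

        [Parameter(ValueFromPipeline)]
        $InputObject
    )
    begin   { $buffer = [System.Collections.Generic.List[object]]::new() }
    process {
        $buffer.Add($InputObject)
        if ($buffer.Count -ge $BatchSize) {
            Invoke-BatchOperation -Batch $buffer.ToArray()
            $buffer.Clear()
        }
    }
    end {
        if ($buffer.Count -gt 0) {
            Invoke-BatchOperation -Batch $buffer.ToArray()
        }
    }
}

1..25 | Invoke-BufferedOperation -BatchSize 10
# Processing 10 items in one call
# Processing 10 items in one call
# Processing 5 items in one call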

Alternatively, I could dot-source dynamic code in the body of the cmdlet I wish to write to support this functionality; however, this seems error-prone and potentially hard to debug and understand.

asked Jan 04 '17 by user1383092

2 Answers

I made a re-usable cmdlet from your example.

function Buffer-Pipeline {
[CmdletBinding()]
param(
    $Size = 10,
    [Parameter(ValueFromPipeline)]
    $InputObject
)
    BEGIN {
        $Buffer = New-Object System.Collections.ArrayList($Size)
    }

    PROCESS {
        [void]$Buffer.Add($InputObject)

        if ($Buffer.Count -eq $Size) {
            $b = $Buffer
            $Buffer = New-Object System.Collections.ArrayList($Size)
            Write-Output -NoEnumerate $b
        }
    }

    END {
     # The buffer may be partially filled, so let's do something with the remainder.
        if ($Buffer.Count -ne 0) {  
            Write-Output -NoEnumerate $Buffer
        }
    }
}

Usage:

@(1;2;3;4;5) | Buffer-Pipeline -Size 3 | % { "$($_.Count) items: ($($_ -join ','))" }

Output:

3 items: (1,2,3)
2 items: (4,5)

Another example:

1,2,3,4,5 | Buffer-Pipeline -Size 10 | Measure-Object

Count    : 2

Processing each batch:

 1,2,3 | Buffer-Pipeline -Size 2 | % { 
    $_ | % { "Starting batch of $($_.Count)" } { $_ * 100 } { "Done with batch of $($_.Count)" }
}

Starting batch of 2
100
200
Done with batch of 2
Starting batch of 1
300
Done with batch of 1
answered by Fowl


The nature of the pipeline puts some constraints on (easily) doing this the way you'd like, mostly because a Process block is designed to receive (and process) a single object.

If you want to buffer all objects, that's fairly simple; you just collect all the objects in an array or other collection within your Process block and then do all the work in the End block, similar to the way a cmdlet like Sort-Object handles it.
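As a rough illustration of that shape (the function name here is made up), nothing happens per item and all the work runs once in End:

function Measure-AllInput {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        $InputObject
    )
    begin   { $all = [System.Collections.Generic.List[object]]::new() }
    process { $all.Add($InputObject) }   # only accumulate; no per-item work
    end {
        # Everything is available here, so set-based work (sorting, grouping,
        # a single bulk query, etc.) can run exactly once.
        "Received $($all.Count) objects"
    }
}

1..5 | Measure-AllInput   # -> Received 5 objects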

In the case of buffering for the sake of the underlying resource, like a web-based API, or your example of DB access, I think the approach you take will need to be situation-specific. There's unlikely to be a great general way to achieve it.


One of those approaches is to split the operation into 2 (maybe more?) functions.

For example, I wrote some functions to send metrics to Graphite. I split the functionality between Format-Graphite and Out-Graphite; the former generates a properly formatted metric string based on the parameters and the pipeline, while the latter sends string(s) to the Graphite collector. This lets the client code be a bit more versatile in how it gets and generates its data, because it can pipe to Format-Graphite, or make individual calls to it, without worrying that the network portion will be inefficient, and it doesn't have to collect its own data manually just to avoid that. It's not the best example without code to demo, but I can't post that code right now.
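Since that code can't be posted, the following is only a guess at the shape of the split; the parameter names, the default collector address, and all of the internals are invented for illustration, not the author's actual implementation:

function Format-Graphite {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string]$Path,

        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [double]$Value,

        [Parameter(ValueFromPipelineByPropertyName)]
        [datetime]$Timestamp = (Get-Date)
    )
    process {
        # One Graphite plaintext-protocol line per input: "path value unix-timestamp"
        $unix = ([System.DateTimeOffset]$Timestamp).ToUnixTimeSeconds()
        "$Path $Value $unix"
    }
}

function Out-Graphite {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string]$MetricLine,

        [string]$CollectorHost = 'graphite.example.com',   # invented default
        [int]$Port = 2003
    )
    begin   { $lines = [System.Collections.Generic.List[string]]::new() }
    process { $lines.Add($MetricLine) }
    end {
        # One connection and one send for however many lines were piped in.
        $client = [System.Net.Sockets.TcpClient]::new($CollectorHost, $Port)
        try {
            $writer = [System.IO.StreamWriter]::new($client.GetStream())
            foreach ($line in $lines) { $writer.WriteLine($line) }
            $writer.Flush()
            $writer.Dispose()
        }
        finally {
            $client.Dispose()
        }
    }
}

# e.g. [pscustomobject]@{ Path = 'app.requests.count'; Value = 42 } | Format-Graphite | Out-Graphite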


Another approach, for things where the "expensive" part of a single operation is the initialization and teardown code, is to just do that stuff in Begin and End and then use Process normally.

For example, making dozens of database calls over a single connection you established in Begin may not be so bad, and may even be preferable to something like building a big SQL string and sending it all at once.
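A rough sketch of that shape, assuming the classic System.Data.SqlClient provider is available (as in Windows PowerShell) and using an invented table and query:

function Get-CustomerName {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string]$ConnectionString,

        [Parameter(Mandatory, ValueFromPipeline)]
        [int]$CustomerId
    )
    begin {
        # Expensive setup runs once, not once per pipeline item.
        $connection = [System.Data.SqlClient.SqlConnection]::new($ConnectionString)
        $connection.Open()
        $command = $connection.CreateCommand()
        $command.CommandText = 'SELECT Name FROM dbo.Customer WHERE Id = @Id'
        [void]$command.Parameters.Add('@Id', [System.Data.SqlDbType]::Int)
    }
    process {
        # Cheap per-item work reuses the open connection and prepared command.
        $command.Parameters['@Id'].Value = $CustomerId
        $command.ExecuteScalar()
    }
    end {
        $command.Dispose()
        $connection.Dispose()
    }
}

One caveat: if the pipeline is stopped early (for example by Select-Object -First), the End block is not guaranteed to run, so long-lived resources like this sometimes need more defensive cleanup.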


Ultimately I think you might be better off looking at each individual use case and determining the best approach for your needs, balancing performance/efficiency and ease/intuitiveness of invoking the code.

If you have a specific use case and post a question about that, I'd like to read it; send me a link.

answered by briantist