
Split a string with spaces on a new line in PowerShell

Tags:

powershell

I'm working on a PowerShell script where I take an input of a long string (from a CSV file) in the format:

Group One Name
Group Two Name
Group Three Name
...

I'm trying to parse it with

($entry.'Group Name').split("`n ") | %{
    if ($_) {
        # Do something with the group name
        $_
    }
}

I want to get output like:

Group One Name
Group Two Name
Group Three Name
...

But it comes out as:

Group
One
Name
Group
Two
...

Asked by Caspian Peavyhouse on Dec 10 '22 at 13:12


1 Answer

By accepting Bacon Bits' helpful answer you've indicated that it solves your problem, but that still leaves the question of what you meant to happen when you passed "`n " (a 2-character PowerShell string) to the [string] class's .Split() method.

This answer makes the case for routinely using PowerShell's own -split operator instead of the .Split() method, because it:

  • uses regular PowerShell operator syntax
  • offers many more features
  • has fewer surprises
  • provides long-term behavioral stability

There are key differences between -split and the .Split() method:

  • By default, -split uses regular expressions to specify the split criterion; use the 'SimpleMatch' option as the 3rd RHS argument to use literal strings instead; by contrast, the .Split() method only accepts literal strings.

  • There's also a unary form of -split that splits by any runs of whitespace and ignores leading and trailing whitespace, similar to awk's default behavior; this is equivalent to calling '...'.Split([string[]] $null, 'RemoveEmptyEntries')

  • -split is case-insensitive by default (as is typical in PowerShell); use the -csplit form for case-sensitive matching; by contrast, .Split() is invariably case-sensitive.

  • -split accepts an array-valued LHS, returning a concatenation of the token arrays resulting from splitting the LHS's elements.

  • -split implicitly converts the LHS to string(s); by contrast, .Split() can only be invoked on something that already is a [string].

Note: Both -split and .Split() allow you to limit the number of tokens returned with an optional 2nd argument, which only splits part of the input string, reporting the rest of the input string in the last element of the return array.

For the full story, see Get-Help about_Split.
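
The differences listed above can be seen side by side in a short sketch. The input strings and variable names are made up for illustration; only the -split and -csplit operators and the 'SimpleMatch' option come from the discussion above.

```powershell
# Regex by default: '.' matches any character, so every character acts as a
# separator and only empty strings remain.
$regexTokens = 'a.b.c' -split '.'

# SimpleMatch treats '.' as a literal dot (0 = return all tokens).
$literalTokens = 'a.b.c' -split '.', 0, 'SimpleMatch'

# Case-insensitive by default; -csplit is the case-sensitive form.
$anyCase  = 'aXbxc' -split 'x'     # a, b, c
$sameCase = 'aXbxc' -csplit 'x'    # aXb, c

# Array-valued LHS: the tokens of all elements come back as one flat array.
$flat = ('a b', 'c d') -split ' '  # a, b, c, d

# Non-string LHS: the number is converted to a string before splitting.
$fromNumber = 1234 -split '3'      # 12, 4
```

The first two lines are the ones most likely to matter in practice: a separator that contains regex metacharacters needs either escaping or 'SimpleMatch'.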

The .Split() method has one advantage, though: it is faster than the -split operator; so, if .Split()'s features are sufficient in a given scenario, you can speed things up with it.

Examples:

Note: The regex examples below use single-quoted strings and represent LF characters with the regex escape sequence \n rather than the `n escape sequence that PowerShell expands in double-quoted strings. Specifying regular expressions as single-quoted strings is preferable, because it avoids confusion between what PowerShell expands up front and what -split ends up seeing.

  • Split by any one of a set of characters, specified as a regular expression: "`n" (LF) and also " " (single space):

    • "one two`n three four" -split '[\n ]' yields the equivalent of
      @( 'one', 'two', '', 'three', 'four' )
  • Split by a string, specified as a regular expression: "`n ":

    • "one two`n three four" -split '\n ' yields the equivalent of
      @( 'one two', 'three four' )
  • Split by a string literal: "`n ", using the SimpleMatch option:

    • "one two`n three four" -split "`n ", 0, 'SimpleMatch' yields the same as above; note that 0 is the number-of-tokens-to-return argument, which must be specified for syntax reasons here; 0 indicates that all tokens should be returned.
  • Use capture groups ((...)) in the separator regex to include (parts of) separators in the result array:

    • 'a/b' -split '(/)' yields the equivalent of @('a', '/', 'b')
    • Alternatively, use a positive lookahead assertion ((?=...)) to make the separators part of the elements: 'a/b/c' -split '(?=/)' yields the equivalent of
      @( 'a', '/b', '/c' )
  • Limit the number of tokens:

    • 'one two three four' -split ' ', 3 yields the equivalent of
      @( 'one', 'two', 'three four' ), i.e., the 3rd token received the remainder of the input string.

    • Caveat: elements that are (parts of) separators captured via a capture group in the separator regex do not count toward the specified limit; e.g.,
      'a/b/c' -split '(/)', 2 yields @( 'a', '/', 'b/c' ), i.e. 3 elements in total.

  • Split by any run of whitespace (unary form):

    • -split "`n one `n`n two `t `t three`n`n" yields the equivalent of
      @( 'one', 'two', 'three' )
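
Applied to the scenario in the question, a minimal sketch might look like the following. It assumes that the group names in the CSV field are separated by LF or CRLF line breaks; the $entry object here is a hypothetical stand-in for one row returned by Import-Csv, and $groups is an illustrative name.

```powershell
# Hypothetical stand-in for one CSV row; in the real script $entry comes from the CSV file.
$entry = [pscustomobject] @{
    'Group Name' = "Group One Name`nGroup Two Name`r`nGroup Three Name`n"
}

# Split on line breaks only (LF or CRLF), so the spaces inside each name survive.
# Trim() removes stray whitespace around a name; Where-Object drops empty tokens,
# such as the one produced by the trailing line break.
$groups = $entry.'Group Name' -split '\r?\n' |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ }

$groups
```

Because the separator is a regular expression that contains no space, each group name stays intact as a single token.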

String.Split()-method pitfalls:

Having access to the .NET Framework's methods when needed is a wonderful option that allows you to do in PowerShell most of what compiled .NET languages can do.
However, there are things that PowerShell has to do behind the scenes that are typically helpful, but can also be pitfalls:

For instance, in Windows PowerShell, 'foo'.Split("`n ") causes PowerShell to implicitly convert the string "`n " to a character array ([char[]]) before calling .Split() (the closest match among the method overloads), which may be unexpected.

Your intent may have been to split by the string "`n ", but the method overload invoked ended up interpreting your string as a set of individual characters, any one of which splits the input.
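
The following sketch reproduces the symptom from the question. The sample text is an assumption modeled on the question's input, and the explicit [char[]] cast spells out the conversion described above so that the result does not depend on which overload a given PowerShell edition selects.

```powershell
# Sample text modeled on the question's input (an assumption, not the real CSV data).
$text = "Group One Name`nGroup Two Name"

# Splitting by a character SET: LF and space each act as a separator,
# so every word becomes its own token.
$byChars = $text.Split([char[]] "`n ")   # Group, One, Name, Group, Two, Name

# Splitting by the line break alone keeps each group name whole.
$byLine = $text -split '\n'              # Group One Name, Group Two Name
```

The first result matches the unwanted output shown in the question; the second is what was intended.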

Incidentally, the cross-platform PowerShell Core edition has an additional .Split() overload that directly takes a [string] argument, so the same call behaves differently there.

This changing behavior outside the control of PowerShell is in itself a good reason to prefer PowerShell-only solutions - for an explanation of why such changes are outside PowerShell's control, see this GitHub issue.

You can avoid such pitfalls by explicit typing, but that is both cumbersome and easy to forget.
Case in point:

In Windows PowerShell, if you truly wanted to split by string "`n ", this is what you'd need to do:

PS> "one`n two".Split([string[]] "`n ", 'None')
one
two

Note the necessary cast to [string[]] - even though only one string is passed - and the required use of the option parameter (None).

Conversely, if you wanted to split by a set of characters in PowerShell Core:

PS> "one`ntwo three".Split([char[]] "`n ")
one
two
three

Without the [char[]] cast, "`n " would be considered a single string to split by.

Answered by mklement0 on Jan 19 '23 at 07:01