I started with Project Gutenberg's "The Complete Works of William Shakespeare by William Shakespeare", a UTF-8 text file available from http://www.gutenberg.org/ebooks/100. In PowerShell, I ran
Get-Content -Tail 50 $filename | Sort-Object -CaseSensitive
which, I believe, piped the last 50 lines (i.e., strings delimited by line breaks) of the file to Sort-Object, which was configured to sort alphabetically with strings beginning with lowercase letters before strings beginning with uppercase letters.
Why is the output in the following image (especially in the P's) not sorted according to the -CaseSensitive switch? What is a solution?
Note: This answer focuses on the general case of sorting entire strings (by all of their characters, not just by the first one).
You're looking for ordinal sorting, where characters are sorted numerically by their Unicode code point ("ASCII value") and therefore all uppercase letters, as a group, sort before all lowercase letters.
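To make the ordinal ordering concrete, here is a small sketch that prints the code points of four sample letters (the variable name $letters is only for illustration). Both uppercase letters have lower code points than both lowercase ones, which is why they sort first in an ordinal sort.

```powershell
# Illustrative only: show the code point of each sample letter.
$letters = 'B', 'b', 'A', 'a'
$letters | ForEach-Object { '{0} = U+{1:X4}' -f $_, [int][char] $_ }
# B = U+0042
# b = U+0062
# A = U+0041
# a = U+0061
```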
As of Windows PowerShell v5.1 / PowerShell Core v7.0, Sort-Object invariably uses lexical sorting[1] (using the invariant culture by default, but this can be changed with the -Culture parameter), where case-sensitive sorting simply means that the lowercase form of a given letter comes directly before its uppercase form, not all letters collectively; e.g., b sorts before B, but they both come after both a and A (also, the logic is reversed from the ordinal case, where it is uppercase letters that come first):
PS> 'B', 'b', 'A', 'a' | Sort-Object -CaseSensitive
a
A
b
B
There is a workaround, however, which (a) sorts uppercase letters before lowercase ones and (b) comes at the expense of performance. (Enhancing Sort-Object to also support ordinal sorting is being discussed in a GitHub issue.)
# PSv4+ syntax
# Note: Uppercase letters come first.
PS> 'B', 'b', 'A', 'a' |
Sort-Object { -join ([int[]] $_.ToCharArray()).ForEach('ToString', 'x4') }
A
B
a
b
The solution maps each input string to a string composed of the 4-digit hexadecimal representations of the characters' code points, e.g. 'aB' becomes '00610042', representing code points 0x61 and 0x42; comparing these representations is then equivalent to sorting the string by its characters' code points.
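The mapping can be checked in isolation. The sketch below wraps the sort-key expression in a script block ($toKey is a hypothetical name, not part of the solution) and confirms that comparing the keys agrees with an ordinal comparison of the original strings.

```powershell
# $toKey is a hypothetical helper: it builds the same hex key that the
# Sort-Object script block above computes for each input string.
$toKey = {
  param([string] $s)
  -join ([int[]] $s.ToCharArray()).ForEach('ToString', 'x4')
}

& $toKey 'aB'   # 00610042
& $toKey 'Ab'   # 00410062

# The key for 'Ab' sorts before the key for 'aB' ...
(& $toKey 'Ab') -lt (& $toKey 'aB')          # True
# ... which matches what an ordinal comparison of the strings reports.
[string]::CompareOrdinal('Ab', 'aB') -lt 0   # True
```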
Alternatively, you can use .NET types directly to sort ordinally:
# Get the last 50 lines as a list.
[Collections.Generic.List[string]] $lines = Get-Content -Tail 50 $filename
# Sort the list in place, using ordinal sorting
$lines.Sort([StringComparer]::Ordinal)
# Output the result.
# Note that uppercase letters come first.
$lines
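[StringComparer]::Ordinal returns an object that implements the [System.Collections.Generic.IComparer[string]] interface (as well as the non-generic [System.Collections.IComparer]), which is what the list's .Sort() method accepts.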
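As a quick check of what the comparer does, this sketch calls it directly and then uses it on a four-element sample list (the names $ordinal and $sample are illustrative only). A negative result means the first argument sorts first.

```powershell
$ordinal = [StringComparer]::Ordinal

# Ordinal: 'B' (0x42) sorts before 'a' (0x61).
[Math]::Sign($ordinal.Compare('B', 'a'))                             # -1
# Culture-sensitive (lexical): 'B' sorts after 'a'.
[Math]::Sign([StringComparer]::InvariantCulture.Compare('B', 'a'))   # 1

# The same comparer applied to a small sample list.
[Collections.Generic.List[string]] $sample = 'B', 'b', 'A', 'a'
$sample.Sort($ordinal)
$sample   # A, B, a, b
```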
Using this solution in a pipeline is possible, but requires sending the array of lines as a single object through the pipeline, which is what -ReadCount 0 does:
Get-Content -Tail 50 $filename -ReadCount 0 | ForEach-Object {
($lines = [Collections.Generic.List[string]] $_).Sort([StringComparer]::Ordinal)
$lines # output the sorted lines
}
Note: As stated, this sorts uppercase letters first.
To sort all lowercase letters first, you need to implement custom sorting by way of a [System.Comparison[string]] delegate, which in PowerShell can be implemented as a script block ({ ... }) that accepts two input strings and returns their sorting ranking (-1 (or any negative value) for less-than, 0 for equal, 1 (or any positive value) for greater-than):
$lines.Sort({ param([string]$x, [string]$y)
# Determine the shorter of the two lengths.
$count = if ($x.Length -lt $y.Length) { $x.Length } else { $y.Length }
# Loop over all characters in corresponding positions.
for ($i = 0; $i -lt $count; ++$i) {
if ([char]::IsLower($x[$i]) -ne [char]::IsLower($y[$i])) {
# Sort all lowercase chars. before uppercase ones.
return (1, -1)[[char]::IsLower($x[$i])]
} elseif ($x[$i] -ne $y[$i]) { # compare code points (numerically)
return $x[$i] - $y[$i]
}
# So far the two strings compared equal, continue.
}
# The strings compared equal in all corresponding character positions,
# so the difference in length, if any, is the decider (longer strings sort
# after shorter ones).
return $x.Length - $y.Length
})
Note: For English text, the above should work fine, but in order to support all Unicode text, potentially containing surrogate code-unit pairs and differing normalization forms (composed vs. decomposed accented characters), even more work is needed.
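For a self-contained way to try the lowercase-first ordering, the sketch below stores an equivalent comparison in a variable and applies it to a small sample list. The names $lowerFirst and $sample are hypothetical, and the body is a compact restatement of the same logic, not a different algorithm.

```powershell
# Hypothetical variable holding a lowercase-first comparison delegate.
$lowerFirst = [Comparison[string]] {
  param([string] $x, [string] $y)
  $count = [Math]::Min($x.Length, $y.Length)
  for ($i = 0; $i -lt $count; ++$i) {
    $xLower = [char]::IsLower($x[$i])
    $yLower = [char]::IsLower($y[$i])
    if ($xLower -ne $yLower) {
      # A lowercase char. sorts before a non-lowercase one.
      if ($xLower) { return -1 } else { return 1 }
    }
    if ([int] $x[$i] -ne [int] $y[$i]) {
      # Same case class: compare code points numerically.
      return ([int] $x[$i]) - ([int] $y[$i])
    }
  }
  # Equal in all shared positions: the shorter string sorts first.
  return $x.Length - $y.Length
}

[Collections.Generic.List[string]] $sample = 'B', 'b', 'A', 'a', 'ab', 'aB'
$sample.Sort($lowerFirst)
$sample   # a, ab, aB, b, A, B
```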
[1] On Windows, so-called word sorting is performed by default: "Certain non-alphanumeric characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that coop and co-op appear next to each other in a sorted list."; on Unix-like platforms, string sorting is the default, where no special weights apply to non-alphanumeric chars. See the docs.