Sort in alphabetical order with lowercase before uppercase?

I started with Project Gutenberg's "The Complete Works of William Shakespeare by William Shakespeare", a UTF-8 text file available from http://www.gutenberg.org/ebooks/100. In PowerShell, I ran

Get-Content -Tail 50 $filename | Sort-Object -CaseSensitive

which - I believe - piped the last 50 lines (i.e., strings delimited by line breaks) of the file to Sort-Object, which was configured to sort alphabetically with strings beginning with lowercase letters before strings beginning with uppercase letters.

Why is the output in the following image (especially in the P's) not sorting according to the -CaseSensitive switch? What is a solution?

Link to Sort-Output Picture

Tom Lever asked Dec 23 '22 06:12


1 Answer

Note: This answer focuses on the general case of sorting entire strings (by all of their characters, not just by the first one).

You're looking for ordinal sorting, where characters are sorted numerically by their Unicode code point ("ASCII value") and therefore all uppercase letters, as a group, sort before all lowercase letters.

As of Windows PowerShell v5.1 / PowerShell Core v7.0, Sort-Object invariably uses lexical sorting[1], using the invariant culture by default (this can be changed with the -Culture parameter). With lexical sorting, -CaseSensitive simply means that the lowercase form of a given letter sorts directly before its uppercase form; it does not place all lowercase letters, as a group, before all uppercase ones. For example, b sorts before B, but both come after both a and A. Note also that the logic is reversed from the ordinal case, where it is the uppercase letters that come first:

PS> 'B', 'b', 'A', 'a' | Sort-Object -CaseSensitive
a
A
b
B

There is a workaround, however, which (a) sorts uppercase letters before lowercase ones and (b) comes at the expense of performance. Two notes before the code:

  • For better performance via direct ordinal sorting, you need to use the .NET framework directly; see the .NET section below, which also offers a solution for sorting the lowercase letters first.
  • Enhancing Sort-Object to also support ordinal sorting is being discussed in this GitHub issue.

The workaround itself:

# PSv4+ syntax
# Note: Uppercase letters come first.
PS> 'B', 'b', 'A', 'a' |
      Sort-Object { -join ([int[]] $_.ToCharArray()).ForEach('ToString', 'x4') } 
A
B
a
b

The solution maps each input string to a string composed of the 4-digit hex. representations of the characters' code points, e.g. 'aB' becomes '00610042', representing code points 0x61 and 0x42; comparing these representations is then equivalent to sorting the string by its characters' code points.
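
To make the mapping concrete, here is a minimal sketch that computes the key in isolation. The function name Get-OrdinalSortKey is hypothetical (it is not a built-in command), and the sketch uses the -f format operator instead of the .ForEach() method, on the assumption that both produce the same keys:

```powershell
# Hypothetical helper: builds the sort key described above.
function Get-OrdinalSortKey {
  param([string] $Text)
  # Format each character's code point as 4 lowercase hex digits, then concatenate.
  -join $(foreach ($ch in $Text.ToCharArray()) { '{0:x4}' -f [int] $ch })
}

# 'a' is U+0061 and 'B' is U+0042, so the key is '00610042'.
$key = Get-OrdinalSortKey 'aB'

# Sorting by the key puts uppercase letters first: A, B, a, b.
$sorted = 'B', 'b', 'A', 'a' | Sort-Object { Get-OrdinalSortKey $_ }
```

Because every code unit is padded to the same width, comparing two keys position by position mirrors comparing the original strings code point by code point.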


Use of .NET for direct, better-performing ordinal sorting:

# Get the last 50 lines as a list.
[Collections.Generic.List[string]] $lines = Get-Content -Tail 50 $filename

# Sort the list in place, using ordinal sorting
$lines.Sort([StringComparer]::Ordinal)

# Output the result.
# Note that uppercase letters come first.
$lines

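[StringComparer]::Ordinal returns an object that implements both the [System.Collections.IComparer] interface and its generic counterpart, [System.Collections.Generic.IComparer[string]], which is what the list's .Sort() method accepts.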
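
As a small illustration of what the comparer does, the sketch below calls its Compare() method directly. The variable names are made up for this example; only the sign of the result is meaningful, so it is reduced with [Math]::Sign():

```powershell
$ordinal = [StringComparer]::Ordinal

# 'B' is U+0042 and 'a' is U+0061, so 'B' sorts first: the result is negative.
$upperVsLower = $ordinal.Compare('B', 'a')
# Identical strings compare as equal: the result is 0.
$same = $ordinal.Compare('a', 'a')

[Math]::Sign($upperVsLower)   # -1
[Math]::Sign($same)           # 0
```

This is why ordinal sorting places all uppercase ASCII letters before all lowercase ones.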

Using this solution in a pipeline is possible, but requires sending the array of lines as a single object through the pipeline, which -ReadCount 0 provides:

Get-Content -Tail 50 $filename -ReadCount 0 | ForEach-Object { 
  ($lines = [Collections.Generic.List[string]] $_).Sort([StringComparer]::Ordinal)
  $lines # output the sorted lines 
}
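
To see the effect of -ReadCount 0 on its own, the following sketch counts how often the ForEach-Object script block runs with and without it. The scratch file and all variable names are assumptions made purely for the demonstration:

```powershell
# Hypothetical scratch file with four lines, created only for this demonstration.
$tmp = [IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value 'b', 'B', 'a', 'A'

# Default behavior: the script block runs once per line.
$perLine = 0
Get-Content -LiteralPath $tmp | ForEach-Object { $perLine++ }

# With -ReadCount 0: the script block runs once, and $_ is the array of all lines.
$batched = 0
$lineCount = 0
Get-Content -LiteralPath $tmp -ReadCount 0 | ForEach-Object {
  $batched++
  $lineCount = $_.Count
}

# Clean up the scratch file.
Remove-Item -LiteralPath $tmp
```

With the four-line file, $perLine should end up as 4, while $batched should be 1 and $lineCount 4, which is what allows the whole array to be cast to a list and sorted in one step.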

Note: As stated, this sorts uppercase letters first.


To sort all lowercase letters first, you need to implement custom sorting by way of a [System.Comparison[string]] delegate, which in PowerShell can be implemented as a script block ({ ... }) that accepts two input strings and returns their relative sort order: a negative value (e.g. -1) if the first string sorts before the second, 0 if they are equal, and a positive value (e.g. 1) if the first sorts after the second:

$lines.Sort({ param([string]$x, [string]$y)
  # Determine the shorter of the two lengths.
  $count = if ($x.Length -lt $y.Length) { $x.Length } else { $y.Length }
  # Loop over all characters in corresponding positions.
  for ($i = 0; $i -lt $count; ++$i) {
    if ([char]::IsLower($x[$i]) -ne [char]::IsLower($y[$i])) {
      # Sort all lowercase chars. before uppercase ones.
      return (1, -1)[[char]::IsLower($x[$i])]
    } elseif ($x[$i] -ne $y[$i]) { # compare code points (numerically)
      return $x[$i] - $y[$i]
    }
    # So far the two strings compared equal, continue.
  }
  # The strings compared equal in all corresponding character positions,
  # so the difference in length, if any, is the decider (longer strings sort
  # after shorter ones).
  return $x.Length - $y.Length
})
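
As a usage sketch, the same logic can be stored in a variable and tried on a small list. The variable names and sample words are made up for illustration, and the body is a slightly restructured variant of the script block above (it compares code points as integers), not a different algorithm:

```powershell
# Lowercase-first comparison, stored in a hypothetically named variable.
[Comparison[string]] $lowerFirst = {
  param([string] $x, [string] $y)
  $count = [Math]::Min($x.Length, $y.Length)
  for ($i = 0; $i -lt $count; ++$i) {
    $xLower = [char]::IsLower($x[$i])
    $yLower = [char]::IsLower($y[$i])
    if ($xLower -ne $yLower) {
      # Exactly one of the two characters is lowercase; it sorts first.
      if ($xLower) { return -1 } else { return 1 }
    }
    # Same case category: compare the code points numerically.
    $cx = [int] $x[$i]
    $cy = [int] $y[$i]
    if ($cx -ne $cy) { return $cx - $cy }
  }
  # Equal in all shared positions: the shorter string sorts first.
  return $x.Length - $y.Length
}

$words = [Collections.Generic.List[string]] ('Pear', 'apple', 'Apple', 'pear', 'app')
$words.Sort($lowerFirst)
$result = $words -join ', '
$result   # app, apple, pear, Apple, Pear
```

Note that List.Sort() is not a stable sort, which does not matter here because no two sample words compare as equal.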

Note: For English text, the above should work fine, but in order to support all Unicode text, which may contain surrogate code-unit pairs and differing normalization forms (composed vs. decomposed accented characters), even more work is needed.


[1] On Windows, so-called word sorting is performed by default: "Certain non-alphanumeric characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that coop and co-op appear next to each other in a sorted list."; on Unix-like platforms, string sorting is the default, where no special weights apply to non-alphanumeric chars. - see the docs.

mklement0 answered Jan 14 '23 07:01
