Say I have a sorted list of strings as in:
['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
Now I want to sort based on the trailing numerical value for the B
s - so I have:
['A', 'B' , 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
One possible algorithm would be to hash up a regex like regex = re.compile(ur'(B)(\d*))
, find the indices of the first and last B
, slice the list, sort the slice using the regex's second group, then insert the sorted slice. However this seems too much of a hassle. Is there a way to write a key function that "leaves the item in place" if it does not match the regex and only
sorts the items (sublists) that match ?
Note: the above is just an example; I don't necessarily know the pattern (or I may want to also sort C's, or any string that has a trailing number in there). Ideally, I'm looking for an approach to the general problem of sorting only subsequences which match a given criterion (or failing that, just those that meet the specific criterion of a given prefix followed by a string of digits).
You can use the natsort module:
>>> from natsort import natsorted
>>>
>>> a = ['A', 'B' , 'B1', 'B11', 'B2', 'B21', 'B22', 'C', 'C1', 'C11', 'C2']
>>> natsorted(a)
['A', 'B', 'B1', 'B2', 'B11', 'B21', 'B22', 'C', 'C1', 'C2', 'C11']
In the simple case where you just want to sort trailing digits numerically and their non-digit prefixes alphabetically, you need a key function which splits each item into non-digit and digit components as follows:
'AB123' -> ['AB', 123]
'CD' -> ['CD']
'456' -> ['', 456]
Note: In the last case, the empty string
''
is not strictly necessary in CPython 2.x, as integers sort before strings – but that's an implementation detail rather than a guarantee of the language, and in Python 3.x it is necessary, because strings and integers can't be compared at all.
You can build such a key function using a list comprehension and re.split()
:
import re
def trailing_digits(x):
return [
int(g) if g.isdigit() else g
for g in re.split(r'(\d+)$', x)
]
Here it is in action:
>>> s1 = ['11', '2', 'A', 'B', 'B1', 'B11', 'B2', 'B21', 'C', 'C11', 'C2']
>>> sorted(s1, key=trailing_digits)
['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
Once you add the restriction that only strings with a particular prefix or prefixes have their trailing digits sorted numerically, things get a little more complicated.
The following function builds and returns a key function which fulfils the requirement:
def prefixed_digits(*prefixes):
disjunction = '|'.join('^' + re.escape(p) for p in prefixes)
pattern = re.compile(r'(?<=%s)(\d+)$' % disjunction)
def key(x):
return [
int(g) if g.isdigit() else g
for g in re.split(pattern, x)
]
return key
The main difference here is that a precompiled regex is created (containing a lookbehind constructed from the supplied prefix or prefixes), and a key function using that regex is returned.
Here are some usage examples:
>>> s2 = ['A', 'B', 'B11', 'B2', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B', 'C'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C2', 'C11', 'D12', 'D2']
>>> sorted(s2, key=prefixed_digits('B', 'D'))
['A', 'B', 'B2', 'B11', 'B21', 'C', 'C11', 'C2', 'D2', 'D12']
If called with no arguments, prefixed_digits()
returns a key function which behaves identically to trailing_digits
:
>>> sorted(s1, key=prefixed_digits())
['2', '11', 'A', 'B', 'B1', 'B2', 'B11', 'B21', 'C', 'C2', 'C11']
Caveats:
Due to a restriction in Python's re
module regarding lookbhehind syntax, multiple prefixes must have the same length.
In Python 2.x, strings which are purely numeric will be sorted numerically regardless of which prefixes are supplied to prefixed_digits()
. In Python 3, they'll cause an exception (except when called with no arguments, or in the special case of key=prefixed_digits('')
– which will sort purely numeric strings numerically, and prefixed strings alphabetically). Fixing that may be possible with a significantly more complex regex, but I gave up trying after about twenty minutes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With