I'm processing strings like this: "125A12C15"
I need to split them at boundaries between letters and numbers, e.g. this one should become ["125","A","12","C","15"].
Is there a more elegant way to do this in Python than going through it position by position and checking whether it's a letter or a number, and then concatenating accordingly? E.g. a built-in function or module for this kind of thing?
Thanks for any pointers!
Use itertools.groupby together with the str.isalpha method:
Docstring:
groupby(iterable[, keyfunc]) -> create an iterator which returns (key, sub-iterator) grouped by each value of key(value).
Docstring:
S.isalpha() -> bool
Return True if all characters in S are alphabetic and there is at least one character in S, False otherwise.
In [1]: from itertools import groupby
In [2]: s = "125A12C15"
In [3]: [''.join(g) for _, g in groupby(s, str.isalpha)]
Out[3]: ['125', 'A', '12', 'C', '15']
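For use outside an interactive session, the same idea can be wrapped in a small function. This is only a sketch: the name split_letters_digits is made up for illustration. Note that str.isalpha only distinguishes alphabetic from non-alphabetic characters, so punctuation and spaces would end up grouped together with the digits.

```python
from itertools import groupby


def split_letters_digits(text):
    """Split text into maximal runs of alphabetic / non-alphabetic characters.

    Hypothetical helper name; groupby starts a new group every time
    str.isalpha changes its answer from one character to the next.
    """
    return [''.join(group) for _, group in groupby(text, key=str.isalpha)]


print(split_letters_digits("125A12C15"))  # ['125', 'A', '12', 'C', '15']
print(split_letters_digits("AB12"))       # ['AB', '12']
print(split_letters_digits(""))           # []
```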
Or possibly re.findall or re.split from the regular expressions module:
In [4]: import re
In [5]: re.findall(r'\d+|\D+', s)
Out[5]: ['125', 'A', '12', 'C', '15']
In [6]: re.split(r'(\d+)', s)  # note: you may have to filter out the empty strings at the start/end if using re.split
Out[6]: ['', '125', 'A', '12', 'C', '15', '']
In [7]: re.split(r'(\D+)', s)
Out[7]: ['125', 'A', '12', 'C', '15']
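Whether re.split leaves empty strings at the ends depends on whether the string starts or ends with a match of the capturing group, so it can be safer to filter them out unconditionally. A minimal sketch, where split_on_digits is a made-up helper name:

```python
import re


def split_on_digits(text):
    """Split text at digit runs, keeping the digit runs in the result.

    Hypothetical helper name. The capturing group makes re.split return
    the separators (the digit runs) as well; empty strings that appear
    at the start or end are dropped.
    """
    parts = re.split(r'(\d+)', text)
    return [part for part in parts if part]


print(split_on_digits("125A12C15"))  # ['125', 'A', '12', 'C', '15']
print(split_on_digits("A12C"))       # ['A', '12', 'C']
```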
As for performance, the regex approaches were faster in these timings, roughly four to six times faster than groupby on this input:
In [8]: %timeit re.findall(r'\d+|\D+', s*1000)
100 loops, best of 3: 2.15 ms per loop
In [9]: %timeit [''.join(g) for _, g in groupby(s*1000, str.isalpha)]
100 loops, best of 3: 8.5 ms per loop
In [10]: %timeit re.split(r'(\d+)', s*1000)
1000 loops, best of 3: 1.43 ms per loop
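%timeit is an IPython magic, so to repeat the comparison in a plain script you could use the standard timeit module instead. The sketch below makes a few assumptions: the function names are invented for illustration, the repetition count is arbitrary, and the absolute numbers will differ from those above depending on machine and Python version.

```python
import re
import timeit
from itertools import groupby

s = "125A12C15" * 1000


def with_findall():
    return re.findall(r'\d+|\D+', s)


def with_groupby():
    return [''.join(g) for _, g in groupby(s, str.isalpha)]


def with_split():
    # Filter out the empty strings re.split leaves at the start/end.
    return [p for p in re.split(r'(\d+)', s) if p]


runs = 10  # arbitrary repetition count, kept small so the script is quick
for func in (with_findall, with_groupby, with_split):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1000:.2f} ms per call")
```

All three functions return the same list for this input, so the comparison is like for like.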