I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of characters can differ, and sometimes there can be a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".
Is there any smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices of where it is split, or just get the indices, without looping through every string? If the dash is between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.
Any idea?
To split a string on a regular expression (regex) rather than on a fixed separator, use split() from the re module. re.split() takes the regex pattern as its first argument and the target string as its second.
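As a minimal sketch of re.split(), the snippet below splits on runs of consecutive digits. The sample string and the variable name parts are made up for illustration.

```python
import re

# Split wherever one or more consecutive digits occur.
# The digits act as separators and are not kept in the result.
parts = re.split(r"\d+", "a1bb22ccc333d")
print(parts)  # ['a', 'bb', 'ccc', 'd']
```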
To split a string into words with a regex, split the target string at white-space characters using the \s special sequence. Adding the + metacharacter after \s gives the pattern \s+, which splits the target string on each run of one or more whitespace characters.
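The difference between \s and \s+ shows up when the input has consecutive whitespace. A small sketch, using a made-up sample string:

```python
import re

text = "one  two\tthree\nfour"  # note the double space after "one"

# \s splits at every single whitespace character, so the double
# space produces an empty string between "one" and "two".
single = re.split(r"\s", text)
print(single)  # ['one', '', 'two', 'three', 'four']

# \s+ treats each run of whitespace as one separator.
multi = re.split(r"\s+", text)
print(multi)   # ['one', 'two', 'three', 'four']
```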
Python string split() method: the split() method splits a string into a list. You can specify the separator; the default separator is any whitespace. Note: when maxsplit is specified, the list will contain at most the specified number of elements plus one.
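A short illustration of str.split() with and without maxsplit; the sample string is an assumption chosen for the example.

```python
s = "a b c d"

# Default: split on any whitespace, no limit on the number of splits.
all_parts = s.split()
print(all_parts)      # ['a', 'b', 'c', 'd']

# maxsplit=2 performs at most two splits, giving at most three elements.
limited = s.split(" ", 2)
print(limited)        # ['a', 'b', 'c d']
```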
Regular expression MatchObject results include the indices of the match. What remains is to match repeating characters:
import re
repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')
would match only if a given letter character (a-z) is repeated at least once:
>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
...
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30
The .start() and .end() methods on the match result give you the exact positions in the input string.
A dash that directly follows a run is included in the match, but non-repeating characters are not matched at all:
>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
...
bb- 2 5
cccccccc 5 13
If you want the a- part to be a match, simply replace the + with a * multiplier:
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
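As a sketch of how the * variant could be used to collect both the pieces and their indices in one pass, the helper below wraps finditer(). The function name split_runs is hypothetical, not part of the re module.

```python
import re

# Same pattern as above: a letter, zero or more repeats of it,
# then an optional trailing dash.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

def split_runs(s):
    """Return (substring, start, end) for each run found in s.

    split_runs is a hypothetical helper name used for illustration.
    """
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(s)]

runs = split_runs("a-bb-cccccccc")
print(runs)  # [('a-', 0, 2), ('bb-', 2, 5), ('cccccccc', 5, 13)]
```

With this pattern the dash always ends up attached to the run on its left, which keeps the handling consistent across strings.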
What about using itertools.groupby?
>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
This will put each - into its own substring, which can easily be filtered out.
>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
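The groupby version above returns only the substrings. If the indices are needed as well, one possible extension is to keep a running position while iterating over the groups. This is a sketch; the function name runs_with_indices is hypothetical.

```python
from itertools import groupby

def runs_with_indices(s):
    """Return (substring, start, end) for each run in s, skipping dashes.

    runs_with_indices is a hypothetical helper name used for illustration.
    """
    result = []
    pos = 0  # index in s where the current group starts
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # number of characters in this run
        if char != '-':
            result.append((char * length, pos, pos + length))
        pos += length  # dashes still advance the position
    return result

indexed = runs_with_indices('aaaaa-bbbbbbbbbbbbbb-ccccccccccc')
print(indexed)
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 21, 32)]
```

Here the dashes belong to neither neighbour; the gap between one run's end and the next run's start shows where a dash was dropped.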