Python split string by pattern

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of characters in each run can differ, and sometimes there is a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices where it is split, or just get the indices, without looping through every string? If the dash is between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.

Any idea?

Trollbrot asked Apr 18 '13 15:04



2 Answers

Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:

import re

repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only if a given letter (a-z) is repeated at least once, optionally followed by a dash:

>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
... 
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30

The .start() and .end() methods on the match result give you the exact positions in the input string.
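If you want the pieces and their boundaries as lists rather than printed output, a minimal sketch could collect them from the same iterator. The names `parts` and `bounds` are only illustrative, and `.span()` is used as a shorthand for the `(start, end)` pair:

```python
import re

# Same pattern as above: a letter repeated at least once, optional trailing dash.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')
s = "aaaaabbbbbbbbbbbbbbccccccccccc"

matches = list(repeat.finditer(s))
parts = [m.group() for m in matches]   # the matched substrings
bounds = [m.span() for m in matches]   # (start, end) index pairs

print(parts)   # ['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
print(bounds)  # [(0, 5), (5, 19), (19, 30)]
```

The start of each pair is the index where a new run begins, so `[start for start, end in bounds]` gives the split positions directly.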

A trailing dash is included in the match, but characters that are not repeated are skipped:

>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
... 
bb- 2 5
cccccccc 5 13

If you want the a- part to be a match, simply replace the + with a * multiplier:

repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
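As a quick check of the `*` variant, here is a small sketch run against the same "a-bb-cccccccc" input. The variable names `repeat_star` and `found` are just for illustration:

```python
import re

# '*' instead of '+': a single, non-repeated letter now matches too.
repeat_star = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

found = [(m.group(), m.start(), m.end())
         for m in repeat_star.finditer("a-bb-cccccccc")]
print(found)  # [('a-', 0, 2), ('bb-', 2, 5), ('cccccccc', 5, 13)]
```

With this pattern the dash always stays with the run to its left, which keeps the handling consistent.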
Martijn Pieters answered Oct 11 '22 18:10


What about using itertools.groupby?

>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This puts each - into its own substring, which can easily be filtered out:

>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
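groupby does not report positions by itself, but since the question also asks for indices, one possible sketch is to keep a running offset while walking the groups. The helper name `runs_with_indices` is hypothetical, not something from itertools:

```python
from itertools import groupby

def runs_with_indices(s):
    """Return (run, start, end) for each run of identical characters in s."""
    result = []
    pos = 0
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # consume the group to count its size
        result.append((char * length, pos, pos + length))
        pos += length
    return result

s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
runs = [r for r in runs_with_indices(s) if r[0][0] != '-']
print(runs)
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 21, 32)]
```

Here the dashes are dropped after the offsets are computed, so the indices still refer to positions in the original string.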
mgilson answered Oct 11 '22 18:10