
Python split string by pattern

Tags: python, string, regex, split

I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc". The number of characters can differ, and sometimes there is a dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".

Is there a smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices where it is split, or just get the indices, without looping through every string? If the dash is between two patterns, it can end up in either the left or the right one, as long as it is always handled the same way.

Any ideas?

asked Apr 18 '13 15:04

Trollbrot

2 Answers

Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:

import re

repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

This matches only if a lowercase letter (a-z) is repeated at least once, that is, if it appears two or more times in a row:

>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
... 
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30

The .start() and .end() methods on the match result give you the exact positions in the input string.
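
As a minimal sketch of how those positions could be collected in one pass (assuming Python 3; the helper name split_runs is hypothetical and not part of the re module), each match can be turned into a (substring, start, end) tuple:

```python
import re

# Same pattern as above: a lowercase letter repeated at least once,
# optionally followed by a single dash.
repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')

def split_runs(s):
    """Return (substring, start, end) for every match in s (hypothetical helper)."""
    return [(m.group(), m.start(), m.end()) for m in repeat.finditer(s)]

runs = split_runs("aaaaabbbbbbbbbbbbbbccccccccccc")
print(runs)
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 5, 19), ('ccccccccccc', 19, 30)]

dashed = split_runs("aaaaa-bbbbbbbbbbbbbbccccccccccc")
print(dashed)
# [('aaaaa-', 0, 6), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 20, 31)]
```

With this pattern the dash always ends up in the run on its left, which satisfies the "always handled the same" requirement from the question.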

A dash that directly follows a run is included in that match, but single (non-repeating) characters are skipped:

>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
... 
bb- 2 5
cccccccc 5 13

If you want the a- part to be a match, simply replace the + with a * multiplier:

repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
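
To illustrate the difference, here is a short sketch (assuming Python 3; the name repeat_any is only chosen for this example) that runs the * variant over the same input and uses span() to get the start and end in one call:

```python
import re

# Variant with '*': a single letter now counts as a run of length one.
repeat_any = re.compile(r'(?P<start>[a-z])(?P=start)*-?')

spans = [(m.group(), m.span()) for m in repeat_any.finditer("a-bb-cccccccc")]
print(spans)
# [('a-', (0, 2)), ('bb-', (2, 5)), ('cccccccc', (5, 13))]
```

Note that a dash is only picked up when it directly follows a letter run; a leading dash would simply be skipped by finditer.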

answered Oct 11 '22 18:10

Martijn Pieters

What about using itertools.groupby?

>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']

This puts each - into its own substring, which can easily be filtered out:

>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
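
groupby by itself does not report positions, but they can be recovered by keeping a running offset. The following is a rough sketch under that assumption (Python 3; the function name runs_with_indices and its skip parameter are made up for this example):

```python
from itertools import groupby

def runs_with_indices(s, skip='-'):
    """Return (substring, start, end) for each run of equal characters,
    leaving out runs of the `skip` character (hypothetical helper)."""
    result = []
    pos = 0  # running offset into s
    for key, group in groupby(s):
        length = sum(1 for _ in group)  # consume the group to count it
        if key != skip:
            result.append((key * length, pos, pos + length))
        pos += length  # advance past this run, skipped or not
    return result

indexed = runs_with_indices('aaaaa-bbbbbbbbbbbbbb-ccccccccccc')
print(indexed)
# [('aaaaa', 0, 5), ('bbbbbbbbbbbbbb', 6, 20), ('ccccccccccc', 21, 32)]
```

Unlike the regular expression approach, the dashes here belong to neither neighbour, so the reported ranges leave a gap at each dash position.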

answered Oct 11 '22 18:10

mgilson
