Efficiently split a string using multiple separators and retaining each separator?

I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need the separators to remain in the output list, in between the items they separated in the string.

For example,

"Now is the winter of our discontent"

should output:

['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

I'm not sure how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?

asked Nov 01 '12 21:11

Louis Thibault

2 Answers

A different non-regex approach from the others:

    >>> import string
    >>> from itertools import groupby
    >>>
    >>> special = set(string.punctuation + string.whitespace)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>>
    >>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
    >>> split_combined
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    >>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
    >>> split_separated
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']

Could use dict.fromkeys and .get instead of the lambda, I guess.
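For reuse, the same idea can be wrapped in a small function. This is only a sketch: the name `split_keep_separators` and the `combine` flag are made up for illustration and are not part of the answer above. Note that with `combine=False` this variation emits every separator character on its own, whereas `split_separated` above still merges a run of identical separators (the two spaces after 'two' stay together as '  ').

```python
import string
from itertools import groupby

# Every punctuation and whitespace character counts as a separator.
SEPARATORS = frozenset(string.punctuation + string.whitespace)


def split_keep_separators(text, combine=True):
    """Split text at separator characters, keeping the separators.

    combine=True  -> a run of adjacent separators becomes one item.
    combine=False -> every separator character becomes its own item.
    (Hypothetical helper, named here for illustration only.)
    """
    pieces = []
    for is_sep, group in groupby(text, key=lambda c: c in SEPARATORS):
        if is_sep and not combine:
            pieces.extend(group)            # one item per separator character
        else:
            pieces.append(''.join(group))   # whole run as a single item
    return pieces


print(split_keep_separators("Now is the winter of our discontent"))
# ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
print(split_keep_separators("a,  b"))
# ['a', ',  ', 'b']
print(split_keep_separators("a,  b", combine=False))
# ['a', ',', ' ', ' ', 'b']
```

The whole string is walked once, so there are no nested loops over the separator list; membership in a `frozenset` is a constant-time check on average.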

[edit]

Some explanation:

groupby accepts two arguments, an iterable and an (optional) key function. It loops through the iterable and groups consecutive items by the value the key function returns for them:

    >>> groupby("sentence", lambda c: c in 'nt')
    <itertools.groupby object at 0x9805af4>
    >>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
    [(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]

where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
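To make the sorting caveat concrete, here is a small illustration with made-up data (the word list and the `first_letter` key are just for the example). For the splitting problem itself no sort is needed, since grouping only neighbouring characters is exactly what we want.

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]   # illustrative data only


def first_letter(word):
    return word[0]


# Unsorted input: groupby only merges neighbours, so each key shows up twice.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]
print(unsorted_groups)
# [('a', ['apple']), ('b', ['bob']), ('a', ['avocado']), ('b', ['banana'])]

# Sorting by the same key first brings equal keys next to each other.
by_letter = sorted(words, key=first_letter)
sorted_groups = [(k, list(g)) for k, g in groupby(by_letter, key=first_letter)]
print(sorted_groups)
# [('a', ['apple', 'avocado']), ('b', ['bob', 'banana'])]
```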

As @JonClements guessed, what I had in mind was

    >>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>> [''.join(g) for k,g in groupby(s, special.get)]
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']

for the case where we were combining the separators. .get returns None if the key isn't in the dict, so separators map to True and every other character maps to None.
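A quick way to see what `special.get` does as a key function is to print the keys next to the groups. The second half is a guess at how the same trick could cover the "separated" case: a dict mapping each separator to itself (the name `per_char` is hypothetical), which behaves like the `c if c in special else False` lambda except that non-separators get None instead of False.

```python
import string
from itertools import groupby

separators = string.punctuation + string.whitespace
s = "tab\t end"

# Combined case: every separator maps to True, everything else to None.
special = dict.fromkeys(separators, True)
combined = [(k, ''.join(g)) for k, g in groupby(s, special.get)]
print(combined)
# [(None, 'tab'), (True, '\t '), (None, 'end')]

# Separated case (assumed variation): each separator maps to itself, so
# different separators get different keys and land in different groups.
per_char = {c: c for c in separators}
separated = [''.join(g) for k, g in groupby(s, per_char.get)]
print(separated)
# ['tab', '\t', ' ', 'end']
```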

answered Oct 07 '22 21:10

DSM

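    import re
    import string

    # Either a run of non-separators or a run of separators.
    p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
        string.punctuation + string.whitespace)))

    # print() with parentheses works on both Python 2 and Python 3.
    print(p.findall("Now is the winter of our discontent"))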

I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.

I'll explain the regexp since you're not familiar with it:

[...] means any of the characters inside the square brackets
[^...] means any of the characters not inside the square brackets
+ after an item means one or more of that item
x|y means to match either x or y

So the regexp matches runs of one or more characters that are either all separators (punctuation or whitespace) or all non-separators. The findall method finds all non-overlapping matches of the pattern, scanning the string from left to right.
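One thing to be aware of: because of the second `+`, adjacent separators such as ", " come back as a single item. If each separator should be its own list item, dropping that `+` appears to be enough. A rough sketch of both patterns side by side (the names `combined` and `separate` and the sample text are mine, not from the answer):

```python
import re
import string

# Escape the separators so characters like ']' and '^' are safe inside [...].
sep_class = re.escape(string.punctuation + string.whitespace)

# Same pattern as above: runs of non-separators, or runs of separators.
combined = re.compile("[^{0}]+|[{0}]+".format(sep_class))
# Assumed variation: no '+' on the separator class, so one separator per match.
separate = re.compile("[^{0}]+|[{0}]".format(sep_class))

text = "Hello, world!  Bye"
combined_parts = combined.findall(text)
separate_parts = separate.findall(text)

print(combined_parts)
# ['Hello', ', ', 'world', '!  ', 'Bye']
print(separate_parts)
# ['Hello', ',', ' ', 'world', '!', ' ', ' ', 'Bye']
```

Either way, joining the resulting list with `''.join(...)` gives back the original string, which is a handy sanity check.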

answered Oct 07 '22 21:10

Lauritz V. Thaulow
