I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.
Furthermore, I need for the separators to remain in the output list, in between the items they separated in the string.
For example,
"Now is the winter of our discontent"
should output:
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
I'm not sure how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?
In JavaScript, use the String.split() method to split a string on multiple separators, e.g. str.split(/[-_]+/). The method accepts a regular expression, so a character class can list every separator to split on.
To split a string on runs of whitespace, pass split() a regular expression, e.g. str.trim().split(/\s+/). The regular expression splits the string on one or more whitespace characters and returns an array containing the substrings. Note that the separators themselves are discarded unless the pattern is wrapped in a capturing group.
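Since the question is about Python, the closest equivalent is re.split: when the pattern is wrapped in a capturing group, the separators are returned along with the pieces. What follows is only a minimal sketch, assuming each separator character should become its own list item; the name split_keep is made up for illustration.

```python
import re
import string

# One capturing group around a character class of every separator character.
SEPARATORS = re.compile(
    "([{}])".format(re.escape(string.punctuation + string.whitespace))
)

def split_keep(text):
    """Split text on single separator characters, keeping each separator."""
    # The capturing group makes re.split return the separators as well.
    # Adjacent separators (and a separator at either end) produce empty
    # strings, so those are filtered out.
    return [piece for piece in SEPARATORS.split(text) if piece]

print(split_keep("Now is the winter of our discontent"))
print(split_keep("Hello, world!"))
```

Dropping the empty strings is a design choice here; keep them if you need to tell "two separators in a row" apart from "separator, text, separator".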
A different non-regex approach from the others:
    >>> import string
    >>> from itertools import groupby
    >>>
    >>> special = set(string.punctuation + string.whitespace)
    >>> s = "One two three tab\ttabandspace\t end"
    >>>
    >>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
    >>> split_combined
    ['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    >>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
    >>> split_separated
    ['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
Could use dict.fromkeys and .get instead of the lambda, I guess.
[edit]
Some explanation:
groupby accepts two arguments, an iterable and an (optional) keyfunction. It loops through the iterable and groups consecutive items by the value of the keyfunction:
    >>> groupby("sentence", lambda c: c in 'nt')
    <itertools.groupby object at 0x9805af4>
    >>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
    [(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
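To make that pitfall concrete, here is a small sketch with made-up data (the word list and the first_letter helper are purely illustrative). Grouping unsorted input yields the same key more than once; sorting by the same key first makes equal keys contiguous.

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]

def first_letter(word):
    return word[0]

# Unsorted input: "a" and "b" each show up as two separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first puts equal keys next to each other.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), first_letter)
]

print(unsorted_groups)
print(sorted_groups)
```

For the splitting problem in this question, the unsorted behaviour is exactly what we want: the order of the characters has to be preserved.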
As @JonClements guessed, what I had in mind was
    >>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
    >>> s = "One two three tab\ttabandspace\t end"
    >>> [''.join(g) for k,g in groupby(s, special.get)]
    ['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
for the case where we were combining the separators. .get returns None if the key isn't in the dict, so every non-separator character gets the same key (None) and every separator gets True.
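If you want both behaviours behind one call, the groupby approach can be wrapped up roughly as follows. This is just a sketch; the function name split_keep_separators and its combine flag are my own invention, not anything from the standard library.

```python
import string
from itertools import groupby

SPECIAL = set(string.punctuation + string.whitespace)

def split_keep_separators(text, combine=True):
    """Split text on punctuation/whitespace, keeping the separators.

    combine=True  -> any run of separators becomes a single item.
    combine=False -> a new item starts whenever the separator character changes.
    """
    if combine:
        def key(c):
            return c in SPECIAL
    else:
        def key(c):
            # Separators are keyed by the character itself; everything else is False.
            return c if c in SPECIAL else False
    return [''.join(group) for _, group in groupby(text, key)]

s = "One two three tab\ttabandspace\t end"
print(split_keep_separators(s))
print(split_keep_separators(s, combine=False))
```

One caveat with combine=False: a run of the same separator, such as two spaces in a row, still comes back as one item, because both characters map to the same key.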
    import re
    import string

    # Match either a run of non-separator characters or a run of separators.
    p = re.compile("[^{0}]+|[{0}]+".format(re.escape(string.punctuation + string.whitespace)))
    print(p.findall("Now is the winter of our discontent"))
I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.
I'll explain the regexp since you're not familiar with it:
[...] means any of the characters inside the square brackets
[^...] means any of the characters not inside the square brackets
+ behind means one or more of the previous thing
x|y means to match either x or y
So the regexp matches a run of one or more characters that are either all punctuation or whitespace, or all something else. The findall method finds all non-overlapping matches of the pattern, in order.
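Here is the same pattern run on a string that actually contains punctuation (the sample sentence is my own variation on the one in the question). Since every character belongs to one of the two alternatives, joining the pieces should give back the original string, which makes for a cheap sanity check.

```python
import re
import string

separators = re.escape(string.punctuation + string.whitespace)
pattern = re.compile("[^{0}]+|[{0}]+".format(separators))

text = "Now, is the winter... of our discontent!"
pieces = pattern.findall(text)
print(pieces)

# Nothing is dropped: the pieces concatenate back to the input.
assert "".join(pieces) == text
```

Note that this version lumps consecutive separators together, so ", " and "... " each come back as a single item.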