I want to split a string on any combination of delimiters I provide. For example, if the string is:
s = 'This, I think,., کباب MAKES , some sense '
And the delimiters are \. (period), , (comma), and \s (whitespace). However, I want to capture all delimiters except whitespace (\s). The output should be:
['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
My solution so far uses the re module:
import re
pattern = r'([.,\s]+)'
re.split(pattern, s)
However, this captures whitespace as well. I have tried other patterns such as [(\.)(,)\s]+, but they don't work.
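To make the problem concrete, here is a small sketch of what the first pattern returns for the sample string (the variable name parts is only illustrative). Each delimiter run comes back with its surrounding whitespace attached, and whitespace-only runs are returned as items of their own:

```python
import re

s = 'This, I think,., کباب MAKES , some sense '

# The capturing group makes re.split() return the delimiter runs too,
# but each run still includes the whitespace matched by the class.
parts = re.split(r'([.,\s]+)', s)
print(parts)
# ['This', ', ', 'I', ' ', 'think', ',., ', 'کباب', ' ', 'MAKES', ' , ',
#  'some', ' ', 'sense', ' ', '']
```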
Edit: @PadraicCunningham made an astute observation. For a delimiter run like the one in 'Some text ,. , some more text', I'd only want to remove the leading and trailing whitespace from ,. , and not the whitespace within it.
The following approach is probably the simplest:
import re

s = 'This, I think,., کباب MAKES , some sense '
pattern = r'([.,\s]+)'
# strip each piece and drop the ones that are empty after stripping
tokens = [i.strip() for i in re.split(pattern, s) if i.strip()]
The output:
['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
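If the set of delimiters varies, the same idea could be wrapped in a small helper. This is only a sketch: the function name split_keep_delims and its delimiters parameter are made up for illustration, and it assumes that whitespace is always a separator to be dropped and that at least one delimiter is given.

```python
import re

def split_keep_delims(text, delimiters):
    """Split text on runs of the given delimiters and whitespace.

    Hypothetical helper: delimiter runs are kept (minus their leading
    and trailing whitespace); whitespace-only runs are dropped.
    Assumes delimiters is a non-empty list of non-empty strings.
    """
    # re.escape() lets callers pass characters such as '.' literally.
    alternatives = '|'.join(re.escape(d) for d in delimiters)
    pattern = r'((?:' + alternatives + r'|\s)+)'
    # Strip each piece and drop the ones that are empty afterwards.
    return [p.strip() for p in re.split(pattern, text) if p.strip()]

print(split_keep_delims('This, I think,., کباب MAKES , some sense ', ['.', ',']))
# ['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
print(split_keep_delims('Some text ,. , some more text', ['.', ',']))
# ['Some', 'text', ',. ,', 'some', 'more', 'text']
```

Because strip() only touches the ends of each piece, the whitespace inside a run such as ,. , is left alone.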
NOTE: Following the new edit to the question, I've improved my old regex. The new one is quite long, but trust me, it works!
I suggest the pattern below as the delimiter for re.split():
(?<![,\.\ ])(?=[,\.]+)|(?<=[,\.])(?![,\.\ ])|(?<=[,\.])\ +(?![,\.\ ])|(?<![,\.\ ])\ +(?=[,\.][,\.\ ]+)|(?<![,\.\ ])\ +(?![,\.\ ])
My workaround doesn't require any pre- or post-processing of the spaces. What makes the regex work is the order of the alternatives joined with or (|). My rough strategy is that any alternative dealing with a leading space is evaluated last.
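As a usage sketch, assuming Python 3.7 or later (where re.split() also splits on zero-width matches), the pattern can be applied as follows. The names DELIM and tokenize are just for illustration, and the final filter drops the empty string that a trailing space leaves behind.

```python
import re

# The same five alternatives as above, one per line, in the same order.
DELIM = re.compile(
    r'(?<![,\.\ ])(?=[,\.]+)'             # zero-width: word ends, punctuation starts
    r'|(?<=[,\.])(?![,\.\ ])'             # zero-width: punctuation ends, word starts
    r'|(?<=[,\.])\ +(?![,\.\ ])'          # spaces between punctuation and a word
    r'|(?<![,\.\ ])\ +(?=[,\.][,\.\ ]+)'  # spaces between a word and a delimiter run
    r'|(?<![,\.\ ])\ +(?![,\.\ ])'        # spaces between two words
)

def tokenize(text):
    # Drop empty strings produced at the edges of the input.
    return [t for t in DELIM.split(text) if t]

print(tokenize('This, I think,., کباب MAKES , some sense '))
print(tokenize('Some text ,. , some more text'))
```

With the second input, the run ,. , should come back as a single item with its inner space intact.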
Additional
In a comment, @revo provided a shorter version of my pattern:
\s+(?=[^.,\s])|\b(?:\s+|(?=[,.]))|(?<=[,.])\b
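A quick way to try the shorter pattern, again assuming Python 3.7 or later so that re.split() honours zero-width matches. SHORT_DELIM and tokenize_short are illustrative names only. For str patterns, \b and \s are Unicode-aware in Python 3, so the Persian word is treated as word characters.

```python
import re

# The shorter pattern, unchanged.
SHORT_DELIM = re.compile(r'\s+(?=[^.,\s])|\b(?:\s+|(?=[,.]))|(?<=[,.])\b')

def tokenize_short(text):
    # Drop empty strings produced at the edges of the input.
    return [t for t in SHORT_DELIM.split(text) if t]

print(tokenize_short('This, I think,., کباب MAKES , some sense '))
print(tokenize_short('Some text ,. , some more text'))
```

Unlike the longer pattern, which only matches literal spaces, this one uses \s and so also treats tabs and newlines as whitespace separators.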