
How to split up a string on multiple delimiters but only capture some?

Tags: python, regex

I want to split a string on any combination of delimiters I provide. For example, if the string is:

s = 'This, I think,., کباب MAKES , some sense '

The delimiters are \. (period), , (comma), and \s (whitespace). I want to keep every delimiter in the output except whitespace, so the result should be:

['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']

My solution so far uses the re module:

import re
pattern = r'([\.,\s]+)'
re.split(pattern, s)

However, this captures whitespace as well. I have tried other patterns such as [(\.)(,)\s]+, but they don't work.
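A quick check of why the bracketed variant returns no delimiters at all, offered as a sketch: inside a character class [...] the parentheses are literal characters rather than a capturing group, so re.split has no group whose text it could return. The sample strings and variable names below are made up for illustration.

```python
import re

# Inside [...] the characters ( and ) are literals, so this pattern has no capturing group.
bracketed = re.compile(r'[(\.)(,)\s]+')
# Wrapping the whole class in (...) creates one capturing group.
grouped = re.compile(r'([\.,\s]+)')

group_counts = (bracketed.groups, grouped.groups)

# With no group, re.split discards the delimiters (and also splits on literal parentheses).
no_delims = bracketed.split('a(b)c, d')
# With a group, the matched delimiter text is returned between the pieces.
with_delims = grouped.split('a, d')

print(group_counts)   # (0, 1)
print(no_delims)      # ['a', 'b', 'c', 'd']
print(with_delims)    # ['a', ', ', 'd']
```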

Edit: @PadraicCunningham made an astute observation. For a delimiter run such as the ,. , in Some text ,. , some more text, I only want to strip the leading and trailing whitespace from ,. , and keep the whitespace inside it.

hazrmard asked Sep 25 '16 19:09


2 Answers

The following approach is probably the simplest:

import re

s = 'This, I think,., کباب MAKES , some sense '
pattern = r'([\.,\s]+)'
# Split on delimiter runs, strip each piece, and drop pieces that were whitespace only
splitted = [i.strip() for i in re.split(pattern, s) if i.strip()]

The output:

['This', ',', 'I', 'think', ',.,', 'کباب', 'MAKES', ',', 'some', 'sense']
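As a sketch, here is the same idea wrapped in a helper (the name split_keep_punct is made up here) and run on the string from the question's edit. Because strip() only touches the two ends of each piece, the whitespace inside ,. , survives.

```python
import re

def split_keep_punct(text):
    """Split on runs of '.', ',' and whitespace; keep punctuation runs, drop whitespace-only ones."""
    pieces = re.split(r'([.,\s]+)', text)
    # strip() trims only the ends, so inner whitespace of a delimiter run is kept.
    return [p.strip() for p in pieces if p.strip()]

tokens = split_keep_punct('Some text ,. , some more text')
print(tokens)  # ['Some', 'text', ',. ,', 'some', 'more', 'text']
```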
RomanPerekhrest answered Oct 14 '22 15:10


NOTE: Following the latest edit to the question, I've improved my old regex. The new one is quite long, but it works.

I suggest the pattern below as the delimiter for re.split():

(?<![,\.\ ])(?=[,\.]+)|(?<=[,\.])(?![,\.\ ])|(?<=[,\.])\ +(?![,\.\ ])|(?<![,\.\ ])\ +(?=[,\.][,\.\ ]+)|(?<![,\.\ ])\ +(?![,\.\ ])

This approach doesn't require stripping whitespace before or after the split. What makes the regex work is the order of the alternatives joined with |. My rough strategy is that any alternative dealing with leading spaces is placed last, so it is tried last.

See DEMO
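A minimal sketch of using this pattern from Python, assuming Python 3.7 or later, where re.split started splitting on empty (zero-width) matches; the first two alternatives match only a position. The names DELIM, edit_tokens, and tail_tokens are illustrative. A string that ends in a space produces a trailing empty piece, so empty pieces are filtered out.

```python
import re

# Assumes Python 3.7+: earlier versions do not split on zero-width matches.
DELIM = re.compile(
    r'(?<![,\.\ ])(?=[,\.]+)'              # position not after , . or space, right before punctuation
    r'|(?<=[,\.])(?![,\.\ ])'              # position right after punctuation, not before , . or space
    r'|(?<=[,\.])\ +(?![,\.\ ])'           # spaces right after punctuation, not followed by , . or space
    r'|(?<![,\.\ ])\ +(?=[,\.][,\.\ ]+)'   # spaces not after a delimiter, before punctuation plus more delimiters
    r'|(?<![,\.\ ])\ +(?![,\.\ ])'         # spaces with no delimiter on either side
)

# Drop the empty piece left behind when the input ends with a space.
edit_tokens = [t for t in DELIM.split('Some text ,. , some more text') if t]
tail_tokens = [t for t in DELIM.split('MAKES , some sense ') if t]

print(edit_tokens)  # ['Some', 'text', ',. ,', 'some', 'more', 'text']
print(tail_tokens)  # ['MAKES', ',', 'some', 'sense']
```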

Additional

In a comment, @revo provided a shorter version of my pattern:

\s+(?=[^.,\s])|\b(?:\s+|(?=[,.]))|(?<=[,.])\b

See DEMO
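The shorter pattern can be tried from Python in the same way. This is only a sketch: it assumes Python 3.7 or later (needed for splitting on zero-width matches), and SHORT, samples, and results are illustrative names. Unlike the longer pattern, it uses \s and \b, so it treats any whitespace as a separator and relies on word boundaries rather than a literal space.

```python
import re

# Assumes Python 3.7+, where re.split also splits on empty matches.
SHORT = re.compile(r'\s+(?=[^.,\s])|\b(?:\s+|(?=[,.]))|(?<=[,.])\b')

samples = [
    'Some text ,. , some more text',
    'MAKES , some sense ',
]

# Filter out the empty piece produced when the input ends with whitespace.
results = [[t for t in SHORT.split(text) if t] for text in samples]

for tokens in results:
    print(tokens)
```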

fronthem answered Oct 14 '22 15:10