I want to split strings only once based on multiple delimiters like "and", "&" and "-" in that order. Example:
'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']
I tried the below code to achieve this:
import re
re.split('and|&|-', string, maxsplit=1)
It works for all cases except the last one. As it does not follow the hierarchy, for the last one it returns:
'dsfsd - adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
How can I achieve this?
There are several ways to split a string on multiple delimiters in Python. The simplest is the built-in str.split() method, but it is meant for simple cases with a single separator. re.split() is more flexible and handles more complex scenarios.
To split a string on multiple delimiters in Python, use re.split(), which splits the string at each occurrence of the pattern.
A regular expression can combine several separators in one pattern, e.g. re.split(r'[-_]+', s) splits on runs of hyphens and underscores.
To split a string on a single delimiter, use str.split(), which splits the string into substrings and returns them as a list.
This would be impractical with a single regular expression. You could get it to work with negative lookarounds, but it would get quite complicated with each additional delimiter. It's pretty trivial to do this with plain old str.split() and a few lines of code. All you have to do is check if splitting with the current delimiter gives you two elements. If it does, that's your answer. If not, move on to the next delimiter:
def split_new(inp, delims):
    for d in delims:
        result = inp.split(d, maxsplit=1)
        if len(result) == 2:
            return result
    return [inp]  # If nothing worked, return the input
To test this:
teststrs = ['121 34 adsfd', 'dsfsd and adfd', 'dsfsd & adfd', 'dsfsd - adfd', 'dsfsd and adfd and adsfa', 'dsfsd and adfd - adsfa', 'dsfsd - adfd and adsfa']
for t in teststrs:
    print(repr(t), '->', split_new(t, ['and', '&', '-']))
outputs
'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']
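One caveat with plain str.split() is that it also matches "and" inside longer words (for example "sandy"). If that matters, a possible variation swaps str.split() for re.split() with word boundaries. This is only a sketch: it assumes that purely alphanumeric delimiters should match as whole words, and the name split_whole_words is made up for this illustration.

```python
import re

def split_whole_words(inp, delims):
    """Split once on the first delimiter (in priority order) that occurs."""
    for d in delims:
        # Assumption: alphanumeric delimiters such as 'and' must be whole words,
        # so wrap them in \b; symbols such as '&' and '-' are matched literally.
        pattern = rf"\b{re.escape(d)}\b" if d.isalnum() else re.escape(d)
        result = re.split(pattern, inp, maxsplit=1)
        if len(result) == 2:
            return result
    return [inp]  # No delimiter found, return the input unchanged

print(split_whole_words('sandy and bob', ['and', '&', '-']))
print(split_whole_words('dsfsd - adfd and adsfa', ['and', '&', '-']))
```

With str.split() the first string would be cut inside "sandy"; here the cut lands on the standalone "and".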
Try:
import re
tests = [
    ["121 34 adsfd", ["121 34 adsfd"]],
    ["dsfsd and adfd", ["dsfsd ", " adfd"]],
    ["dsfsd & adfd", ["dsfsd ", " adfd"]],
    ["dsfsd - adfd", ["dsfsd ", " adfd"]],
    ["dsfsd and adfd and adsfa", ["dsfsd ", " adfd and adsfa"]],
    ["dsfsd and adfd - adsfa", ["dsfsd ", " adfd - adsfa"]],
    ["dsfsd - adfd and adsfa", ["dsfsd - adfd ", " adsfa"]],
]
for s, result in tests:
    res = re.split(r"and|&(?!.*and)|-(?!.*and|.*&)", s, maxsplit=1)
    print(res)
    assert res == result
Prints:
['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']
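Explanation:
The regex and|&(?!.*and)|-(?!.*and|.*&) uses 3 alternatives:
    and   always matches
    &     matches only if there is no "and" ahead of it (using the negative lookahead (?!...))
    -     matches only if there is no "and" or "&" ahead of it
Checking only ahead is enough: the regex engine scans left to right, so a higher-priority delimiter earlier in the string would already have matched.
We're using this pattern in re.split with maxsplit=1, so the string is split only on the first match.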
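If the list of delimiters grows, the lookahead pattern above can be generated rather than written by hand. The following is a sketch and not part of the original answer; the helper names build_priority_pattern and split_by_priority are hypothetical. It uses re.escape so delimiters are treated literally, and it sets re.DOTALL so the lookahead also sees past newlines (an assumption, since the original pattern does not set that flag).

```python
import re

def build_priority_pattern(delims):
    """Build one regex from delimiters ordered from highest to lowest priority."""
    parts = []
    for i, d in enumerate(delims):
        part = re.escape(d)
        if i > 0:
            # A lower-priority delimiter may match only if no higher-priority
            # delimiter appears later in the string.
            higher = "|".join(re.escape(h) for h in delims[:i])
            part += rf"(?!.*(?:{higher}))"
        parts.append(part)
    return re.compile("|".join(parts), re.DOTALL)

def split_by_priority(s, delims):
    # maxsplit=1 splits on the first (leftmost) match only
    return build_priority_pattern(delims).split(s, maxsplit=1)

print(split_by_priority("dsfsd - adfd and adsfa", ["and", "&", "-"]))
print(split_by_priority("dsfsd and adfd - adsfa", ["and", "&", "-"]))
```

Each extra delimiter adds one more alternative whose lookahead lists everything ranked above it, so the pattern grows with the list but the calling code stays the same.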
You can keep a list of the delimiters, ordered by priority. Then you can combine re.findall with re.split: re.findall collects every delimiter that occurs in the string, and the split uses only the one that ranks highest (has the lowest index) in ops:
import re
def split_order(s):
    # r: every delimiter found in s; ops: delimiters from highest to lowest priority
    r, ops = re.findall(r'(?<=\s)and(?=\s)|\&|\-', s), ['and', '&', '-']
    # m: index in ops of the highest-priority delimiter that occurs in s
    m = -1 if not r else min([ops.index(i) for i in r])
    # Split on that delimiter only, then rebuild everything after the first split point
    a, *b = re.split('|'.join(l:=[i for i in r if ops.index(i) == m]), s)
    return [s] if not l else ([a] if not b else [a, s[len(a)+len(l[0]):]])

vals = ['121 34 adsfd', 'dsfsd and adfd', 'dsfsd & adfd', 'dsfsd - adfd', 'dsfsd and adfd and adsfa', 'dsfsd and adfd - adsfa', 'dsfsd - adfd and adsfa']
for i in vals:
    print(split_order(i))
Output:
['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']