First of all, the question is tagged with Python and regex, but it is not really tied to those; an answer can be high level.
At the moment I'm splitting a string on multiple delimiters with the following pattern. There are actually more delimiting patterns and they are more complex, but let's keep it simple and limit them to two characters, # and *:
parts = re.split(r'#|\*', string)
With this approach, the string aaa#bbb*ccc#ddd is split into 4 substrings: aaa, bbb, ccc, ddd. But it is required to split either by the delimiter that occurs first in the string or by the delimiter that is most frequent in the string. aaa#bbb*ccc#ddd should be split into aaa, bbb*ccc, ddd and aaa*bbb#ccc*ddd should be split into aaa, bbb#ccc, ddd.
I know a straightforward way to achieve that: find which delimiter occurs first or is the most frequent in the string, and then split with that single delimiter. But the method has to be efficient, and I'm wondering whether it is possible to achieve that with a single regular expression. The question is mostly about splitting on the first occurrence among the set of delimiters; for the most frequent delimiter case, the occurrence counts will almost certainly have to be calculated in advance.
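For reference, the straightforward two-step way for the first-occurrence case could look roughly like the sketch below. The function name split_on_first_delimiter and the default delimiter tuple are made up for illustration, and the sketch assumes that no delimiter is a prefix of another one.

```python
import re

def split_on_first_delimiter(text, delimiters=("#", "*")):
    """Split `text` by whichever of `delimiters` appears first in it."""
    # Escape each delimiter so characters such as '*' are matched literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    match = re.search(pattern, text)
    if match is None:
        # No delimiter present: nothing to split on.
        return [text]
    # Split only on the delimiter that was found first.
    return text.split(match.group(0))

print(split_on_first_delimiter("aaa#bbb*ccc#ddd"))  # ['aaa', 'bbb*ccc', 'ddd']
print(split_on_first_delimiter("aaa*bbb#ccc*ddd"))  # ['aaa', 'bbb#ccc', 'ddd']
```

This scans the string at most twice (once up to the first delimiter, once for the split), so it is not a single regex, but it stays linear in the length of the string.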
Update:
The question does not ask to split by the first-occurring and the most frequent delimiter simultaneously; either of these methods individually will be sufficient. I do understand that splitting by the most frequent delimiter is not possible with regex without determining the delimiter beforehand, but I think there's a chance that splitting by the first occurrence is possible with regex and lookahead, without any preparation in advance.
it is required to split either by a delimiter that occurs first in the string or by a delimiter that is most frequent in the string.
So you can first find all the delimiters and preserve them in a proper container with their frequency, then find the most common and first one, then split your string based on them.
Now, to find the delimiters you need to separate them from the plain text based on a particular feature, for example that they are non-word characters. To keep the count of each delimiter we can use a dictionary (in this case collections.Counter() will do the job).
Demo:
>>> import re
>>> from collections import Counter
>>> s = "aaa#bbb*ccc#ddd*rkfh^ndjfh*dfehb*erjg-rh@fkej*rjh"
>>> delimiters = re.findall(r'\W', s)
>>> first = delimiters[0]
>>> first
'#'
>>> Counter(delimiters)
Counter({'*': 5, '#': 2, '^': 1, '-': 1, '@': 1})
>>> frequent = Counter(delimiters).most_common(1)[0][0]
>>> frequent
'*'
>>> re.split(r'\{}|\{}'.format(first, frequent), s)
['aaa', 'bbb', 'ccc', 'ddd', 'rkfh^ndjfh', 'dfehb', 'erjg-rh@fkej', 'rjh']
Note that if you are dealing with delimiters that are more than one character long, you can use re.escape() to escape the special regex characters (like *).
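As a rough sketch of that note, the most-frequent case with multi-character delimiters might be handled as follows. The function name split_on_most_frequent and the sample delimiters :: and -- are hypothetical, and the sketch assumes that the delimiters do not overlap (none is a prefix of another).

```python
import re
from collections import Counter

def split_on_most_frequent(text, delimiters):
    """Split `text` by the delimiter from `delimiters` that occurs most often."""
    # re.escape() makes every delimiter match literally inside the alternation.
    pattern = "|".join(re.escape(d) for d in delimiters)
    counts = Counter(re.findall(pattern, text))
    if not counts:
        # None of the delimiters occurs in the text.
        return [text]
    most_frequent = counts.most_common(1)[0][0]
    return text.split(most_frequent)

print(split_on_most_frequent("a::b--c::d--e::f", ["::", "--"]))
# ['a', 'b--c', 'd--e', 'f']
```

If two delimiters are equally frequent, most_common() returns the one that was encountered first, so in that case the result coincides with splitting on the earlier of the tied delimiters.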