
Split string by the first occurrence from a set of delimiters with Python and regex

First of all, the question is tagged with Python and regex, but it is not really tied to those: a high-level answer is fine.

At the moment I'm splitting a string with multiple delimiters with the following pattern. There are actually more delimiting patterns and they are more complex but let's keep it simple and limit them to 2 characters - # and *:

parts = re.split(r'#|\*', string)

With such an approach, the string aaa#bbb*ccc#ddd is split into 4 substrings: aaa, bbb, ccc, ddd. But the string must be split either by the delimiter that occurs first in it or by the delimiter that is most frequent in it. So aaa#bbb*ccc#ddd should be split into aaa, bbb*ccc, ddd, and aaa*bbb#ccc*ddd should be split into aaa, bbb#ccc, ddd.

I know a straightforward way to achieve that: find which delimiter occurs first or is the most frequent in the string, and then split with that single delimiter. But the method has to be efficient, and I'm wondering whether it can be done with a single regular expression. The question is mostly about splitting by the first-occurring delimiter from the set; for the most frequent delimiter, the occurrence counts will almost certainly have to be calculated in advance.
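For reference, the straightforward two-step approach described above could be sketched as follows. This is only an illustration: the helper name split_by_first_delimiter and the DELIMITERS list are made up for this example, and the real delimiter patterns are assumed to be literal strings.

```python
import re

# Hypothetical delimiter set; mirrors the two characters used in the question.
DELIMITERS = ["#", "*"]

def split_by_first_delimiter(text, delimiters=DELIMITERS):
    # Try longer delimiters first so a short one cannot shadow a longer
    # one that starts with the same characters.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    # re.search returns the leftmost match, i.e. the delimiter that occurs first.
    match = re.search(pattern, text)
    if match is None:
        return [text]  # no delimiter present at all
    # Split on that single literal delimiter only.
    return text.split(match.group(0))

print(split_by_first_delimiter("aaa#bbb*ccc#ddd"))  # ['aaa', 'bbb*ccc', 'ddd']
print(split_by_first_delimiter("aaa*bbb#ccc*ddd"))  # ['aaa', 'bbb#ccc', 'ddd']
```

This scans the string twice (once to find the delimiter, once to split), which is the cost the question hopes to avoid with a single expression.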

Update:

The question does not ask to split by the first-occurring and the most frequent delimiter simultaneously; either of these methods on its own will be sufficient. I do understand that splitting by the most frequent delimiter is not possible with regex without determining that delimiter beforehand, but I think there's a chance that splitting by the first-occurring delimiter is possible with regex and lookahead, without any preparation in advance.

Andrey Grachev asked Aug 16 '16 14:08


1 Answer

it is required to split either by a delimiter that occurs first in the string or by a delimiter that is most frequent in the string.

So you can first find all the delimiters and keep them in a suitable container along with their frequencies, then pick out the first one and the most common one, and finally split your string on them.

To find the delimiters, you need to distinguish them from the plain text by some particular feature, for example that they are non-word characters. To keep a count of each delimiter we can use a dictionary (in this case collections.Counter() will do the job).

Demo:

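>>> import re
>>> from collections import Counter
>>> s = "aaa#bbb*ccc#ddd*rkfh^ndjfh*dfehb*erjg-rh@fkej*rjh"
>>> delimiters = re.findall(r'\W', s)  # every non-word character, in order
>>> first = delimiters[0]
>>> first
'#'
>>> Counter(delimiters)
Counter({'*': 5, '#': 2, '^': 1, '-': 1, '@': 1})
>>> frequent = Counter(delimiters).most_common(1)[0][0]
>>> frequent
'*'
>>> # re.escape() keeps characters such as * from being read as regex syntax
>>> re.split('{}|{}'.format(re.escape(first), re.escape(frequent)), s)
['aaa', 'bbb', 'ccc', 'ddd', 'rkfh^ndjfh', 'dfehb', 'erjg-rh@fkej', 'rjh']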

Note that if your delimiters are longer than one character, you can use re.escape() to escape any special regex characters (such as *) that they contain.
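As a rough sketch of that note, the same counting idea can be applied to a known set of multi-character delimiters. The function name split_by_most_frequent and the sample delimiters "::" and "--" are assumptions made for this example and do not come from the question.

```python
import re
from collections import Counter

def split_by_most_frequent(text, delimiters):
    # Longer delimiters go first so they win over shorter ones that share a prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape() neutralises special characters such as * inside each delimiter.
    pattern = "|".join(re.escape(d) for d in ordered)
    counts = Counter(re.findall(pattern, text))
    if not counts:
        return [text]  # none of the delimiters occur in the text
    most_frequent = counts.most_common(1)[0][0]
    # Split on the single winning delimiter as a literal string.
    return text.split(most_frequent)

print(split_by_most_frequent("aaa#bbb*ccc#ddd", ["#", "*"]))  # ['aaa', 'bbb*ccc', 'ddd']
print(split_by_most_frequent("a::b--c::d", ["::", "--"]))     # ['a', 'b--c', 'd']
```

When two delimiters are equally frequent, most_common() returns the one that was encountered first on Python 3.7 and later, so ties fall back to the first-occurrence rule.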

Mazdak answered Oct 31 '22 16:10