
Splitting on regex without removing delimiters

So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you Tom?", "I am Danny."]

That is, I want to split the text on the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this?

I am aware of these questions:

JS string.split() without removing the delimiters

Python split() without removing the delimiter

But my case has several possible delimiters (., ? and !), which complicates things.

GA1 asked May 29 '17 14:05



2 Answers

You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches:

import re

s = """You! Are you Tom? I am Danny."""
# Raw string; ? needs no escaping inside a character class
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
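
Note that the matches above keep the space that followed the previous delimiter. As a minimal sketch of one way to drop it (the capture group and the leading \s* are my additions, not part of the original answer), you can consume the whitespace outside a group so that re.findall returns only the group:

```python
import re

s = "You! Are you Tom? I am Danny."

# \s* eats whitespace left over from the previous sentence;
# with exactly one capture group, findall returns only that group.
sentences = re.findall(r"\s*(.*?[.!?])", s)
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```

Since . does not match newlines by default, text spanning several lines would probably need the re.DOTALL flag as well.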
Psidom answered Sep 22 '22 20:09


With a split function that supports zero-length matches, you could achieve this by matching the empty string preceded by one of the delimiters:

(?<=[.!?])

Demo: https://regex101.com/r/ZLDXr1/1

Before Python 3.7, re.split did not split on zero-length matches; from 3.7 onward it does. The same pattern is also useful in other languages that support lookbehinds.
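
For readers on Python 3.7 or later, where re.split accepts patterns that match the empty string, here is a rough sketch of the lookbehind split. The filtering and stripping step is my own addition: the lookbehind also matches at the very end of the string, which leaves an empty trailing piece.

```python
import re

s = "You! Are you Tom? I am Danny."

# Requires Python 3.7+: split at every position preceded by . ! or ?
parts = re.split(r"(?<=[.!?])", s)

# The pieces keep their leading spaces and the last piece is empty,
# so strip each one and drop the blanks.
sentences = [p.strip() for p in parts if p.strip()]
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```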

However, judging by your input/output samples, what you actually need is to split on the whitespace that follows one of the delimiters. So the regex would be:

(?<=[.!?])\s+

Demo: https://regex101.com/r/ZLDXr1/2

Python demo: https://ideone.com/z6nZi5
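
For reference, a minimal inline sketch of that split (not necessarily identical to the linked demo). Because the pattern always consumes at least one whitespace character, it does not rely on zero-length matching; the s.strip() call is my addition, to avoid an empty last element when the input ends in whitespace.

```python
import re

s = "You! Are you Tom? I am Danny."

# Split on whitespace that is preceded by ., ! or ?;
# the lookbehind leaves the delimiter attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", s.strip())
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```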

If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.

Dmitry Egorov answered Sep 18 '22 20:09