Best regex (in Python) to not have double space in result when substring is removed?

Tags: python, regex

I am trying to remove a substring using regex in Python.

The substring could be the entire string, at the beginning, in the middle, or at the end.

The goal is that the resulting string should not have extra spaces where the substring existed.

Could you suggest a simple and efficient regex that achieves this?


Here are examples of the scenarios*, and my expected results:

  'before remove after' --> 'before after' (separated by single space)
  'remove after'        --> 'after'        (no space)
  'before remove'       --> 'before'       (no space)
  'remove'              --> ''             (no space, empty string)

* before, remove, and after may themselves internally contain any character (letters, numbers, spaces, etc.).


The regex should achieve the following:

  • If there was text before and text after the substring, the two parts should be separated by a single space.
  • If there was only text before the substring, the result should not have a space at the end.
  • If there was only text after the substring, the result should not have a space in the beginning.
  • If there was no text before and no text after the substring, the result should be an empty string.

Here are a couple of my attempts, but I could not get all scenarios to work...

  import re

  s1 = 'before remove after'
  s2 = 'remove after'
  s3 = 'before remove'
  s4 = 'remove'

  # (1) Just replace with empty string ''...

  re.sub(r'remove', '', s1)
  'before  after'  # <-- bad (two spaces in result)

  re.sub(r'remove', '', s2)
  ' after'         # <-- bad (space in the beginning)

  re.sub(r'remove', '', s3)
  'before '        # <-- bad (space at the end)

  re.sub(r'remove', '', s4)
  ''               # <-- good (empty string)

  # (2) Capture the "before" part excluding space suffixes,
  #     capture the "after" part excluding space prefixes,
  #     and recombine them with a single space...

  re.sub(r'(.*?)\s*remove\s*(.*?)', '\\1 \\2', s1)
  'before after'   # <-- good (single space)

  re.sub(r'(.*?)\s*remove\s*(.*?)', '\\1 \\2', s2)
  ' after'         # <-- bad (space in the beginning)

  re.sub(r'(.*?)\s*remove\s*(.*?)', '\\1 \\2', s3)
  'before '        # <-- bad (space at the end)

  re.sub(r'(.*?)\s*remove\s*(.*?)', '\\1 \\2', s4)
  ' '              # <-- bad (should be an empty string)
asked Dec 04 '25 17:12 by Enterprise


2 Answers

Try this:

import re

s = 'before remove after'
s1 = 'remove after'
s2 = 'before remove'
s3 = 'remove'

# Match "remove" plus an optional trailing whitespace char,
# or "remove" preceded by one whitespace char.
pattern = r"(remove\s?)|(\sremove)"

print(re.sub(pattern, "", s))
print(re.sub(pattern, "", s1))
print(re.sub(pattern, "", s2))
print(re.sub(pattern, "", s3))
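A quick way to confirm that this pattern covers the four scenarios from the question is to assert on each expected result. This is only a sketch: the `strip_remove` helper and the `cases` dict are illustrative names, not part of the answer. The last line shows a case outside those four scenarios, where runs of several spaces are not collapsed.

```python
import re

PATTERN = re.compile(r"(remove\s?)|(\sremove)")

def strip_remove(text):
    """Delete 'remove' together with at most one adjacent whitespace char."""
    return PATTERN.sub("", text)

# Expected results taken from the question.
cases = {
    'before remove after': 'before after',
    'remove after': 'after',
    'before remove': 'before',
    'remove': '',
}

for source, expected in cases.items():
    assert strip_remove(source) == expected, (source, strip_remove(source))

# Only one whitespace character is consumed per match, so runs of
# spaces around the substring remain in the result.
print(repr(strip_remove('before   remove   after')))
```

If inputs may contain several consecutive spaces around the substring, this pattern alone is probably not enough.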

answered Dec 06 '25 07:12 by aziz k'h


Using a pattern without a lambda, you could use a capturing group in the replacement. That group should contain either a single space when remove is surrounded by words, or an empty string when remove is surrounded only by optional spaces.

(?:(?<=\S)( )+)? *remove *(?(1) (?=\S)(?!remove\b))

Explanation

  • (?: Non-capturing group
    • (?<=\S) Positive lookbehind, assert that what is directly to the left is a non-whitespace char
    • ( )+ Capture group 1, match a space 1+ times. A repeated group keeps only the value of its last iteration, which is the single space we need in the replacement
  • )? Close the non-capturing group and make it optional
  • *remove * Match remove between optional spaces
  • (?(1) (?=\S)(?!remove\b)) Conditional: if group 1 exists, match a space and assert that what is directly to the right is a non-whitespace char, but not the start of the word remove
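The breakdown above can also be written into the pattern itself with `re.VERBOSE`. This is a sketch rather than part of the original answer: in verbose mode unescaped spaces are ignored, so each literal space is written as `[ ]` here, and the names `COMPACT` and `ANNOTATED` are made up for the illustration.

```python
import re

# The pattern exactly as given in the answer.
COMPACT = r"(?:(?<=\S)( )+)? *remove *(?(1) (?=\S)(?!remove\b))"

# The same pattern, annotated. [ ] stands for one literal space.
ANNOTATED = re.compile(r"""
    (?:                  # optional prefix, tried first
        (?<=\S)          #   previous char must be non-whitespace
        ([ ])+           #   group 1: 1+ spaces, keeps the last one
    )?
    [ ]* remove [ ]*     # the word between optional spaces
    (?(1)                # only when group 1 took part in the match:
        [ ]              #   one more space ...
        (?=\S)           #   ... followed by a non-whitespace char
        (?!remove\b)     #   ... that is not the start of the word remove
    )
""", re.VERBOSE)

for s in ['before remove after', 'remove after', 'before remove', 'remove']:
    # Both spellings should give the same replacement result.
    assert ANNOTATED.sub(r"\1", s) == re.sub(COMPACT, r"\1", s)
    print(repr(s), '==>', repr(ANNOTATED.sub(r"\1", s)))
```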

Example code

import re

strings = [
    'before remove after',
    'remove after',
    ' remove',
    'remove ',
    ' remove ',
    'before remove',
    'remove',
    'before   remove   after',
    'before remove     after remove before',
    'before remove after remove before remove',
    'before remove after remove before   remove   ',
    'after remove before before remove   remove remove',
    'remove remove    remove   '
]
pattern = r"(?:(?<=\S)( )+)? *remove *(?(1) (?=\S)(?!remove\b))"
for s in strings:
    print("'{0}' ==> '{1}'".format(s, re.sub(pattern, r"\1", s)))

Output (between single quotes to show the empty strings)

'before remove after' ==> 'before after'
'remove after' ==> 'after'
' remove' ==> ''
'remove ' ==> ''
' remove ' ==> ''
'before remove' ==> 'before'
'remove' ==> ''
'before   remove   after' ==> 'before after'
'before remove     after remove before' ==> 'before after before'
'before remove after remove before remove' ==> 'before after before'
'before remove after remove before   remove   ' ==> 'before after before'
'after remove before before remove   remove remove' ==> 'after before before'
'remove remove    remove   ' ==> ''
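Since the question notes that the substring may contain any character, one possible way to reuse the pattern for an arbitrary substring is to build it with `re.escape`. The helper name `remove_substring` is hypothetical. Note that the `\b` in the lookahead assumes the substring ends in a word character, so treat this as a sketch to adapt, not a general solution.

```python
import re

def remove_substring(text, target):
    """Remove `target` from `text` using the conditional pattern above.

    Hypothetical helper: `target` is escaped so that characters such as
    '+' or '.' are matched literally instead of as regex syntax.
    """
    t = re.escape(target)
    pattern = r"(?:(?<=\S)( )+)? *" + t + r" *(?(1) (?=\S)(?!" + t + r"\b))"
    # Group 1 is a single space when text surrounds the match,
    # and is unset (replaced by an empty string) otherwise.
    return re.sub(pattern, r"\1", text)

print(repr(remove_substring('before remove after', 'remove')))  # 'before after'
print(repr(remove_substring('before a+b after', 'a+b')))        # 'before after'
print(repr(remove_substring('x to remove y', 'to remove')))     # 'x y'
```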

Note

  • If you want to match any whitespace char, including tabs and newlines, you can use \s instead of a space.

  • If you want to match any whitespace char except a newline, you can use [^\S\r\n] instead of a space.
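The difference between the two variants shows up when the substring sits at the end of a line. The sketch below swaps each whitespace class into the pattern; `build_pattern` is a made-up helper name for this illustration. Keep in mind that group 1 now holds whichever whitespace char was captured last, so the replacement is not always a plain space.

```python
import re

def build_pattern(ws):
    """Return the answer's pattern with every literal space replaced by `ws`."""
    return r"(?:(?<=\S)({ws})+)?{ws}*remove{ws}*(?(1){ws}(?=\S)(?!remove\b))".format(ws=ws)

any_ws = build_pattern(r"\s")                 # also matches newlines
horizontal_ws = build_pattern(r"[^\S\r\n]")   # whitespace except \r and \n

text = 'a remove\nb'

# With \s the newline is treated like a space and is consumed.
print(repr(re.sub(any_ws, r"\1", text)))         # 'a b'

# With [^\S\r\n] the newline is left untouched.
print(repr(re.sub(horizontal_ws, r"\1", text)))  # 'a\nb'
```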

answered Dec 06 '25 07:12 by The fourth bird