
Replace the spaces between multiple (3+) capital letters

Tags: regex, r, gsubfn

I have some text where people use capitals with spaces in between to make a substring stand out. I want to remove the spaces inside these substrings. The rule for the pattern is: "at least 3 consecutive capital letters with a space between each letter".

I'm curious how to do this with pure regex, but also with the gsubfn package, as I thought this would be an easy job for it. In the MWE below I crashed and burned: the first attempt throws an error, and the second inserts an extra letter (I'm curious why this is happening).

MWE

x <- c(
    'Welcome to A I: the best W O R L D!',
    'Hi I R is the B O M B for sure: we A G R E E indeed.'
)

## first to show I have the right regex pattern
gsub('(([A-Z]\\s+){2,}[A-Z])', '<FOO>', x)
## [1] "Welcome to A I: the best <FOO>!"               
## [2] "Hi I R is the <FOO> for sure: we <FOO> indeed."

library(gsubfn)
spacrm1 <- function(string) {gsub('\\s+', '', string)}
gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm1, x)
## Error in (function (string)  : unused argument ("L ")
## "Would love to understand why this error is happening"

spacrm2 <- function(...) {gsub('\\s+', '', paste(..., collapse = ''))}
gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm2, x)
## [1] "Welcome to A I: the best WORLDL!"               
## [2] "Hi I R is the BOMBM for sure: we AGREEE indeed."
## "Would love to understand why the extra letter is happening"

Desired Output

[1] "Welcome to A I: the best WORLD!"                 
[2] "Hi I R is the BOMB for sure: we AGREE indeed."
Tyler Rinker asked Nov 22 '17 13:11


2 Answers

Overview

This can be done in R with a single regex, but it's not pretty (although I think it looks pretty sweet!). The answer is also customizable to whatever your needs are (two uppercase letters minimum, three minimum, etc.), i.e. scalable, and it can match one or more horizontal whitespace characters between letters because it doesn't rely on lookbehinds, which require a fixed width.


Code

(?:(?=\b(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A))\p{Lu}\K\h+(?=\p{Lu})

Replacement: Empty string


Edit 1 (non-ASCII letters)

My original pattern used \b, which may not work with Unicode characters (such as É). The following alternative is likely a better approach. It checks to ensure what precedes the first uppercase character is not a letter (from any language/script). It also ensures that it doesn't match an uppercase character at the end of the uppercase series if it is followed by any other letter.

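If you also need to ensure digits don't precede the uppercase letters, you can use [^\p{L}\p{N}] in place of \P{L}. Note that either form of the lookbehind requires some preceding character, so a spaced-out run at the very start of the string is not matched; a negative lookbehind such as (?<!\p{L}) would cover that case.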

(?:(?<=\P{L})(?=(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A))\p{Lu}\K\h+(?=\p{Lu}(?!\p{L}))

Usage

x <- c(
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
)
gsub("(?:(?=\\b(?:\\p{Lu}\\h+){2}\\p{Lu})|\\G(?!\\A))\\p{Lu}\\K\\h+(?=\\p{Lu})", "", x, perl=TRUE)

Results

Input

Welcome to A I: the best W O R L D!
Hi I R is the B O M B for sure: we A G R E E indeed.

Output

Welcome to A I: the best WORLD!
Hi I R is the BOMB for sure: we AGREE indeed.

Explanation

  • (?:(?=\b(?:\p{Lu}\h+){2}\p{Lu})|\G(?!\A)) Match either of the following
    • (?=\b(?:\p{Lu}\h+){2}\p{Lu}) Positive lookahead ensuring what follows matches (used as an assertion in this case to find all locations in the string that are in the format A A A). You can also add \b at the end of this positive lookahead to ensure something like I A Name doesn't get matched
      • \b Assert position at a word boundary
      • (?:\p{Lu}\h+){2} Match the following exactly twice
        • \p{Lu} Match an uppercase character in any language (Unicode)
        • \h+ Match one or more horizontal whitespace characters
      • \p{Lu} Match an uppercase character in any language (Unicode)
    • \G(?!\A) Assert position at the end of the previous match; the (?!\A) excludes the start of the string, where \G would otherwise also match
  • \p{Lu} Match an uppercase character in any language (Unicode)
  • \K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
  • \h+ Match one or more horizontal whitespace characters
  • (?=\p{Lu}) Positive lookahead ensuring what follows is an uppercase character in any language (Unicode)

Edit 2 (python)

Below is the Python equivalent of the above (it requires the PyPI regex package to run). I replaced \h with [ \t] because the PyPI regex package doesn't currently support the \h token.

import regex
a = [
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
]

r = regex.compile(r"(?:(?=\b(?:\p{Lu}[ \t]+){2}\p{Lu})|\G(?!\A))\p{Lu}\K[ \t]+(?=\p{Lu})")
for i in a:
    print(r.sub('',i))

The regex above is based on the first pattern. If you're looking to use the second pattern, use this:

(?:(?<=\P{L})(?=(?:\p{Lu}[ \t]+){2}\p{Lu})|\G(?!\A))\p{Lu}\K[ \t]+(?=\p{Lu}(?!\p{L}))

Using a callback

Please see Wiktor's original answer regarding callbacks; this is simply his R program ported to Python. It uses the standard re module rather than the PyPI regex library, so it matches ASCII capitals only and won't match non-ASCII (Unicode) uppercase letters.

import re
a = [
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed."
]

def repl(m):
    return re.sub(r"\s+",'',m.group(0))

for i in a:
    print(re.sub(r"(?:[A-Z]\s+){2,}[A-Z]", repl, i))
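The "customizable minimum" idea from the overview can also be applied to the callback version. The following is only a sketch using the standard re module: the function name collapse_spaced_caps and its min_letters parameter are hypothetical, and the surrounding \b anchors are an assumption added here (they are not in the ported code above) so that something like "I A Name" is left alone.

```python
import re

def collapse_spaced_caps(text, min_letters=3):
    """Remove whitespace inside runs of at least `min_letters` spaced-out ASCII capitals.

    Hypothetical helper for illustration; ASCII-only, like the callback above.
    """
    if min_letters < 2:
        raise ValueError("min_letters must be at least 2")
    # (min_letters - 1) "letter + whitespace" pairs, followed by the closing letter.
    # The \b anchors are an added assumption to avoid matching into a longer word.
    pattern = r"\b(?:[A-Z][ \t]+){%d,}[A-Z]\b" % (min_letters - 1)
    return re.sub(pattern, lambda m: re.sub(r"[ \t]+", "", m.group(0)), text)

s = "Welcome to A I: the best W O R L D!"
print(collapse_spaced_caps(s))                 # Welcome to A I: the best WORLD!
print(collapse_spaced_caps(s, min_letters=2))  # Welcome to AI: the best WORLD!
print(collapse_spaced_caps("I A Name"))        # I A Name
```

Without the trailing \b, the last call would match "I A N" and glue it onto "ame", which is the case the explanation above warns about.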
ctwheels answered Oct 23 '22 05:10


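As I pointed out in the comments, the problem in the first gsubfn call in the question arises from there being two capture groups in the regex yet only one argument to the function. These need to match: two capture groups imply a need for two arguments. The same mismatch explains the extra letter from spacrm2: paste(...) joins both capture groups, and the second one holds only the last repetition of the inner group (for example "L "), so its letter gets appended. We can see what gsubfn is passing by running this and viewing the print statement's output: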

junk <- gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ print(list(...)), x)
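For readers who want to inspect the two groups outside R, here is a rough illustration in Python. It is not gsubfn itself; it only assumes that Python's standard re module numbers capture groups the same way and keeps the last repetition of a repeated group, which is the behaviour the error message in the question ("unused argument ("L ")") points to.

```python
import re

# Same pattern as in the question: an outer group wrapping a repeated inner group.
pattern = re.compile(r"(([A-Z]\s+){2,}[A-Z])")

m = pattern.search("Welcome to A I: the best W O R L D!")
groups = m.groups()
print(groups)  # ('W O R L D', 'L ')

# Joining both groups and stripping whitespace reproduces the stray letter,
# mirroring what paste(...) followed by gsub does in spacrm2.
joined = re.sub(r"\s+", "", " ".join(groups))
print(joined)  # WORLDL

# With a non-capturing inner group there is only one group left to pass along.
single = re.compile(r"((?:[A-Z]\s+){2,}[A-Z])")
single_groups = single.search("Welcome to A I: the best W O R L D!").groups()
print(single_groups)  # ('W O R L D',)
```

The second group only ever contains the final "letter + space" repetition, which is why exactly one extra letter shows up in the output from the question.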

We can address this in any of the following ways:

1) This uses the regex from the question but uses a function that accepts multiple arguments. Only the first argument is actually used in the function.

gsubfn('(([A-Z]\\s+){2,}[A-Z])', ~ gsub("\\s+", "", ..1), x)
## [1] "Welcome to A I: the best WORLD!"              
## [2] "Hi I R is the BOMB for sure: we AGREE indeed."

Note that it interprets the formula as the function:

function (...) gsub("\\s+", "", ..1)

We can view the function generated from the formula like this:

fn$identity( ~ gsub("\\s+", "", ..1) )
## function (...) 
## gsub("\\s+", "", ..1)

2) This uses the regex from the question and also the function from the question but adds the backref = -1 argument which tells it to pass only the first capture group to the function -- the minus means do not pass the entire match either.

gsubfn('(([A-Z]\\s+){2,}[A-Z])', spacrm1, x, backref = -1)

(As @Wiktor Stribiżew points out in his answer backref=0 would also work.)

3) Another way to express this using the regex from the question is:

gsubfn('(([A-Z]\\s+){2,}[A-Z])', x + y ~ gsub("\\s+", "", x), x)

Note that it interprets the formula as this function:

function(x, y) gsub("\\s+", "", x)
G. Grothendieck answered Oct 23 '22 06:10