
Regex Game - Replace each word except specific ones by a variable number of characters

Tags:

regex

Hey there, you Regex Lovers !

I've been quite into regex lately and came up with a purely theoretical problem. To keep it simple, I will present it as a game.

The game :
Let's say you have a list of words separated by spaces.
By a word I mean what regular expressions call a word: [a-zA-Z_0-9]+ (so there is no empty word here)
Example of list :
Horse Banana Joker RoXx0r A_Long_Word Joker 1337

What I want you to do is replace each word except Joker by a number of $ equal to the number of characters in the matched word.
With our previous list we would obtain :
$$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$

In short: I want a regex that matches each word character that is not part of a standalone "Joker" in the string (I mean the occurrences of the word itself, not the individual letters that make up the word Joker)

While it is not easy, it's not impossible (I have my own regex for that). That's why I will set some rules.

The rules :

  • It must be done with only 1 regex
  • I will not accept any regex that works only in specific languages
  • I will still accept most common features like Conditionals, Lookarounds, etc... even if some languages can't read them
  • No recursion allowed (but if you have a working recursive one, post it, just for the beauty of the regex ^^)
  • The regex must be optimized for performance
  • If your regex matches (get it ? ;) ) these rules but does not satisfy me, I will feel free to add some more rules

Added rules :

  • None



To help you out, here are some strings on which the regex must work :
Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers
Must return after replacement :
$$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$

Joker Joker Joker
Must return after replacement :
Joker Joker Joker
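
To check candidate regexes against the two examples above, it can help to have a reference result to compare with. The following is only a minimal sketch in JavaScript and not an answer to the game: it uses a replacement callback, which the "only 1 regex" rule does not allow, and the name maskExceptJoker is made up for this illustration.

```javascript
// Reference result for checking candidates. NOT a valid answer to the game:
// it relies on a callback instead of a plain replacement string.
// maskExceptJoker is a hypothetical helper name.
function maskExceptJoker(list) {
  // \w+ is the word definition used in the question: [a-zA-Z_0-9]+
  return list.replace(/\w+/g, function (word) {
    // Keep the forbidden word, mask every other word character by character.
    return word === 'Joker' ? word : '$'.repeat(word.length);
  });
}

console.log(maskExceptJoker('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
console.log(maskExceptJoker('Joker Joker Joker'));
```

Any regex submitted for the game should produce the same strings as this helper when used with a plain replacement string.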

Again, solving the problem is not the goal here, I want to see different solutions, and more importantly I want to see the best ones !


Solutions :

A very elegant one by Casimir et Hippolyte :
(?:\G(?!^)|(?<!\S)(?!Joker(?:\s|$)))\S (replace : $)
See the post
However, the \G takes the fun out of the problem and does not work in every language, so I can't accept it unless it is possible to create a custom delimiter that is equivalent to \G

Almost accepted answer also by Casimir et Hippolyte :
((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)? (replace : $1$$2)
See the post
Does not work when there are only Joker words in the string

A similar solution by ClasG :
(\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w (replace : $1$$2)
See the post
Does not work when there are only Joker words in the string

Another one by ClasG :
[^Joker\s]|(?<!\b)J|J(?!oker\b)|(?<!\bJ)o|o(?!ker\b)|(?<!\bJo)k|k(?!er\b)|(?<!\bJok)e|e(?!r\b)|(?<!\bJoke)r|r(?!\b) (replace : $)
See the post
Not very efficient, though, but it's another way of seeing things ;)
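
For anyone who wants to try this letter-by-letter approach, here is a small JavaScript sketch. It assumes an engine that supports lookbehind (recent V8/Node.js versions do), and the function name replaceLetterByLetter is made up for the example. Note that in a JavaScript replacement string a literal $ has to be written as $$.

```javascript
// Sketch: ClasG's lookaround list, run through String.prototype.replace.
// Assumes lookbehind support; replaceLetterByLetter is a hypothetical name.
function replaceLetterByLetter(list) {
  // First alternative: any non-space character that is not J, o, k, e or r.
  // Then, for each letter of "Joker": match it if what precedes it or what
  // follows it does not fit the standalone word "Joker".
  const regex = /[^Joker\s]|(?<!\b)J|J(?!oker\b)|(?<!\bJ)o|o(?!ker\b)|(?<!\bJo)k|k(?!er\b)|(?<!\bJok)e|e(?!r\b)|(?<!\bJoke)r|r(?!\b)/g;
  // '$$' in a JavaScript replacement string produces one literal '$'.
  return list.replace(regex, '$$');
}

console.log(replaceLetterByLetter('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
console.log(replaceLetterByLetter('Joker Joker Joker'));
```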

I came up with a similar regex after reading the comment of Rahul below :
(?(?<=\b|\bJ|\bJo|\bJok|\bJoke|\bJoker)(?!(?:Joke|oke|ke|e|)r\b)\w|\w) (replace $)
Regex101
It is also inefficient, but it uses the same lookaround list thing :)

Here is my first solution :
I use a trick that might be considered cheating, but I don't see it that way because it does not change the function you use to replace characters. You just have to append a '$' to the end of the string before running the replacement on it.
So instead of something like :
string = replace(string, regex, '$1$2')
We would have :
string = replace(string+'$', regex, '$1$2')

So here is the regex :
(\bJoker\b)|.$|\w(?=.*(\$)) (replace : $1$2)
Regex 101
This should work with all languages except those not supporting lookaheads (they are rather rare)
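
Here is how the trick above could look in JavaScript, as a minimal sketch. The function name replaceWithSentinel is made up for the example, and the input is assumed to be a single line, because . does not cross newlines.

```javascript
// Sketch of the "append a '$' first" trick with the regex from the post.
// replaceWithSentinel is a hypothetical helper name; single-line input assumed.
function replaceWithSentinel(list) {
  const regex = /(\bJoker\b)|.$|\w(?=.*(\$))/g;
  // Alternative 1 re-emits Joker through $1.
  // Alternative 2 matches the appended '$' at the very end; both groups are
  // empty there, so the sentinel is replaced by nothing.
  // Alternative 3 matches any other word character and borrows the appended
  // '$' through group 2 captured inside the lookahead.
  return (list + '$').replace(regex, '$1$2');
}

console.log(replaceWithSentinel('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
console.log(replaceWithSentinel('Joker Joker Joker')); // stays 'Joker Joker Joker'
```

The lookahead scans to the end of the string for every word character, so the amount of work grows quickly on long inputs.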


Keep posting new regexes if you find any, I want to see more ways to do it ! :)

asked May 04 '17 19:05 by Gawil


2 Answers

For PCRE/Perl/Ruby/Java/.net

find:

(?:\G(?!^)|(?<!\S)(?!Joker(?!\S)))\S

replace:

$

demo

pattern details:

(?:
    \G (?!^) # contiguous to a previous match (but not at the start of the string)
  |        # OR
    (?<!\S)  # not preceded by a non white-space
    (?!Joker(?!\S)) # not followed by the forbidden word
)
\S   # a non-whitespace character

If your words are only composed of word characters, you can simplify the pattern by playing with word and non-word boundaries: (?:\G\B|\b(?!Joker\b))\w


Another way (PCRE/Perl): without the \G feature and with the backtracking control verb (*SKIP) (needs fewer steps):

\s*(?:Joker(?:\s+|$))*(*SKIP)\K.

To be clear, (*SKIP) is only useful when the string ends with the forbidden word or with whitespace. You can also replace it with (*COMMIT).

demo

or:

\bJoker\b(*SKIP)(*F)|\S

and with the regex module for Python from PyPI (which has one word boundary for the start and one for the end of a word):

\mJoker\M(*SKIP)(*F)|\S

One that works with JavaScript (only if there is something to replace):

find:

((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)?

replace: (backreference to group1, escaped $, backreference to group2)

$1$$$2 

demo
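
To make the replacement string above concrete, here is a minimal sketch of the same pattern run through String.prototype.replace. The function name replaceJs is made up for the example. In a JavaScript replacement string $$ stands for one literal $, which is why three $ appear in the middle.

```javascript
// Sketch: the JavaScript-compatible pattern with its replacement string.
// replaceJs is a hypothetical helper name.
function replaceJs(list) {
  // Group 1: spaces and Joker words in front of the character to replace.
  // Group 2 (optional): trailing " Joker" words up to the end of the string.
  const regex = /((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)?/g;
  // '$1' = group 1, '$$' = literal '$', '$2' = group 2 (empty if unmatched).
  return list.replace(regex, '$1$$$2');
}

console.log(replaceJs('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
// Limitation mentioned above: a list made only of Joker words is altered,
// because \S still has to match something.
console.log(replaceJs('Joker Joker Joker'));
```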


Another JavaScript version uses the y flag (which forces the matches to be contiguous). Unfortunately, at the time of writing this flag isn't supported by Internet Explorer, Safari and mobile browsers except Firefox mobile:

var strs = ['Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker', 'Joker Joker Joker'];

strs.forEach(function (s) {
    console.log(s.replace(/(?=((?:\s+|\bJoker\b)*))\1./gy, '$1$$'));
});

The (?=(...))\1 emulates an atomic group (that forbids backtracking).

answered Nov 15 '22 08:11 by Casimir et Hippolyte


Can't really say why, but I wanted to see if I could make it without look-arounds. This is what I ended up with:

(\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w

Substituting that with $1$$2 should do the trick.

It has one limitation though (that I can think of). It won't handle Joker as a single word on the line :(. That's because of the logic behind it...

It matches the word Joker in two alternations: either with a letter following it, or with a letter preceding it. In both cases the word is separated from the letter by non-letters (spaces). There is a third alternative as well: a single letter. If neither of the first two matches, this one finds the letters that are not Joker-related. In the first alternative, the word plus the following spaces (non-letters) is captured into a group (Joker-space). The same goes for the second alternative, but in reversed order (space-Joker). The third alternative doesn't capture anything; it just matches a letter.

Replacing the complete match with $1$$2 (note the literal $ in the middle) inserts the word Joker plus spaces followed by a $ if the first alternation matched. If the first didn't match, but the second did, the inserted replacement is the $ followed by the captured spaces and Joker. If neither of the first two matched, nothing is captured, and the only thing inserted is the sole $, replacing whatever letter matched.

See it here at regex101.
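
If you want to try the three alternatives outside regex101, here is a minimal JavaScript sketch. The function name replaceNoLookaround is made up for the example. One thing to watch: in a JavaScript replacement string the literal $ must be written as $$, so the substitution becomes $1$$$2 there instead of $1$$2.

```javascript
// Sketch: the look-around-free pattern in JavaScript.
// replaceNoLookaround is a hypothetical helper name.
function replaceNoLookaround(list) {
  // Alternative 1: Joker + separators + one character (group 1 = "Joker ").
  // Alternative 2: one character + separators + Joker (group 2 = " Joker").
  // Alternative 3: any other single word character, nothing captured.
  const regex = /(\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w/g;
  // '$1' = group 1, '$$' = literal '$', '$2' = group 2.
  return list.replace(regex, '$1$$$2');
}

console.log(replaceNoLookaround('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
console.log(replaceNoLookaround('Joker Horse'));
// Limitation described above: a line made only of Joker words is altered.
console.log(replaceNoLookaround('Joker Joker Joker'));
```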

Edit:

Just noticed that Casimir et Hippolyte has a version at the end that's similar to mine. They're not identical though, so I'll leave my answer here for now ;)

answered Nov 15 '22 10:11 by SamWhan