
Extra characters on the end of replaced text

Tags: java, regex, php, pcre

In PHP and Java, I applied /^[^\pL]*|[^\pL]*$/ to -A-, replacing each match with *, and got *A**. I applied a symmetric pattern and got an asymmetric result! Why is the output not *A*?

The pattern says that everything except letters at the end of the string should be replaced by *. It is also greedy, so it should replace the whole run of non-letters at once.

Also note that in RegexBuddy I get *A*, which is what I expect.

Update: I simplified the question to focus on my main concern.

Handsome Nerd asked Mar 29 '13 15:03




2 Answers

#^[^\pL]+|[^\pL]+$#u

Replace * with +. Combining * with $ doesn't quite work as one would expect. As a consequence of how the regex engine resumes searching after a match, X*$ can match twice at the end of the string: once for the trailing run of X, and once for the empty string right after it. Using + fixes it.
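As a rough illustration of the difference, here is a minimal Java sketch that runs both quantifiers against the string from the question. The class and method names (StarVsPlus, withStar, withPlus) are made up for this example; only String.replaceAll comes from the standard library.

```java
class StarVsPlus {
    // Pattern from the question: '*' lets either alternative match the empty string.
    static String withStar(String s) {
        return s.replaceAll("^[^\\pL]*|[^\\pL]*$", "*");
    }

    // Suggested fix: '+' requires at least one non-letter per match.
    static String withPlus(String s) {
        return s.replaceAll("^[^\\pL]+|[^\\pL]+$", "*");
    }

    public static void main(String[] args) {
        System.out.println(withStar("-A-")); // *A**
        System.out.println(withPlus("-A-")); // *A*
    }
}
```

With +, an empty string can never satisfy either alternative, so the extra match at the very end of the input disappears.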

Explanation

[^\pL]*$

Let's look at this part of the regex, the part that's not working as expected. Why does it put two *'s at the end of some of the strings?

  1. Consider the third example string ---A--- after the first set of dashes has been replaced:

    *A---$
    
  2. The regex engine finds a match for the regex here:

    *A---$
      ^
    
  3. And replaces "---" with an asterisk:

    *A*$
      ^
    
  4. It then moves its internal cursor to the right of the replacement string.

    *A*$
       ^
    
  5. It starts at this cursor position and looks for another match. And it finds one! It finds ""—the empty string! "" consists of 0-or-more non-letters ([^\pL]*), and it's anchored at the end of the string ($), so it's a valid match. It's found the empty string, sure, but that's allowed.

    It's unexpected because it's matched the $ anchor yet again. Isn't that wrong? It shouldn't match $ again, should it? Well, actually, it should, and does. It can match $ again because $ isn't an actual character in the input string—it's a zero-width assertion. It doesn't get "used up" by the first replacement. $ is allowed to match twice.

  6. And hence, it "replaces" the empty string "" with an asterisk. This is why you end up with two asterisks.

    *A**$
       ^
    
  7. If the regex engine returned to step 4, it would find yet another empty string and add yet another asterisk. Conceptually speaking, there are an infinite number of empty strings there. To avoid this, the engine doesn't allow the next match to start at the same position as the previous one, so after an empty match it moves on instead of matching there again. This rule keeps it from entering an infinite loop.
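The steps above can be observed directly by listing the matches instead of replacing them. The following is a small Java sketch, assuming the standard java.util.regex classes; the MatchTrace class and its trace helper are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    // Records each match as "start-end:text" in the order find() reports it.
    static List<String> trace(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Original pattern: three matches, the last one empty at the end of the input.
        System.out.println(trace("^[^\\pL]*|[^\\pL]*$", "---A---")); // [0-3:---, 4-7:---, 7-7:]
        // With '+': only the two runs of dashes match.
        System.out.println(trace("^[^\\pL]+|[^\\pL]+$", "---A---")); // [0-3:---, 4-7:---]
    }
}
```

The third entry, 7-7, is the empty match from step 5: it starts where the previous match ended, and $ is satisfied again because it consumes no input.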

John Kugelman answered Oct 10 '22 03:10


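The correct regex would look like this:

// + instead of *: every match must contain at least one non-letter.
// The u modifier (as in the other answer) treats pattern and subjects as UTF-8.
$arr = preg_replace('#^[^\pL]+|[^\pL]+$#u', '*',
           array('A','-A-','---A---','-+*A*+-','------------A------------'));
print_r($arr);

Note the + instead of *. This gives the output: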

Array
(
    [0] => A
    [1] => *A*
    [2] => *A*
    [3] => *A*
    [4] => *A*
)

PS: Note that the first element remains unchanged because there is no non-letter character before or after A.
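Since the question also mentions Java, here is a possible Java counterpart of the PHP call above. It is only a sketch: TrimNonLetters and replaceEdges are names invented for this example, and the input list simply mirrors the PHP array.

```java
import java.util.List;
import java.util.stream.Collectors;

class TrimNonLetters {
    // Replaces each leading or trailing run of non-letters with a single '*'.
    static List<String> replaceEdges(List<String> inputs) {
        return inputs.stream()
                .map(s -> s.replaceAll("^[^\\pL]+|[^\\pL]+$", "*"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> inputs = List.of(
                "A", "-A-", "---A---", "-+*A*+-", "------------A------------");
        System.out.println(replaceEdges(inputs)); // [A, *A*, *A*, *A*, *A*]
    }
}
```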

anubhava answered Oct 10 '22 04:10