In PHP and Java, I applied <code>/^[^\pL]*|[^\pL]*$/</code> to <code>-A-</code>, replacing each match with <code>*</code>, and got <code>*A**</code>. I applied a symmetric pattern and got an asymmetric result! Why isn't the output <code>*A*</code>? The pattern says that everything except letters at the end of the string should be replaced by <code>*</code>; it is also greedy, so it should replace all the non-letter characters together. Also note that in RegexBuddy I get <code>*A*</code>, which is what I expect. Update: I simplified the question to focus on my main concern.


Extra characters on the end of replaced text

2 Answers

#^[^\pL]+|[^\pL]+$#u

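Replace * with +. Using * in combination with $ doesn't quite work as one would expect. As a bizarre consequence of how the regex engine works, X*$ matches twice at the end of the string: first the trailing run of X's, then the empty string right after it. Using + fixes it, because + cannot match the empty string.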
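To see the difference between the two quantifiers side by side, here is a minimal Java sketch. The class name `QuantifierComparison` and the helper `mask` are made up for illustration; the two patterns are the one from the question and the corrected one from this answer.

```java
class QuantifierComparison {
    // Pattern from the question: "*" lets each alternative match the empty string.
    static final String STAR = "^[^\\pL]*|[^\\pL]*$";
    // Corrected pattern: "+" requires at least one non-letter per match.
    static final String PLUS = "^[^\\pL]+|[^\\pL]+$";

    // Hypothetical helper: replaces every match of regex in input with "*".
    static String mask(String input, String regex) {
        return input.replaceAll(regex, "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("-A-", STAR)); // prints *A**
        System.out.println(mask("-A-", PLUS)); // prints *A*
    }
}
```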

Explanation

[^\pL]*$

Let's look at this part of the regex, the part that's not working as expected. Why does it put two *'s at the end of some of the strings?

1. Consider the third example string ---A--- after the first set of dashes has been replaced:
   ```
   *A---$
   ```
2. The regex engine finds a match for the regex here:
   ```
   *A---$
     ^
   ```
3. And replaces "---" with an asterisk:
   ```
   *A*$
     ^
   ```
4. It then moves its internal cursor to the right of the replacement string.
   ```
   *A*$
      ^
   ```
5. It starts at this cursor position and looks for another match. And it finds one! It finds "", the empty string! "" consists of 0-or-more non-letters ([^\pL]*), and it's anchored at the end of the string ($), so it's a valid match. It's found the empty string, sure, but that's allowed.

   It's unexpected because it's matched the $ anchor yet again. Isn't that wrong? It shouldn't match $ again, should it? Well, actually, it should, and does. It can match $ again because $ isn't an actual character in the input string; it's a zero-width assertion. It doesn't get "used up" by the first replacement. $ is allowed to match twice.
6. And hence, it "replaces" the empty string "" with an asterisk. This is why you end up with two asterisks.
   ```
   *A**$
      ^
   ```
7. If the regex engine returned to step 4, it would find yet another empty string and add yet another asterisk. Conceptually speaking, there are an infinite number of empty strings there. To avoid this, the engine doesn't allow the next match to start at the same position as the previous one. This rule keeps it from entering an infinite loop.
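The two matches at the end of the string can be observed directly by listing match offsets. This is a small Java sketch, not part of the original answer: the class `TrailingMatchDemo` and the helper `matches` are illustrative names, and only the trailing half of the pattern is tested, on the unmodified input ---A---.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingMatchDemo {
    // Hypothetical helper: records each match as "[start,end) 'text'".
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add("[" + m.start() + "," + m.end() + ") '" + m.group() + "'");
        }
        return found;
    }

    public static void main(String[] args) {
        // "*" version: the trailing dashes, then an empty match at the very end.
        System.out.println(matches("[^\\pL]*$", "---A---")); // [[4,7) '---', [7,7) '']
        // "+" version: only the trailing dashes.
        System.out.println(matches("[^\\pL]+$", "---A---")); // [[4,7) '---']
    }
}
```

The second entry for the `*` version is the zero-length match at offset 7, which is what produces the extra asterisk.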

answered Oct 10 '22 03:10

John Kugelman

The correct regex would look like this:

$arr = preg_replace('#^[^\pL]+|[^\pL]+$#','*', 
           array('A','-A-','---A---','-+*A*+-','------------A------------'));

Note the + instead of *. This gives the output:

Array
(
    [0] => A
    [1] => *A*
    [2] => *A*
    [3] => *A*
    [4] => *A*
)

PS: Note that the first element remains unchanged because there are no non-letter characters before or after A.
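Since the question also mentions Java, here is a rough Java counterpart of the PHP call above, run over the same five inputs. The class `TrimToAsterisk` and the method `maskAll` are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimToAsterisk {
    // Hypothetical helper: applies the "+" pattern to each input string.
    static List<String> maskAll(List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String s : inputs) {
            out.add(s.replaceAll("^[^\\pL]+|[^\\pL]+$", "*"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "A", "-A-", "---A---", "-+*A*+-", "------------A------------");
        System.out.println(maskAll(inputs)); // [A, *A*, *A*, *A*, *A*]
    }
}
```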

answered Oct 10 '22 04:10

anubhava


Tags: java, regex, php, pcre

Handsome Nerd
