
What's the difference between \z and \Z in a regular expression, and when and how do I use each?

Tags: java, regex

From http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html:

\Z  The end of the input but for the final terminator, if any
\z  The end of the input

But what does this mean in practice? Can you give me an example of when I would use \Z and when I would use \z?

In my test I expected "StackOverflow\n".matches("StackOverflow\\z") to return true and "StackOverflow\n".matches("StackOverflow\\Z") to return false. In fact, both return false. Where is my mistake?

Mister M. Bean asked Apr 25 '10 10:04



4 Answers

Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string.

This "enhancement" was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text "joe" results in the string joe\n. When applied to this string, both ^[a-z]+$ and \A[a-z]+\Z will match "joe".

If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A[a-z]+\z does not match joe\n. \z matches after the line break, which is not matched by the character class.

http://www.regular-expressions.info/anchors.html
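The quoted behavior can be checked in Java with a small sketch like the one below. The class name FinalLineBreakDemo and the helper found() are made up for illustration; the helper uses Matcher.find(), so the pattern does not have to consume the whole input.

```java
import java.util.regex.Pattern;

class FinalLineBreakDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    // Uses find(), which (unlike matches()) does not require a full-input match.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // A line as it might be read from a file, trailing line break included.
        String line = "joe\n";

        System.out.println(found("^[a-z]+$", line));     // true: $ matches before the final \n
        System.out.println(found("\\A[a-z]+\\Z", line)); // true: \Z matches before the final \n
        System.out.println(found("\\A[a-z]+\\z", line)); // false: \z only matches after the \n
    }
}
```

So if a trailing line break must be rejected (for example when validating input), \z is the anchor to reach for; \Z and $ let it through.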

The way I read this, "StackOverflow\n".matches("StackOverflow\\z") should return false because your pattern does not include the newline, and matches() has to consume it:

"StackOverflow\n".matches("StackOverflow\\z\\n") => false
"StackOverflow\n".matches("StackOverflow\\Z\\n") => true
Jakob Kruse answered Oct 16 '22 14:10


I just checked it. It looks like when Matcher.matches() is invoked (as happens behind the scenes in your code, since String.matches() delegates to it), \Z behaves like \z. However, when Matcher.find() is invoked, they behave differently, as expected. The following prints true:

Pattern p = Pattern.compile("StackOverflow\\Z");
Matcher m = p.matcher("StackOverflow\n");
System.out.println(m.find());

and if you replace \Z with \z it returns false.

I find this a little surprising...

Eyal Schneider answered Oct 16 '22 13:10


I think the main problem here is the unexpected behavior of matches(): any match must consume the whole input string. Both of your examples fail because the regexes don't consume the linefeed at the end of the string. The anchors have nothing to do with it.

In most languages, a regex match can occur anywhere, consuming all, some, or none of the input string. And Java has a method, Matcher#find(), that performs this traditional kind of match. However, the results are the opposite of what you said you expected:

Pattern.compile("StackOverflow\\z").matcher("StackOverflow\n").find()  //false
Pattern.compile("StackOverflow\\Z").matcher("StackOverflow\n").find()  //true

In the first example, the \z needs to match the end of the string, but the trailing linefeed is in the way. In the second, the \Z matches before the linefeed, which is at the end of the string.
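To see the two effects side by side, here is a minimal sketch that runs the same patterns through both matches() and find(). The class name MatchesVersusFind and the helpers wholeInput() and anywhere() are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Hypothetical helper: the regex must consume the entire input.
    static boolean wholeInput(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: the regex may match any part of the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "StackOverflow\n";

        // matches(): both fail, because neither pattern consumes the linefeed.
        System.out.println(wholeInput("StackOverflow\\z", input)); // false
        System.out.println(wholeInput("StackOverflow\\Z", input)); // false
        // Once the pattern consumes the linefeed, matches() succeeds.
        System.out.println(wholeInput("StackOverflow\\n", input)); // true

        // find(): only here does the choice of anchor make a difference.
        System.out.println(anywhere("StackOverflow\\z", input));   // false
        System.out.println(anywhere("StackOverflow\\Z", input));   // true
    }
}
```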

Alan Moore answered Oct 16 '22 13:10


I think Alan Moore provided the best answer, especially the crucial point that matches() only succeeds if the regex consumes the entire input string, as if the pattern were anchored at both ends.

I'd also like to add a few examples and a little more explanation.

\z matches only at the very end of the string.

\Z also matches at the very end of the string, but if the string ends with a line terminator such as \n, it will also match just before that final terminator.

Consider this program:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+\\Z"); // some word before the end of the string
        String text = "one\ntwo\nthree\nfour\n";
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

It will find 1 match, and print "four".

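Change \Z to \z, and it will not match anything: \z only matches at the very end of the string, after the final \n, and the dot in .+ does not match \n, so the match can never reach that position.

However, the following program, which uses \z, will also print "four", because there's no \n at the end of the text: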

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+\\z");
        String text = "one\ntwo\nthree\nfour";
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
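The two programs above can be folded into one rough sketch that tries both anchors against both texts. The class name AnchorComparison and the helper allMatches() are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorComparison {
    // Hypothetical helper: collects every match that find() reports, in order.
    static List<String> allMatches(String regex, String text) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String withNewline = "one\ntwo\nthree\nfour\n";
        String withoutNewline = "one\ntwo\nthree\nfour";

        System.out.println(allMatches(".+\\Z", withNewline));    // [four]
        System.out.println(allMatches(".+\\z", withNewline));    // []
        System.out.println(allMatches(".+\\Z", withoutNewline)); // [four]
        System.out.println(allMatches(".+\\z", withoutNewline)); // [four]
    }
}
```

The only combination that finds nothing is \z with a trailing \n, which is exactly the case where the two anchors differ.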
pavel_orekhov answered Oct 16 '22 14:10