
Regex pattern discriminating between letters when it shouldn't?

Tags:

java

regex

I'm writing a regex for a simple username validation for practice. While I am sure there may be other issues with this pattern, I would like it if someone could explain this seemingly odd behavior I am getting.

import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;

public class userRegex{
   public static void main(String[] args){
      Scanner in = new Scanner(System.in);
      int testCases = Integer.parseInt(in.nextLine());
      while(testCases>0){
         String username = in.nextLine();
         String pattern = "([[:alpha:]])[a-zA-Z_]{7,29}";
         Pattern r = Pattern.compile(pattern);
         Matcher m = r.matcher(username);

         if (m.find( )) {
            System.out.println("Valid");
         } else {
            System.out.println("Invalid");
         }
         testCases--;
      }
   }
}

When I input:

2
dfhidbuffon
dfdidbuffon

the program should print:

Valid
Valid

but instead, it prints:

Valid
Invalid

Why does it treat the two usernames differently depending on whether the 3rd letter is "h" or "d"?

Edit: Added @Draco18s's and @ruakh's suggestions; however, I am still getting the same strange behaviour.

user2533660 asked Nov 21 '16 01:11


1 Answer

[:alpha:] doesn't have the special meaning that you intend; rather, it ends up just meaning "any of the characters :, a, h, l, p". So dfhidbuffon contains a match for your pattern (namely h plus idbuffon), whereas dfdidbuffon does not. (Note that matcher.find() looks for any match within the string; if you want to specifically match the entire string, you should use matcher.matches(), or you can modify your pattern to use anchors such as ^ and $.)
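To see this for yourself, here is a small sketch that runs the pattern from the question unchanged. The class name AlphaClassDemo and its helper methods are made up for illustration; only the pattern string comes from the question.

```java
import java.util.regex.Pattern;

public class AlphaClassDemo {
    // The pattern from the question, copied unchanged.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("([[:alpha:]])[a-zA-Z_]{7,29}");

    // In Java this is just a set of the literal characters : a l p h.
    static final Pattern ONE_CHAR = Pattern.compile("[[:alpha:]]");

    // True if the question's pattern matches anywhere inside the input (find()).
    static boolean foundAnywhere(String input) {
        return QUESTION_PATTERN.matcher(input).find();
    }

    // True only if the question's pattern matches the entire input (matches()).
    static boolean matchesWhole(String input) {
        return QUESTION_PATTERN.matcher(input).matches();
    }

    // True if the single-character input belongs to the [[:alpha:]] set.
    static boolean inAlphaSet(String singleChar) {
        return ONE_CHAR.matcher(singleChar).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"h", "a", ":", "d", "f"}) {
            System.out.println(s + " -> " + inAlphaSet(s));
        }
        System.out.println("find dfhidbuffon:    " + foundAnywhere("dfhidbuffon"));
        System.out.println("find dfdidbuffon:    " + foundAnywhere("dfdidbuffon"));
        System.out.println("matches dfhidbuffon: " + matchesWhole("dfhidbuffon"));
    }
}
```

Note that matches() rejects dfhidbuffon as well, because the string starts with d, which is not one of :, a, h, l, p. So switching to matches() alone does not fix the pattern; the character class has to change too.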

You may be thinking of the notation found in many regex implementations whereby [:alpha:] means "any alphabetic character"; but firstly, Java's Pattern class doesn't support that notation (hat-tip to ajb for pointing this out), and secondly, those implementations would require [:alpha:] to appear inside a character class, e.g. as [[:alpha:]]. The Java equivalent would be \p{Alpha} or [A-Za-z] if you only want to match ASCII letters, and \p{IsAlphabetic} if you want to match any Unicode letter.
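Putting both points together, one possible corrected validator might look like the sketch below. The class name UsernameValidator and the method isValid are hypothetical; the length bounds and the [a-zA-Z_] tail are copied from the question as-is, so adjust them if your actual rules differ (for example, if digits should be allowed).

```java
import java.util.regex.Pattern;

public class UsernameValidator {
    // \p{Alpha} is Java's ASCII-letter class; the rest is taken from the question.
    static final Pattern USERNAME = Pattern.compile("\\p{Alpha}[a-zA-Z_]{7,29}");

    // matches() requires the whole string to fit the pattern, so no anchors are needed.
    static boolean isValid(String username) {
        return USERNAME.matcher(username).matches();
    }

    // Compares the ASCII-only class with the Unicode-aware one on a single character.
    static boolean isAsciiLetter(String singleChar) {
        return Pattern.matches("\\p{Alpha}", singleChar);
    }

    static boolean isUnicodeLetter(String singleChar) {
        return Pattern.matches("\\p{IsAlphabetic}", singleChar);
    }

    public static void main(String[] args) {
        System.out.println(isValid("dfhidbuffon") ? "Valid" : "Invalid");
        System.out.println(isValid("dfdidbuffon") ? "Valid" : "Invalid");
        System.out.println(isValid("_fdidbuffon") ? "Valid" : "Invalid");

        // \u00e9 is "e with acute accent": a letter, but not an ASCII one.
        System.out.println(isAsciiLetter("\u00e9"));
        System.out.println(isUnicodeLetter("\u00e9"));
    }
}
```

With this version both usernames from the question are accepted, while a name starting with an underscore is rejected because the first character must be a letter.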

ruakh answered Sep 28 '22 05:09