Using Ruby 1.9.2, I have the following Ruby code in IRB:
> r1 = /^(?=.*[\d])(?=.*[\W]).{8,20}$/i
> r2 = /^(?=.*\d)(?=.*\W).{8,20}$/i
> a = ["password", "1password", "password1", "pass1word", "password 1"]
> a.each {|p| puts "r1: #{r1.match(p) ? "+" : "-"} \"#{p}\"".ljust(25) + "r2: #{r2.match(p) ? "+" : "-"} \"#{p}\""}
This results in the following output:
r1: - "password"         r2: - "password"
r1: + "1password"        r2: - "1password"
r1: + "password1"        r2: - "password1"
r1: + "pass1word"        r2: - "pass1word"
r1: + "password 1"       r2: + "password 1"
1.) Why do the results differ?
2.) Why would r1 match on strings 2, 3 and 4? Wouldn't the (?=.*[\W]) lookahead cause it to fail, since there aren't any non-word characters in those examples?
This results from the interaction between a couple of regex features and Unicode. \W matches every non-word character, and that set includes U+212A KELVIN SIGN (K) and U+017F LATIN SMALL LETTER LONG S (ſ). The /i flag then adds the case-insensitive equivalents of both, which are the “normal” k and s characters (U+006B LATIN SMALL LETTER K and U+0073 LATIN SMALL LETTER S). So it’s the s in password that’s being interpreted as a non-word character in certain cases.
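To make the two Unicode characters concrete, here is a small sketch that inspects them directly. It only checks facts that do not depend on the bug itself: both characters are matched by \W, and each is case-related to a plain ASCII letter. The variable names are illustrative, and the String#downcase / String#upcase calls assume Ruby 2.4 or later, where case mapping covers non-ASCII characters. Whether /[\W]/i actually matches "s" depends on the Ruby version, so that is deliberately not shown here.

```ruby
kelvin = "\u212A"  # KELVIN SIGN
long_s = "\u017F"  # LATIN SMALL LETTER LONG S

# \w covers only [a-zA-Z0-9_], so both characters count as non-word characters.
kelvin_is_non_word = !(kelvin =~ /\W/).nil?
long_s_is_non_word = !(long_s =~ /\W/).nil?

# Each one is the case counterpart of an ordinary ASCII letter
# (assumes Ruby 2.4+ for Unicode-aware case mapping).
kelvin_lower = kelvin.downcase  # "k"
long_s_upper = long_s.upcase    # "S"

puts "KELVIN SIGN is non-word: #{kelvin_is_non_word}, downcases to #{kelvin_lower.inspect}"
puts "LONG S is non-word:      #{long_s_is_non_word}, upcases to #{long_s_upper.inspect}"
```

This is the link that lets a case-insensitive [\W] pull in k and s: the class contains K and ſ, and case folding treats them as equivalent to the ASCII letters.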
Note that this only seems to occur when the \W is in a character class (i.e. [\W]). Also, I can only reproduce this in irb; inside a standalone script it seems to work as expected.
See the Ruby bug about this for more information.
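One possible way to sidestep the problem, sketched here as an assumption rather than an official fix: neither pattern contains any letters, so the /i flag has no intended effect and can be dropped. Without it no case folding is applied to the character class, and both forms of the pattern agree. The constant names below are made up for this example.

```ruby
# Same patterns as r1 and r2 from the question, minus the /i flag.
R1_NO_I = /^(?=.*[\d])(?=.*[\W]).{8,20}$/
R2_NO_I = /^(?=.*\d)(?=.*\W).{8,20}$/

candidates = ["password", "1password", "password1", "pass1word", "password 1"]

# Collect the candidates each pattern accepts.
r1_matches = candidates.select { |p| R1_NO_I.match(p) }
r2_matches = candidates.select { |p| R2_NO_I.match(p) }

candidates.each do |p|
  r1_mark = r1_matches.include?(p) ? "+" : "-"
  r2_mark = r2_matches.include?(p) ? "+" : "-"
  puts "r1: #{r1_mark} \"#{p}\"".ljust(25) + "r2: #{r2_mark} \"#{p}\""
end
# Only "password 1" is accepted by either pattern: it is the only
# candidate with both a digit and a non-word character (the space).
```

If case-insensitive matching is needed elsewhere in a pattern, writing the non-word test without the surrounding brackets, as r2 does, avoids the character-class case described above.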