With this regex:
regex1 = /\z/
the following strings match:
"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5
but with these regexes:
regex2 = /#$/?\z/
regex3 = /\n?\z/
they behave differently:
"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil
What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?
The problem you reported is definitely a bug in the Regexp engine of Ruby 2.0.0 (RUBY_VERSION #=> "2.0.0"), but it already exists in the earlier 1.9 releases. It shows up when the encoding allows multi-byte characters, as with __ENCODING__ #=> #<Encoding:UTF-8>.
It does not depend on Linux; it is possible to reproduce the same behavior on OS X and Windows too.
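To check whether a particular interpreter is affected, a minimal sketch like the one below may help. The method name affected_by_multibyte_z_bug? is hypothetical (it is not part of Ruby); it only repeats the two matches from the question and assumes a UTF-8 source file.

```ruby
# Hypothetical helper (not part of Ruby): repeats the two matches from the
# question and reports whether the multi-byte one fails on this interpreter.
def affected_by_multibyte_z_bug?
  ascii     = "hello" =~ /\n?\z/       # ASCII string, matched in the question
  multibyte = "こんにちは" =~ /\n?\z/    # nil is the symptom reported in the question
  ascii == 5 && multibyte.nil?
end

puts "Ruby #{RUBY_VERSION}, source encoding #{__ENCODING__}"
puts "affected: #{affected_by_multibyte_z_bug?}"
```

Running it on each platform and Ruby version of interest gives a quick yes/no answer without reading match indexes by hand.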
Until bug 8210 is fixed, we can help by isolating and understanding the cases in which the problem occurs. This can also be useful for finding a workaround where one applies to a specific case.
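I understand that the problem occurs when all of the following hold:
the pattern is anchored at the end of the string with \z;
the part just before the anchor is a 'zero or one' match (?);
the last character of the string is multi-byte and takes more bytes than the optional part can match.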
The bug may be caused by a confusion between the number of bytes and the number of characters actually checked by the regular expression engine.
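To make the byte versus character distinction concrete, here is a small sketch that prints both counts for the characters used in the tests. The helper char_and_byte_counts is a hypothetical name introduced only for illustration, and the byte counts assume UTF-8.

```ruby
# Hypothetical helper: character count versus UTF-8 byte count of a string.
def char_and_byte_counts(str)
  [str.length, str.encode("UTF-8").bytesize]
end

["ん", "ç", "x", "\n", "んにちは"].each do |str|
  chars, bytes = char_and_byte_counts(str)
  puts "#{str.inspect}: #{chars} char(s), #{bytes} byte(s)"
end
```

In UTF-8, ん takes 3 bytes, ç takes 2, and "\n" takes 1, which are the numbers compared in the comments of the tests.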
A few examples may help:
# TEST1: the last character (は) is 3 bytes
s = "んにちは"
s =~ /ん?\z/u #=> 4 # OK: it works, 3 == 3
s =~ /ç?\z/u #=> nil # KO: BUG when 3 > 2
s =~ /x?ç?\z/u #=> 4 # OK: it works, 3 == (1+2)
s =~ /\n?\z/u #=> nil # KO: BUG when 3 > 1
s =~ /\n?\n?\z/u #=> nil # KO: BUG when 3 > 2
s =~ /\n?\n?\n?\z/u #=> 4 # OK: it works, 3 == (1+1+1)
From the results of TEST1 we can assert: if the last multi-byte character of the string is 3 bytes long, then the 'zero or one before' test only works when we test for at least 3 bytes (not 3 characters) before the end.
# TEST2: the last character (ç) is 2 bytes
s = "in French there is the ç"
s =~ /ん?\z/u #=> 24 # OK 2 <= 3
s =~ /é?\z/u #=> 24 # OK 2 == 2
s =~ /x?é?\z/u #=> 24 # OK 2 < (2+1)
s =~ /\n?\z/u #=> nil # KO 2 > 1 ( the BUG occurs )
s =~ /\n?\n?\z/u #=> 24 # OK 2 == (1+1)
s =~ /\n?\n?\n?\z/u #=> 24 # OK 2 < (1+1+1)
From the results of TEST2 we can assert: if the last multi-byte character of the string is 2 bytes long, then the 'zero or one before' test only works when we check for at least 2 bytes (not 2 characters) before the end.
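The rule drawn from the two tests can be written down as a small predicate. This is only a sketch of my observation, not a description of what the regex engine really does; predicted_to_match? is a hypothetical name, and the string is assumed to be non-empty and UTF-8.

```ruby
# Hypothetical helper encoding the rule observed in TEST1 and TEST2:
# on an affected interpreter the match is expected to succeed only when the
# optional characters before \z add up to at least as many bytes as the
# last character of the string. Assumes a non-empty string.
def predicted_to_match?(str, optional_chars)
  optional_chars.join.bytesize >= str[-1].bytesize
end

test1 = "んにちは"                  # last character is 3 bytes
test2 = "in French there is the ç" # last character is 2 bytes

p predicted_to_match?(test1, ["ç"])        # false (2 < 3)
p predicted_to_match?(test1, ["x", "ç"])   # true  (1 + 2 == 3)
p predicted_to_match?(test2, ["\n"])       # false (1 < 2)
p predicted_to_match?(test2, ["\n", "\n"]) # true  (1 + 1 == 2)
```

The predictions agree with every line of both tests, and also with the ASCII case from the question ("hello" with /\n?\z/, 1 byte against 1 byte).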
When the multi-byte character is not at the end of the string, I found that it works correctly.
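For the specific case in the question (an optional trailing "\n"), one possible workaround is to avoid the regex and use String methods instead. The sketch below assumes the goal is only to find the index where an optional trailing newline starts; optional_newline_index is a hypothetical name, not something from the question or from Ruby.

```ruby
# Hypothetical helper: character index at which an optional trailing "\n"
# starts, or the string length when there is no trailing newline.
# Intended to mirror what `str =~ /\n?\z/` returns on an unaffected interpreter.
def optional_newline_index(str)
  str.end_with?("\n") ? str.length - 1 : str.length
end

p optional_newline_index("hello")        # 5
p optional_newline_index("こんにちは")     # 5
p optional_newline_index("こんにちは\n")   # 5
```

Because String#end_with? and String#length do not go through the regex engine, the multi-byte last character should not matter here. This does not generalize to arbitrary patterns before \z.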
My test code is available in a public gist.
In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.
Update: Two patches have been posted in Ruby trunk.