
Does multibyte character interfere with end-line character within a regex?

With this regex:

regex1 = /\z/

the following strings match:

"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5

but with these regexes:

regex2 = /#$/?\z/
regex3 = /\n?\z/

the results differ:

"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil

What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?
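To check whether a given interpreter is affected, a small script along these lines may help. It is only a sketch: `match_positions` is a hypothetical helper name that does not appear in the question, and what it prints depends on the Ruby version it runs under.

```ruby
# Hypothetical helper (not from the question): runs the three patterns
# from the question against one string and collects the match positions.
def match_positions(str)
  {
    regex1: str =~ /\z/,
    regex2: str =~ /#{$/}?\z/, # $/ is the input record separator, "\n" by default
    regex3: str =~ /\n?\z/
  }
end

puts "Ruby #{RUBY_VERSION}, source encoding #{__ENCODING__}"
["hello", "こんにちは"].each do |s|
  puts "#{s.inspect}: #{match_positions(s).inspect}"
end
```

On an affected version, the second line of results should show nil for regex2 and regex3, as described above.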

sawa asked Apr 03 '13 05:04


2 Answers

The problem you reported is definitely a bug in Regexp. It is present in RUBY_VERSION #=> "2.0.0", and it already existed in 1.9, when the encoding allows multi-byte characters, such as __ENCODING__ #=> #<Encoding:UTF-8>

It does not depend on Linux; the same behavior can be reproduced on OS X and Windows too.

Until bug 8210 is fixed, we can help by isolating and understanding the cases in which the problem occurs. This can also be useful for finding workarounds that apply to specific cases.

I understand that the problem occurs when:

  • the pattern searches for something before the end of string \z,
  • and the last character of the string is multi-byte,
  • and the part before \z uses the zero-or-one quantifier ?,
  • but the zero-or-one characters searched for add up to fewer bytes than the last character of the string.

The bug may be caused by confusion between the number of bytes and the number of characters actually checked by the regular expression engine.
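For reference, the difference between character count and byte count in UTF-8 can be inspected with `String#length` and `String#bytesize`. The snippet below is a minimal illustration rather than part of the original test code. The constant names are made up for this example, and the characters are written as Unicode escapes so the result does not depend on how the file is saved.

```ruby
# Characters that appear in the tests, paired with a readable label.
# SAMPLE_CHARS and GREETING are illustrative names, not from the original gist.
SAMPLE_CHARS = {
  "ha (U+306F)"         => "\u306F", # は
  "n (U+3093)"          => "\u3093", # ん
  "c-cedilla (U+00E7)"  => "\u00E7", # ç
  "e-acute (U+00E9)"    => "\u00E9", # é
  "x"                   => "x",
  "newline"             => "\n"
}

SAMPLE_CHARS.each do |label, ch|
  puts "#{label}: #{ch.length} character, #{ch.bytesize} byte(s)"
end

# The string from the question: 5 characters, each 3 bytes in UTF-8.
GREETING = "\u3053\u3093\u306B\u3061\u306F" # こんにちは
puts "length=#{GREETING.length} bytesize=#{GREETING.bytesize}"
```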

A few examples may help:

TEST 1: where the last character "は" is 3 bytes:

s = "んにちは"

testing for zero or one ん [3 bytes] before the end of the string:

s =~ /ん?\z/u   #=> 4         # OK it works 3 == 3

when we try with ç [2 bytes]:

s =~ /ç?\z/u   #=> nil       # KO: BUG when 3 > 2
s =~ /x?ç?\z/u #=> 4         # OK it works 3 == ( 1+2 )

when we test for zero or one \n [1 byte]:

s =~ /\n?\z/u #=> nil       # KO: BUG when 3 > 1
s =~ /\n?\n?\z/u #=> nil    # KO: BUG when 3 > 2
s =~ /\n?\n?\n?\z/u #=> 4   # OK it works 3 == ( 1+1+1)

From the results of TEST 1 we can assert: if the last multi-byte character of the string is 3 bytes, then the 'zero or one before' test only works when we test for at least 3 bytes (not 3 characters) before.

TEST 2: where the last character "ç" is 2 bytes:

s = "in French there is the ç"

check for zero or one ん [3 bytes]:

s =~ /ん?\z/u #=> 24        # OK 2 <= 3

check for zero or one of é [2 bytes]

s =~ /é?\z/u #=> 24         # OK 2 == 2
s =~ /x?é?\z/u #=> 24       # OK 2 < (2+1)

test for zero or one \n [1 byte]:

s =~ /\n?\z/u    #=> nil    # KO 2 > 1  ( the BUG occurs )
s =~ /\n?\n?\z/u #=> 24     # OK 2 == (1+1)
s =~ /\n?\n?\n?\z/u #=> 24  # OK 2 < (1+1+1)

From the results of TEST 2 we can assert: if the last multi-byte character of the string is 2 bytes, then the 'zero or one before' test only works when we check for at least 2 bytes (not 2 characters) before.
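The rule inferred from the two tests can be written down as a small predicate. This is only a sketch of the hypothesis stated above, not a description of the regex engine's internals, and `predicted_to_fail?` is a hypothetical name introduced for illustration.

```ruby
# Hypothetical predicate encoding the rule inferred from TEST 1 and TEST 2:
# on affected versions the match is expected to fail when the optional
# characters before \z add up to fewer bytes than the string's last character.
def predicted_to_fail?(last_char, optional_chars)
  optional_bytes = optional_chars.map(&:bytesize).inject(0, :+)
  optional_bytes < last_char.bytesize
end

ha = "\u306F" # は, 3 bytes
c  = "\u00E7" # ç, 2 bytes

puts predicted_to_fail?(ha, ["\n"])             # /\n?\z/ against TEST 1's string
puts predicted_to_fail?(ha, ["\n", "\n", "\n"]) # /\n?\n?\n?\z/ against TEST 1's string
puts predicted_to_fail?(c,  ["\n", "\n"])       # /\n?\n?\z/ against TEST 2's string
```

The predicate agrees with every OK and KO result listed in the two tests, but it has only been checked against those cases.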

When the multi-byte character is not at the end of the string, I found that it works correctly.
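For the specific case in the question (an optional trailing newline before the end of the string), one possible workaround on affected versions is to avoid the regex and compute the position with plain string methods. This is a sketch under that assumption: `optional_newline_end_index` is a hypothetical helper, and it is meant to return the same index that /\n?\z/ returns on an unaffected interpreter.

```ruby
# Hypothetical helper (an assumption for illustration, not from the answer):
# returns the character index where an optional final "\n" followed by the
# end of the string begins, without involving the regex engine.
def optional_newline_end_index(str)
  str.end_with?("\n") ? str.length - 1 : str.length
end

puts optional_newline_end_index("hello")
puts optional_newline_end_index("hello\n")
puts optional_newline_end_index("\u3053\u3093\u306B\u3061\u306F") # こんにちは
```

`String#length` counts characters rather than bytes, so the result is a character index, like the one `=~` returns.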

My test code is available in a public gist.

Franco Rondini answered Nov 08 '22 04:11


In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.

Update: Two patches have been posted in Ruby trunk.

sawa answered Nov 08 '22 04:11