
Why does Ruby /[[:punct:]]/ miss some punctuation characters?

Tags: regex, posix, ruby

Ruby's /[[:punct:]]/ is supposed to match all "punctuation characters". According to Wikipedia, this means /[\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]/ per the POSIX standard.

It matches: -[]\;',./!@#%&*()_{}::"?.

However, it does not match: =`~$^+|<> (at least in ruby 1.9.3p194).

What gives?

Sai asked Jun 21 '12 01:06


1 Answer

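The less-than sign '<' is in the Unicode "Symbol, Math" category, not in a punctuation category. The same goes for the other characters you list: each of =`~$^+|<> belongs to one of the Unicode symbol categories (math, currency, or modifier) rather than to punctuation. You can see this if you force the regex's encoding to UTF-8 (it defaults to the source encoding, and presumably your source is UTF-8 encoded, while my default source is something else):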

2.1.2 :004 > /[[:punct:]]/u =~ '<'
 => nil 
2.1.2 :005 > /[[:punct:]]/ =~ '<'
 => 0 
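
To see how the split falls across the whole ASCII range, a minimal sketch like the one below partitions the 32 ASCII punctuation characters by Unicode general category. The variable names are my own, and it assumes a UTF-8 source file:

```ruby
# The 32 printable ASCII characters that are not letters, digits, or space.
ascii_punct = (33..126).map { |n| n.chr(Encoding::UTF_8) }
                       .reject { |c| c =~ /[A-Za-z0-9]/ }

# Split by Unicode general category: P* (punctuation) versus everything else.
punctuation, symbols = ascii_punct.partition { |c| c =~ /\p{P}/ }

puts punctuation.join   # characters Unicode classifies as punctuation
puts symbols.join       # characters Unicode classifies as symbols: $+<=>^`|~
```

The second line of output is exactly the set of characters the question reports as not matching.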

If you force the regex to ASCII encoding with the /n option, you'll see it categorize '<' as punct, which I think is what you want. However, this will probably cause problems if your source contains characters outside the ASCII subset of UTF-8.

2.1.2 :009 > /[[:punct:]]/n =~ '<'
 => 0 
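
If you only care about the ASCII punctuation set and want to avoid the encoding flag altogether, one alternative (my own substitution, not something /n does for you) is to spell the class out as explicit ASCII ranges. This is a sketch; ASCII_PUNCT is a name I made up:

```ruby
# Explicit ASCII ranges: ! to /, : to @, [ to `, and { to ~.
# Together these cover the 32 characters POSIX lists for punct in the C locale.
ASCII_PUNCT = /[!-\/:-@\[-`{-~]/

puts ASCII_PUNCT =~ '<'           # matches regardless of regex encoding
puts ASCII_PUNCT =~ "\u00e9<"     # still works when the string has non-ASCII text
```

Because the class never relies on Unicode categories, its behaviour does not depend on the source or regex encoding.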

A better solution would be to use the 'Symbol' category in your regex instead of the 'punct' one; it matches '<' in UTF-8 encoding:

2.1.2 :012 > /\p{S}/u =~ '<'
 => 0 
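
Note that \p{S} on its own will not match characters like '!' or '.', so if you want everything from the question in one pattern, you could combine both categories. A rough sketch (the constant name is hypothetical, and the two strings are copied from the question):

```ruby
# Matches any character in a Unicode punctuation (P*) or symbol (S*) category.
PUNCT_OR_SYMBOL = /[\p{P}\p{S}]/

matched   = '-[]\\;\',./!@#%&*()_{}:"?'   # reported as matched in the question
unmatched = '=`~$^+|<>'                   # reported as not matched

all_covered = (matched + unmatched).each_char.all? { |c| c =~ PUNCT_OR_SYMBOL }
puts all_covered
```

Keep in mind this class is broader than POSIX punct: it also matches non-ASCII punctuation and symbols such as currency signs, which may or may not be what you want.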

There's a longer list of categories here.

Nicholas White answered Nov 07 '22 16:11
