
linux regex matching character ß

Tags:

regex

linux

I am running into something on Linux that I cannot explain. Can anyone tell me why the first regex is not picking up the "ß-carotene" line?

$ cat cmpg
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((3R)-1H-pyrazole-1-propanenitrile
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((R)-1H-pyrazole-1-propanenitrile
ß-carotene  

$ cat cmpg|awk  '/[^\w\s({)}\r\n\[\]],/'
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((3R)-1H-pyrazole-1-propanenitrile
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((R)-1H-pyrazole-1-propanenitrile

$ cat cmpg|awk  '/ß/'
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((3R)-1H-pyrazole-1-propanenitrile
ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)-((R)-1H-pyrazole-1-propanenitrile
ß-carotene

Thanks for the help!

ygu asked Dec 21 '22 04:12


2 Answers

$ cat cmpg|awk  '/[^\w\s({)}\r\n\[\]],/'

only matches lines that contain at least one comma.

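As for why the negated character class matches the 2 in "[2,3-d]" (which puzzled me, because \w covers all ASCII digits, so [^\w...] should fail to match 2): awk uses POSIX extended regular expressions, which don't define the \w (or \s) shorthands, so inside the brackets they don't exclude digits or whitespace. You would need to use the POSIX classes [:alnum:] or [:space:] inside the bracket expression instead, e.g. [^[:alnum:][:space:]] (add _ if you want the underscore that \w would cover).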
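A minimal sketch of the POSIX-class variant, assuming an awk that supports bracket classes such as [:alnum:] (gawk and recent mawk releases do). The sample lines and the names sample and posix_match are made up for illustration, and the class is cut down to just the two shorthands:

```shell
# Hypothetical sample lines, not the asker's file.
sample() {
  printf '%s\n' 'pyrrolo[2,3-d]pyrimidine' 'ß-carotene' 'alpha-,beta'
}

# A comma preceded by a character that is neither alphanumeric nor whitespace.
posix_match() {
  awk '/[^[:alnum:][:space:]],/'
}

# "2," no longer qualifies, because 2 is alphanumeric; "-," does.
sample | posix_match
```

With the classes spelled out, the digit before the comma is excluded as intended, so only a line with a non-alphanumeric, non-space character directly before a comma is printed.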

All in all, that regex is strange in any regex flavor. What are you trying to achieve with it?

Tim Pietzcker answered Jan 07 '23 14:01


$ cat cmpg|awk  '/[^\w\s({)}\r\n\[\]],/'

looks for any line that contains 2 consecutive characters:

  • the first character should NOT ([^) be any of:

    • \w : a "word" character (letters, digits, and underscore)
      • OR a literal w if that awk version doesn't know about the special meaning of \w
    • \s : a whitespace character (could be a lot of things if using Unicode, not just space and tab)
      • OR a literal s if that awk version doesn't know about the special meaning of \s
    • ( : a (
    • { : a {
    • ) : a )
    • } : a }
    • \r : a carriage return
    • \n : a newline (linefeed)
    • \[ : a [
    • \] : a ]
  • the 2nd character HAS to be:

    • , : a , (comma).

The last line does NOT contain a comma, so it cannot match. (The ß itself is not in the list above, so it would qualify as the first character if a comma came right after it.)
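To see that the trailing comma is what decides it, here is a small sketch using a cut-down class with literal characters only. The two sample lines and the names sample_lines, needs_comma and no_comma are hypothetical, chosen just for this illustration:

```shell
# Hypothetical two-line sample modeled on the question's data.
sample_lines() {
  printf '%s\n' 'ß-Cyclopentyl-4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)' 'ß-carotene'
}

# Class followed by a literal comma: a line needs some "X," pair to match.
needs_comma() {
  awk '/[^(){}],/'
}

# Same class without the comma: one character outside the class is enough.
no_comma() {
  awk '/[^(){}]/'
}

sample_lines | needs_comma   # only the line containing "2," is printed
sample_lines | no_comma      # both lines are printed
```

Dropping the comma makes the ß-carotene line match, which suggests the bracket expression was never the obstacle.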

Olivier Dulac answered Jan 07 '23 13:01