Matching a specific Unicode char in a Haskell regexp

This is a Mac OS X related problem!

I have the following three-character Haskell string:

"a\160b"

I want to match and replace the middle character.

Several approaches like

ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\160") "a\160b" "X"
  "*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
ghci> subRegex (mkRegex "\\160") "a\160b" "X"
  "a\160b"

did not yield the desired result.

How do I need to modify the regexp or my environment to replace the '\160' with 'X'?

The problem seems to have its root in the locale/encoding of the input.

bash> locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

I have already modified my .bashrc to export the locale variables, so that locale now reports:

bash> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

But this did not change the behavior at all.

Axel Tetzlaff asked Feb 18 '11 23:02


2 Answers

I was able to reproduce your problem by setting my locale to 'en_US.UTF-8'. (I am also using Mac OS X.)

bash> export LANG=en_US.UTF-8
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))

Setting your locale to 'C' should fix the problem:

bash> export LANG=C
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"aXb"

Unfortunately, I don't have an explanation as to why the locale is causing this problem.

David Powell answered Nov 03 '22 07:11


Is there a specific reason you want to use regular expressions rather than simply using map?

replace :: Char -> Char
replace '\160' = 'X'
replace c      = c

test = map replace "a\160b" == "aXb"

Note that if you want to work with Unicode strings, it's probably easier to use the text package, which is designed to handle Unicode and is more efficient than String for larger strings.
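
For illustration, here is a minimal sketch of the same replacement done with the text package instead of String. The names replaceNbsp and replaceNbsp' are made up for this example, and it assumes the text package is available to your GHC installation. Neither version goes through the POSIX regex library, so the locale should not matter here.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: replace every U+00A0 (written '\160' in Haskell)
-- with "X", using substring replacement from Data.Text.
replaceNbsp :: T.Text -> T.Text
replaceNbsp = T.replace (T.pack "\160") (T.pack "X")

-- The same substitution written per character, mirroring the
-- String-based replace function above.
replaceNbsp' :: T.Text -> T.Text
replaceNbsp' = T.map (\c -> if c == '\160' then 'X' else c)

main :: IO ()
main = do
  let input = T.pack "a\160b"
  putStrLn (T.unpack (replaceNbsp input))
  putStrLn (T.unpack (replaceNbsp' input))
```

T.replace is the closer analogue of subRegex when the thing to replace is a fixed substring, while T.map is enough when it is a single character, as in this question.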

nominolo answered Nov 03 '22 07:11