
How to find and replace unicode chars in Haskell?

I have a Unicode file containing a (Swedish) Wikipedia article in MediaWiki markup. I want to strip all markup from it. In certain cases I want to extract text from the markup tags, such as the link titles from hyperlinks (like a simplified wikiextractor).

My approach is to run a set of regexes over the file to remove markup. In the link example, I need to replace [[link]] with link. This works well with a regex as long as the text does not contain non-ASCII characters such as ö.

Example of what I've tried:

ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Stockholm]]" "\\1"
"Se mer om Stockholm"
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Göteborg]]" "\\1"
"Se mer om [[G\246teborg]]"

Why does this not work? How can I make the regex engine realize that ö is indeed a normal letter (at least in Swedish)?

Edit: The issue does not really seem to sit in the pattern, but in the engine. If I allow all characters except q in the link text, one would expect ö to be allowed. But not so...

ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Goteborg]]" "\\1"
"Goteborg"
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Göteborg]]" "\\1"
"[[G\246teborg]]"
ghci> subRegex (mkRegex "ö") "ö" "q"
"q"
ghci> subRegex (mkRegex "[ö]") "ö" "q"
"\246"

The problem seems to arise specifically when using character classes. The engine matches ö fine on its own.

LudvigH asked Oct 30 '22 05:10



1 Answer

I've now decided to go with Text.Regex.PCRE.Heavy as suggested in this SO Answer written by the author. It solves my problem.

Thus, the solution becomes:

GHCi, version 7.10.3: http://www.haskell.org/ghc/  :? for help
Prelude> :m Text.Regex.PCRE.Heavy
Prelude Text.Regex.PCRE.Heavy> :set -XFlexibleContexts
Prelude Text.Regex.PCRE.Heavy> :set -XQuasiQuotes
Prelude Text.Regex.PCRE.Heavy> gsub [re|\[\[([^\]]*)\]\]|] (\(firstMatch:_) -> firstMatch :: String) "[[Göteborg]]" :: String
"G\246teborg"
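
Note that the `\246` in this last result is not a failure: the brackets are gone, and GHCi merely echoes a String through `show`, which escapes non-ASCII characters. A minimal sketch of that behaviour, using only base (the names `city` and `echoed` are made up for illustration):

```haskell
-- A String is a list of Unicode code points, so 'ö' is one Char (U+00F6 = 246).
city :: String
city = "Göteborg"

-- What GHCi echoes for a String result: the `show` rendering,
-- in which non-ASCII characters appear as decimal escapes.
echoed :: String
echoed = show city
```

Here `echoed` is the text `"G\246teborg"` including the quotes, whereas `putStrLn city` writes the characters themselves, assuming the terminal encoding can represent ö.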

Unfortunately, I still don't know why the POSIX backend cannot handle this while the PCRE backend can.
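
For the simple [[link]] case, a regex engine can also be avoided altogether. The following is only a sketch under a few assumptions: it works on plain String, the names `stripLinks` and `splitAtClose` are hypothetical helpers, piped links such as [[target|label]] are not handled, and unlike the `[^\]]*` pattern above it accepts a single `]` inside a title.

```haskell
import Data.List (isPrefixOf)

-- | Replace every [[title]] with title.
-- An opening "[[" without a matching "]]" is left untouched.
stripLinks :: String -> String
stripLinks [] = []
stripLinks s@(c : rest)
  | "[[" `isPrefixOf` s
  , Just (title, after) <- splitAtClose (drop 2 s) = title ++ stripLinks after
  | otherwise = c : stripLinks rest

-- | Split the input at the first "]]", returning the text before it and
-- the text after it, or Nothing if there is no closing delimiter.
splitAtClose :: String -> Maybe (String, String)
splitAtClose = go []
  where
    go _ [] = Nothing
    go acc xs@(y : ys)
      | "]]" `isPrefixOf` xs = Just (reverse acc, drop 2 xs)
      | otherwise = go (y : acc) ys
```

Because the comparison is done Char by Char, ö is treated like any other character, e.g. `putStrLn (stripLinks "Se mer om [[Göteborg]]")`.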

LudvigH answered Nov 15 '22 07:11
