I have a Unicode file containing a (Swedish) Wikipedia article in MediaWiki markup. I want to strip all markup from it. In certain cases I want to extract text from the markup tags, such as the link titles from hyperlinks (like a simplified wikiextractor).
My approach is to run a set of regexes over the file to remove markup. In the link example, I need to replace [[link]] with link. I manage to do this well with a regex as long as the text does not contain non-ASCII characters such as ö.
Example of what I've tried:
ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Stockholm]]" "\\1"
"Se mer om Stockholm"
ghci> subRegex (mkRegex "\\[\\[([() a-zA-Z]*)\\]\\]") "Se mer om [[Göteborg]]" "\\1"
"Se mer om [[G\246teborg]]"
Why does this not work? How can I make the regex engine realize that ö is indeed a normal letter (at least in Swedish)?
Edit:
The issue does not seem to lie in the pattern, but in the engine. If I allow all characters except q in the link text, one would expect ö to be allowed. But not so:
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Goteborg]]" "\\1"
"Goteborg"
ghci> subRegex (mkRegex "\\[\\[([^q]*)\\]\\]") "[[Göteborg]]" "\\1"
"[[G\246teborg]]"
ghci> subRegex (mkRegex "ö") "ö" "q"
"q"
ghci> subRegex (mkRegex "[ö]") "ö" "q"
"\246"
The problem seems to arise specifically when using character classes: [ö] fails to match, while a literal ö matches fine on its own.
I've now decided to go with Text.Regex.PCRE.Heavy, as suggested in this SO answer written by the author. It solves my problem.
Thus, the solution becomes:
GHCi, version 7.10.3: http://www.haskell.org/ghc/ :? for help
Prelude> :m Text.Regex.PCRE.Heavy
Prelude Text.Regex.PCRE.Heavy> :set -XFlexibleContexts
Prelude Text.Regex.PCRE.Heavy> :set -XQuasiQuotes
Prelude Text.Regex.PCRE.Heavy> gsub [re|\[\[([^\]]*)\]\]|] (\(firstMatch:_) -> firstMatch :: String) "[[Göteborg]]" :: String
"G\246teborg"
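Note that the "G\246teborg" shown above is the correct result: the brackets are gone, and \246 is only how GHCi displays ö. A minimal sketch of that display behaviour, using only base (the names shown, codePoint and printPlain are made up for illustration):

```haskell
import Data.Char (ord)

-- GHCi displays a result with 'show', which escapes non-ASCII characters.
shown :: String
shown = show "Göteborg"

-- The code point of 'ö' is 246, which is where the "\246" escape comes from.
codePoint :: Int
codePoint = ord 'ö'

-- 'putStrLn' writes the characters themselves instead of their escaped form.
printPlain :: IO ()
printPlain = putStrLn "Göteborg"
```

So in the failing POSIX examples the tell-tale sign is the surviving [[ and ]] (or the unreplaced character), not the \246 escape itself.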
Unfortunately, I still don't know why the POSIX backend cannot handle this while the PCRE backend can.
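For the simple [[link]] case, a regex engine may not be needed at all. The following is a rough sketch that uses only base and works directly on Haskell Chars, so non-ASCII letters need no special treatment. The helpers stripLinks and splitAtClose are hypothetical names invented here, not part of any library, and the behaviour differs slightly from the regexes above: the link text runs up to the first ]], and piped links such as [[target|label]] are not handled.

```haskell
import Data.List (isPrefixOf)

-- Replace every "[[title]]" with "title" (hypothetical helper).
-- A "[[" without a closing "]]" is left untouched.
stripLinks :: String -> String
stripLinks [] = []
stripLinks s@(c:rest)
  | "[[" `isPrefixOf` s
  , Just (title, after) <- splitAtClose (drop 2 s) = title ++ stripLinks after
  | otherwise = c : stripLinks rest

-- Split at the first "]]", returning the text before it and the text after it.
-- Returns Nothing when there is no closing "]]".
splitAtClose :: String -> Maybe (String, String)
splitAtClose = go []
  where
    go _ [] = Nothing
    go acc xs@(x:more)
      | "]]" `isPrefixOf` xs = Just (reverse acc, drop 2 xs)
      | otherwise = go (x : acc) more
```

With this loaded in GHCi, stripLinks "Se mer om [[Göteborg]]" evaluates to the string Se mer om Göteborg (displayed with the usual \246 escape unless printed with putStrLn).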