Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting and comparing strings by locales in Haskell?

is it possible to properly sort strings with national characters in Haskell (GHC) ? In other words, correct collation of Chars by current locale settings ?

I did found ICU module only, but it requires extra library to be installed because it isn't a standard part of linux distributions. I would like solution based on POSIX's C (glibc like) library, so there won't be any hassle with handling of additional dependency.

like image 823
David Unric Avatar asked May 15 '11 21:05

David Unric


1 Answers

Recommended way: text-icu

The recommended way for robustly processing strings in a locale-sensitive manner is via text and text-icu, as you have seen. The text library is provided in the standard library set, the Haskell Platform.

An example, sorting Turkish strings:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text.IO  as T 
import Data.Text.ICU as T 
import Data.List     (sortBy)

main = do
  let trLocale = T.Locale "tr-TR"
      str      = "ÇIİĞÖŞÜ"
      strs     = take 10 (cycle $ T.toLower trLocale str : str : [])

  mapM_ T.putStrLn (sortBy (T.compare [T.FoldCaseExcludeSpecialI]) strs)

appears to correctly sort by lexicographic ordering based on locale, after correctly lower-casing the Turkish string:

*Main> main
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
çıiğöşü
çıiğöşü
çıiğöşü
çıiğöşü
çıiğöşü

Not using the text-icu package

You've asked in your question to avoid solutions that use additional libraries, other than what Posix provides. While text-icu is easily installable from Hackage (cabal install text-icu), it does depend on the ICU C library, which isn't available everywhere. Additionally, there is no Posix alternative that is as robust or comprehensive. Finally, text-icu is the only package that correctly does conversions on multi-char characters.

Given this, though, the built in Char and String types in Haskell provide Data.Char, whose values represent Unicode, and with functions that will do Unicode case conversion, in a locale-insensitive way, using the wchar_t functions defined by the Open Group. Additionally, we can do IO on Handles in a (text) locale-sensitive way.

import System.IO  
import Data.Char
import Data.List  (sort)

main = do
    t <- mkTextEncoding "UTF-8"
    hSetEncoding stdout t

    let str      = "ÇIİĞÖŞÜ"
        strs     = take 10 (cycle $ map toLower str : str : [])

    mapM_ putStrLn (sort strs)

In fact, GHC will use your text locale by default for IO (e.g. UTF8). For many problems, this will probably give the right answer. You just have to be aware it will also be wrong in many cases, since it isn't possible to be correct without bulk processing of text, and rich conversion and comparison support.

*Main> main
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
çiiğöşü
çiiğöşü
çiiğöşü
çiiğöşü
çiiğöşü

like image 144
Don Stewart Avatar answered Sep 27 '22 20:09

Don Stewart