
Parse CSV/TSV file in Haskell - Unicode Characters

Tags: csv, haskell

I'm trying to parse a tab-delimited file using cassava (Data.Csv) in Haskell. However, whenever my CSV file contains "strange" (Unicode) characters, parsing fails with an endOfInput error.

According to the command-line tool "file", my file is encoded as "UTF-8 Unicode text". My Haskell code looks like this:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L

import Data.Text.Encoding as E

-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V

import Data.Char -- ord

csvFile :: FilePath
csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
              decDelimiter = fromIntegral (ord '\t')
            }

main :: IO ()
main = do
  csvData <- L.readFile csvFile 
  case EL.decodeUtf8' csvData of 
   Left err -> print err
   Right dat ->
     case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
       Left err -> putStrLn err
       Right v -> V.forM_ v $ \ (category :: String ,
                               user :: String ,
                               date :: String,
                               time :: String,
                               message :: String) -> do
         print message

I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.

My CSV file looks like this:

a   -   -   -   RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a   -   -   -   Uhm .. wat dan ook ????!!!! 👋

Or more literally:

a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋

The problematic characters are the 👋 and • (my complete file contains many more characters like them). What can I do so that cassava / Data.Csv can read my file properly?

EDIT: I've created the following preprocessor, which escapes my Text before I decode it with cassava (see tibbe's answer). There's probably a better way, but so far it works fine!

import qualified Data.Text as T

preprocess :: T.Text -> T.Text
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt

escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""
  | c == '\n' = "\"\n\""
  | c == '\"' = "\"\""
  | otherwise = T.singleton c

Pold asked Sep 30 '22 13:09


1 Answer

Per the cassava documentation:

  • Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.

  • Escaped fields may contain any characters (but double-quotes need to be escaped).

Since the last field in your first record contains double quotes, the whole field must be enclosed in double quotes, and each double quote inside it must be escaped by doubling it, like so:

a   -   -   -   "RT USE "" Kenny"" • Hahahahahahahahaha. #Emmen #Brandstapel"
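
The quoting rule above can be applied mechanically before the data is handed to cassava. The following is only a minimal sketch under a few assumptions: `quoteField` and `quoteTsvLine` are hypothetical helper names (they are not part of cassava), each record sits on a single line, and a tab never occurs inside a field.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T

-- | Enclose one field in double quotes and double every quote inside it.
-- (Hypothetical helper, not a cassava function.)
quoteField :: Text -> Text
quoteField field = T.concat ["\"", T.replace "\"" "\"\"" field, "\""]

-- | Quote every tab-separated field of a single line.
-- Assumes a tab only ever acts as a delimiter, never as field content.
quoteTsvLine :: Text -> Text
quoteTsvLine = T.intercalate "\t" . map quoteField . T.splitOn "\t"
```

A whole file could then be prepared with something like `T.unlines . map quoteTsvLine . T.lines` and encoded with `encodeUtf8` before decoding. This quotes every field rather than only the ones that need it, which is allowed because escaped fields may contain any characters.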

This code works for me:

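{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Lazy (fromStrict)
import Data.Char (ord)
import Data.Csv
import Data.Text.Encoding (encodeUtf8)
import Data.Vector (Vector)

-- Decode one tab-delimited record whose last field is quoted and escaped.
test :: Either String (Vector (String, String, String, String, String))
test = decodeWith
    defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }
    NoHeader
    -- The literal is a Text (via OverloadedStrings), encoded to UTF-8 bytes.
    (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"RT USE \"\" Kenny\"\" • Hahahahahahahahaha. #Emmen #Brandstapel\"")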

(Note that I had to use encodeUtf8 on a literal of type Text rather than writing a ByteString literal directly. The IsString instance for ByteString, which is what converts a string literal to a ByteString, truncates each Unicode code point to 8 bits instead of encoding it as UTF-8.)
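
To make the truncation concrete, here is a small sketch comparing the two conversions for the bullet character U+2022 from the question. It uses `Data.ByteString.Char8.pack`, which truncates characters to 8 bits in the same way as the `IsString` instance; the names `bullet`, `truncated`, and `encoded` are for illustration only.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- | The bullet character from the question, U+2022.
bullet :: String
bullet = "\x2022"

-- | Char8.pack keeps only the low 8 bits of each code point,
-- so U+2022 collapses to the single byte 0x22.
truncated :: B.ByteString
truncated = BC.pack bullet

-- | encodeUtf8 produces the three-byte UTF-8 sequence E2 80 A2.
encoded :: B.ByteString
encoded = encodeUtf8 (T.pack bullet)
```

In GHCi, `B.unpack truncated` gives `[34]` and `B.unpack encoded` gives `[226,128,162]`. Byte 34 (0x22) happens to be the ASCII double quote, so a truncated bullet turns into a stray quote character, which is itself special to a CSV parser.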

tibbe answered Oct 03 '22 07:10