I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, if the file contains "strange" (Unicode) characters, I get a parse error (endOfInput).
According to the command-line tool "file", my file is encoded as "UTF-8 Unicode text". My Haskell code looks like this:
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L
import Data.Text.Encoding as E
-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V
import Data.Char -- ord

csvFile :: FilePath
csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
    decDelimiter = fromIntegral (ord '\t')
  }

main :: IO ()
main = do
    csvData <- L.readFile csvFile
    case EL.decodeUtf8' csvData of
        Left err -> print err
        Right dat ->
            case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
                Left err -> putStrLn err
                Right v -> V.forM_ v $ \ (category :: String,
                                          user :: String,
                                          date :: String,
                                          time :: String,
                                          message :: String) -> do
                    print message
I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.
My CSV file looks like this:
a - - - RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a - - - Uhm .. wat dan ook ????!!!! 👋
Or more literally:
a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋
The problematic characters are the 👋 and • (and in my complete file, there are many more similar characters). What can I do so that cassava / Data.Csv can read my file properly?
EDIT: I've created the following preprocessor for escaping my Text before decoding it with cassava (see tibbe's answer). There's probably a better way, but so far it works fine!
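import qualified Data.Text as T

-- Relies on OverloadedStrings (enabled above) for the Text literals.
-- Wrap the whole input in quotes; the escaper closes and reopens the
-- quotes around every tab and newline, so each field ends up quoted.
preprocess :: T.Text -> T.Text
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt

escaper :: Char -> T.Text
escaper c
    | c == '\t' = "\"\t\""
    | c == '\n' = "\"\n\""
    | c == '\"' = "\"\""        -- double any embedded double quote
    | otherwise = T.singleton c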
Per the cassava documentation:
Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.
Escaped fields may contain any characters (but double-quotes need to be escaped).
Since the last field in your first record contains double quotes, the field needs to be wrapped in double quotes, and any double quotes inside it need to be escaped by doubling them, like so:
a - - - "RT USE "" Kenny"" • Hahahahahahahahaha. #Emmen #Brandstapel"
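As a rough sketch of that quoting rule, the snippet below wraps every field of one tab-separated line in double quotes and doubles any double quotes inside it. The names quoteField and quoteLine are hypothetical helpers made up for illustration, not part of cassava, and the sketch assumes that no field itself contains a tab or a newline.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical helper: wrap one field in double quotes and double
-- every double quote that appears inside it.
quoteField :: T.Text -> T.Text
quoteField field = T.concat ["\"", T.replace "\"" "\"\"" field, "\""]

-- Hypothetical helper: quote each tab-separated field of a single line.
-- Assumes the fields themselves contain no tabs or newlines.
quoteLine :: T.Text -> T.Text
quoteLine = T.intercalate "\t" . map quoteField . T.splitOn "\t"

main :: IO ()
main = TIO.putStrLn (quoteLine "a\t-\tRT USE \" Kenny\" Hahaha")
```

Applying quoteLine to each line of the decoded Text before re-encoding it with encodeUtf8 should give cassava input in which every field is an escaped field.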
This code works for me:
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy
import Data.Char
import Data.Csv
import Data.Text.Encoding
import Data.Vector

test :: Either String (Vector (String, String, String, String, String))
test = decodeWith
    defaultDecodeOptions {decDelimiter = fromIntegral $ ord '\t' }
    NoHeader
    (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"RT USE \"\" Kenny\"\" • Hahahahahahahahaha. #Emmen #Brandstapel\"")
(Note that I had to make sure to use encodeUtf8 on a literal of type Text rather than just using a ByteString literal directly. The IsString instance for ByteStrings, which is what's used to convert the literal to a ByteString, truncates each Unicode code point.)
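To make the truncation concrete, here is a small sketch that compares the two conversions for the bullet character (U+2022) from the question. It uses Data.ByteString.Char8.pack as a stand-in for the IsString instance, on the assumption that both truncate in the same way; the names bullet, truncated and encoded are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- '\x2022' is the bullet character from the question.
bullet :: String
bullet = "\x2022"

-- Char8.pack keeps only the low 8 bits of each code point.
truncated :: B.ByteString
truncated = BC.pack bullet

-- Going through Text and encodeUtf8 yields the real UTF-8 bytes.
encoded :: B.ByteString
encoded = TE.encodeUtf8 (T.pack bullet)

main :: IO ()
main = do
    print (B.unpack truncated)  -- [34], the byte for a double quote
    print (B.unpack encoded)    -- [226,128,162]
```

Note that the truncated bullet becomes the byte 0x22, which is a double quote, so a truncated literal can end up with stray quote characters that were never in the text.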