
Parsec fails without error if reading from file

I wrote a small Parsec parser that reads samples from a user-supplied input string or from an input file. When the input is given as a semicolon-separated string, it fails properly on wrong input with a useful error message:

> readUncalC14String "test1,7444,37;6800,36;testA,testB,2000,222;test3,7750,40"
*** Exception: Error in parsing dates from string: (line 1, column 29):
unexpected "t"
expecting digit

But it fails silently for the input file inputFile.txt with identical entries:

test1,7444,37
6800,36
testA,testB,2000,222
test3,7750,40

> readUncalC14FromFile "inputFile.txt"
[UncalC14 "test1" 7444 37,UncalC14 "unknownSampleName" 6800 36]

Why is that and how can I make readUncalC14FromFile fail in a useful manner as well?

Here is a minimal subset of my code:

import qualified Text.Parsec                    as P
import qualified Text.Parsec.String             as P

data UncalC14 = UncalC14 String Int Int deriving Show

readUncalC14FromFile :: FilePath -> IO [UncalC14]
readUncalC14FromFile uncalFile = do
    s <- readFile uncalFile
    case P.runParser uncalC14SepByNewline () "" s of
        Left err -> error $ "Error in parsing dates from file: " ++ show err
        Right x -> return x
    where
        uncalC14SepByNewline :: P.Parser [UncalC14]
        uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces)

readUncalC14String :: String -> Either String [UncalC14]
readUncalC14String s = 
    case P.runParser uncalC14SepBySemicolon () "" s of
        Left err -> error $ "Error in parsing dates from string: " ++ show err
        Right x -> Right x
    where 
        uncalC14SepBySemicolon :: P.Parser [UncalC14]
        uncalC14SepBySemicolon = P.sepBy parseOneUncalC14 (P.char ';' <* P.spaces)

parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = do
    P.try long P.<|> short
    where
        long = do
            name <- P.many (P.noneOf ",")
            _ <- P.oneOf ","
            mean <- read <$> P.many1 P.digit
            _ <- P.oneOf ","
            std <- read <$> P.many1 P.digit
            return (UncalC14 name mean std)
        short = do
            mean <- read <$> P.many1 P.digit
            _ <- P.oneOf ","
            std <- read <$> P.many1 P.digit
            return (UncalC14 "unknownSampleName" mean std)

asked by nevrome, Jul 07 '21 08:07


1 Answer

What is happening here is that a prefix of your input is itself valid input: the parser succeeds on the first two lines and simply stops there, leaving the rest unread. To force Parsec to consume the whole input, add the eof parser:

uncalC14SepByNewline = P.endBy parseOneUncalC14 (P.newline <* P.spaces) <* P.eof
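
To see the effect without touching the file system, here is a minimal sketch that applies the same fix to a pure function. The name parseUncalLines is hypothetical (it is not in the question), it returns an Either instead of calling error, and it assumes that every entry, including the last one, is followed by a newline. The entry parser is copied from the question.

```haskell
import qualified Text.Parsec        as P
import qualified Text.Parsec.String as P

data UncalC14 = UncalC14 String Int Int deriving Show

-- Hypothetical pure helper (not part of the question's code).
-- Assumes every entry, including the last, is terminated by a newline.
-- The trailing P.eof makes the parse fail if any input is left over.
parseUncalLines :: String -> Either P.ParseError [UncalC14]
parseUncalLines = P.runParser (entries <* P.eof) () ""
  where
    entries = P.endBy parseOneUncalC14 (P.newline <* P.spaces)

-- Entry parser as given in the question.
parseOneUncalC14 :: P.Parser UncalC14
parseOneUncalC14 = P.try long P.<|> short
  where
    long = do
        name <- P.many (P.noneOf ",")
        _    <- P.oneOf ","
        mean <- read <$> P.many1 P.digit
        _    <- P.oneOf ","
        std  <- read <$> P.many1 P.digit
        return (UncalC14 name mean std)
    short = do
        mean <- read <$> P.many1 P.digit
        _    <- P.oneOf ","
        std  <- read <$> P.many1 P.digit
        return (UncalC14 "unknownSampleName" mean std)
```

With the contents of inputFile.txt from the question, this should now give a Left whose error is located on line 3 rather than a list of two entries, while a file containing only the first two lines still parses to the same two entries as before.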

The reason the string version reports the error while the file version does not is the difference between sepBy and endBy. Here is a simpler example:

sepTest, endTest :: String -> Either P.ParseError String
sepTest s = P.runParser (P.sepBy (P.char 'a') (P.char 'b')) () "" s
endTest s = P.runParser (P.endBy (P.char 'a') (P.char 'b')) () "" s

Here are some interesting examples:

ghci> sepTest "abababb"
Left (line 1, column 7):
unexpected "b"
expecting "a"

ghci> endTest "abababb"
Right "aaa"

ghci> sepTest "ababaa"
Right "aaa"

ghci> endTest "ababaa"
Left (line 1, column 6):
unexpected "a"
expecting "b"

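As you can see, both sepBy and endBy can stop silently and return a partial result. sepBy does so when the input stops matching at a point where it expects the separator b (the consumed prefix ends in the item a). endBy does so when the input stops matching at a point where it expects the item a (the consumed prefix ends in the separator b). In the opposite cases the combinator has already consumed part of the next element, so the failure is reported.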
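
The rule behind this, as far as I understand Parsec's semantics, is that a repetition stops quietly when the repeated parser fails without consuming any input, and reports the error when it fails after consuming some. A small sketch with made-up names (pair, pairs and pairsTry are only for illustration):

```haskell
import qualified Text.Parsec        as P
import qualified Text.Parsec.String as P

-- One element is the character 'a' followed by 'b'; the result keeps the 'b'.
pair :: P.Parser Char
pair = P.char 'a' *> P.char 'b'

-- Repeats pair. If pair fails after it has consumed the 'a',
-- the whole parse fails with an error.
pairs :: String -> Either P.ParseError String
pairs = P.runParser (P.many pair) () ""

-- Same, but P.try rewinds a half-read pair, so that failure counts as
-- "nothing consumed" and the repetition just stops.
pairsTry :: String -> Either P.ParseError String
pairsTry = P.runParser (P.many (P.try pair)) () ""
```

Here pairs "abc" is Right "b" (the c is silently left unread), pairs "abac" is a Left because the second a was already consumed, and pairsTry "abac" is Right "b" again. The entry parser in the question uses P.try on its first alternative and its second alternative fails on the first character, so a bad line makes it fail without consuming anything, which is exactly the silent case for endBy.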

So you should use eof after both parsers if you want to make sure you read the whole file/string.
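
For the two small test parsers this could look as follows. sepTestEof and endTestEof are hypothetical names for the same parsers with eof appended:

```haskell
import qualified Text.Parsec as P

-- The sepBy/endBy test parsers from above, but each one must now
-- reach the end of the input to succeed.
sepTestEof, endTestEof :: String -> Either P.ParseError String
sepTestEof s = P.runParser (P.sepBy (P.char 'a') (P.char 'b') <* P.eof) () "" s
endTestEof s = P.runParser (P.endBy (P.char 'a') (P.char 'b') <* P.eof) () "" s
```

The two inputs that were previously accepted with leftovers, sepTestEof "ababaa" and endTestEof "abababb", now both give a Left, while well-formed inputs such as sepTestEof "ababa" and endTestEof "ababab" still give Right "aaa".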

answered by Noughtmare, Oct 20 '22 03:10
