So, I'm writing a small parser that will extract the content of every <td> tag with a specific class, like this one: <td class="liste">some content</td> --> Right "some content"
I will be parsing a large HTML file, but I don't really care about all the noise, so the idea was to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td>, and return the content string.
This works fine if the last element in the file is my td.liste tag, but if there is some trailing text after it, my parser consumes that text up to eof and throws unexpected end of input when you execute parseMyTest test3.
See the end of test3 to understand the edge case.
Here is my code so far:
import Text.Parsec
import Text.Parsec.String
import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)

colOP :: Parser String
colOP = string "<td class=\"liste\">"

colCL :: Parser String
colCL = string "</td>"

col :: Parser String
col = do
  manyTill anyChar (try colOP)
  content <- manyTill anyChar $ try colCL
  return content

cols :: Parser [String]
cols = many col

test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"

test2 :: String
test2 = read $ show $ pack test1

test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"

parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test

btos :: ByteString -> String
btos = read . show
I created a combinator skipTill p end which applies p until end matches and then returns what end returns.
By contrast, manyTill p end applies p until end matches and then returns what the p parsers matched.
import Text.Parsec
import Text.Parsec.String

skipTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
skipTill p end = scan
  where
    scan = end <|> do { p; scan }

td :: Parser String
td = do
  string "("
  manyTill anyChar (try (string ")"))

tds = do
  r <- many (try (skipTill anyChar (try td)))
  many anyChar -- discard stuff at end
  return r

test1 = parse tds "" "111(abc)222(def)333" -- Right ["abc", "def"]
test2 = parse tds "" "111"                 -- Right []
test3 = parse tds "" "111(abc"             -- Right []
test4 = parse tds "" "111(abc)222(de"      -- Right ["abc"]
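The td parser above uses parentheses as a stand-in for the real tags. As a rough sketch of how the same idea might carry over to the <td class="liste"> case from the question (the names listeCell and listeCells are made up for this illustration, and skipTill is repeated with a simplified type so the snippet stands alone):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Same combinator as above, specialised to Parser for brevity.
skipTill :: Parser a -> Parser end -> Parser end
skipTill p end = scan
  where
    scan = end <|> (p >> scan)

-- Hypothetical name: one <td class="liste">...</td> cell, returning its content.
listeCell :: Parser String
listeCell = do
  _ <- string "<td class=\"liste\">"
  manyTill anyChar (try (string "</td>"))

-- Hypothetical name: every matching cell, ignoring everything in between.
listeCells :: Parser [String]
listeCells = do
  cells <- many (try (skipTill anyChar (try listeCell)))
  _ <- many anyChar -- discard whatever follows the last cell
  return cells

-- parse listeCells "" "<p>x</p><td class=\"liste\">A</td>tail"  -- Right ["A"]
```

The inner try is needed because string can fail after consuming part of its input (for example the "<" of "<html>"), and the outer try rewinds the final skipTill attempt that runs into the end of the input.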
Update
This also appears to work:
tds' = scan
  where scan = (eof >> return [])
               <|> do { r <- try td; rs <- scan; return (r:rs) }
               <|> do { anyChar; scan }
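Both versions lean on try to undo partial consumption. For context, here is a small sketch of what seems to go wrong in the question's cols = many col: in Parsec, many fails if its argument fails after consuming input, and a parser that starts with manyTill anyChar will consume any trailing text before hitting the end of the input. The parser item is a hypothetical miniature of the question's col, using parentheses instead of tags.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical miniature of 'col': skip to the next "(", return the text up to ")".
item :: Parser String
item = do
  _ <- manyTill anyChar (try (string "("))
  manyTill anyChar (try (string ")"))

-- The second 'item' consumes the trailing "x", then fails at the end of the
-- input, so 'many' reports the failure instead of stopping: a Left result.
withoutTry :: Either ParseError [String]
withoutTry = parse (many item) "" "(a)x"

-- 'try' rewinds the failed attempt, so 'many' stops cleanly: Right ["a"].
withTry :: Either ParseError [String]
withTry = parse (many (try item)) "" "(a)x"
```

With the input "(a)" and no trailing text, the second item fails without consuming anything, which matches the observation that the original parser works when the matching tag is the last thing in the file.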
I spent some time looking for an eof combinator.
There is endOfInput (that is the name attoparsec uses; Parsec itself calls it eof), though it is not documented as eof, so it is easy to miss if you rely only on search.
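For Parsec specifically, a minimal sketch of end-of-input detection with eof from Text.Parsec (the name allDigits is invented for this example):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Succeeds only if the digits run all the way to the end of the input.
allDigits :: Parser String
allDigits = many1 digit <* eof

okExample :: Either ParseError String
okExample = parse allDigits "" "123"     -- Right "123"

badExample :: Either ParseError String
badExample = parse allDigits "" "123abc" -- Left: input remains after the digits
```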