I am having some trouble using the conduit lib on to split Text by line.
The source data I'm working with, unfortunately, is extremely inconsistent with line endings, containing both \r\n
and \n
sequences in the same file.
I have found the lines
function in Data.Conduit.Binary
, but it "splits" by single byte, (\n
, sensibly enough), which leaves me with a trailing \r
in some cases.
I understand why the current implementation works the way it does, and I am mostly confident I could hack some kind of a solution together, but the only way I could think of to do this would be something like:
lines' = do
loop $ T.pack ""
where loop acc = do
char <- await
case char of
Nothing -> return ()
Just x -> do
case (isOver $ acc `T.append` x) of
(True,y) -> yield y
(False,y) -> loop y
where isOver n
| (T.takeEnd 2 n == _rLn) = (True, T.dropEnd 2 n)
| (T.takeEnd 1 n == _Ln) = (True, T.dropEnd 1 n)
| otherwise = (False,n)
where _rLn = T.pack $! "\r\n"
_Ln = T.pack $! "\n"
... which seems inelegant, kludgy, and terribly slow.
It feels wrong to check the last two characters at every iteration, since all I really need to do is "remember" if the last character I read was \r
, but I can't think of a sensible way to do that.
Is anyone aware of a better solution to this problem?
Don't try to reinvent the wheel! We can still make something prettier using conduit-combinators
. As a foreword, if your \r
values never occur except sometimes before newlines, you can just filter them out straight up. That said, I am going to assume your case is more general - you only want to get rid of \r
values that are followed by \n
.
The idea is to use slidingWindowC
to get two character chunks, then to map these chunks to their first character - unless the characters are "\r\n"
, in which case we drop both. Then, having removed all \r
that followed by newlines, we can use the conduit linesUnboundedC
.
{-# LANGUAGE TypeFamilies, FlexibleContexts #-}
import Data.Text (Text, singleton, empty)
import Data.MonoTraversable (Element, MonoFoldable)
import Conduit
main = runConduitRes $ (sourceFile "file.txt" :: Producer (ResourceT IO) Text)
.| linesUnboundedC'
.| printC
-- | Converted a chunked input of characters into lines delimited by \n or \r\n
linesUnboundedC'
:: (Element a ~ Char, MonoFoldable a, Monad m) => ConduitM a Text m ()
linesUnboundedC' = concatMapC id
.| slidingWindowC 2
.| mapC (\cs@[c,_] -> if cs == "\r\n" then empty else singleton c)
.| linesUnboundedC
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With