Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting by lines with Conduit on Windows

I am having some trouble using the conduit lib on to split Text by line.

The source data I'm working with, unfortunately, is extremely inconsistent with line endings, containing both \r\n and \n sequences in the same file.

I have found the lines function in Data.Conduit.Binary, but it "splits" by single byte, (\n, sensibly enough), which leaves me with a trailing \rin some cases.

I understand why the current implementation works the way it does, and I am mostly confident I could hack some kind of a solution together, but the only way I could think of to do this would be something like:

lines' = do
   loop $ T.pack ""
   where loop acc = do
         char <- await
         case char of
            Nothing -> return ()
            Just x -> do
            case (isOver $ acc `T.append` x) of
                    (True,y) -> yield y
                    (False,y) -> loop y
                    where isOver n
                           |  (T.takeEnd 2 n == _rLn)  = (True, T.dropEnd 2 n)
                           |  (T.takeEnd 1 n == _Ln)   = (True, T.dropEnd 1 n)
                           |  otherwise                =  (False,n)
                           where _rLn = T.pack $! "\r\n"
                                 _Ln = T.pack $! "\n"

... which seems inelegant, kludgy, and terribly slow.

It feels wrong to check the last two characters at every iteration, since all I really need to do is "remember" if the last character I read was \r, but I can't think of a sensible way to do that.

Is anyone aware of a better solution to this problem?

like image 443
Iron Gremlin Avatar asked Dec 23 '22 21:12

Iron Gremlin


1 Answers

Don't try to reinvent the wheel! We can still make something prettier using conduit-combinators. As a foreword, if your \r values never occur except sometimes before newlines, you can just filter them out straight up. That said, I am going to assume your case is more general - you only want to get rid of \r values that are followed by \n.

The idea is to use slidingWindowC to get two character chunks, then to map these chunks to their first character - unless the characters are "\r\n", in which case we drop both. Then, having removed all \r that followed by newlines, we can use the conduit linesUnboundedC.

{-# LANGUAGE TypeFamilies, FlexibleContexts #-}

import Data.Text (Text, singleton, empty)
import Data.MonoTraversable (Element, MonoFoldable)
import Conduit

main = runConduitRes $ (sourceFile "file.txt" :: Producer (ResourceT IO) Text)
                    .| linesUnboundedC'
                    .| printC

-- | Converted a chunked input of characters into lines delimited by \n or \r\n
linesUnboundedC'
  :: (Element a ~ Char, MonoFoldable a, Monad m) => ConduitM a Text m ()
linesUnboundedC' = concatMapC id
                .| slidingWindowC 2
                .| mapC (\cs@[c,_] -> if cs == "\r\n" then empty else singleton c)
                .| linesUnboundedC
like image 120
Alec Avatar answered Jan 03 '23 06:01

Alec