
Data.ByteString.Lazy.Char8 newline conversion on Windows---is the documentation misleading?

Tags:

haskell

I have a question about the Data.ByteString.Lazy.Char8 module in the bytestring package. Specifically, my question concerns the readFile function, which is documented as follows:

Read an entire file lazily into a ByteString. Use 'text mode' on Windows to interpret newlines

I'm interested in the claim that this function will 'use text mode on Windows to interpret newlines'. The source code for the function is as follows:

-- | Read an entire file /lazily/ into a 'ByteString'. Use 'text mode'
-- on Windows to interpret newlines
readFile :: FilePath -> IO ByteString
readFile f = openFile f ReadMode >>= hGetContents

and we see that, in one sense, the claim in the documentation is perfectly true: the openFile function (as opposed to openBinaryFile) has been used, and so newline conversion will be enabled for the file.
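To check which mode a handle is actually in, one can ask it for its text encoding: hGetEncoding returns Nothing for a handle in binary mode. The following is a minimal sketch, not part of the bytestring source; the helper names isTextModeHandle and compareOpenModes are made up for illustration, and the file passed in is assumed to exist already.

```haskell
import System.IO

-- Hypothetical helper: True when the handle has a text encoding attached,
-- i.e. it is in text mode rather than binary mode.
isTextModeHandle :: Handle -> IO Bool
isTextModeHandle h = do
  enc <- hGetEncoding h   -- Nothing means the handle is in binary mode
  return (maybe False (const True) enc)

-- Open the same (already existing) file with openFile semantics and with
-- openBinaryFile semantics, and report whether each handle is in text mode.
compareOpenModes :: FilePath -> IO (Bool, Bool)
compareOpenModes path = do
  viaOpenFile       <- withFile       path ReadMode isTextModeHandle
  viaOpenBinaryFile <- withBinaryFile path ReadMode isTextModeHandle
  return (viaOpenFile, viaOpenBinaryFile)
```

Calling compareOpenModes on any readable file should show that only the handle obtained in the openFile style carries an encoding.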

But the file is then passed to hGetContents. This calls Data.ByteString.hGetNonBlocking, which is meant to be a non-blocking version of Data.ByteString.hGet; and, finally, Data.ByteString.hGet calls GHC.IO.Handle.hGetBuf. This function's documentation says that

hGetBuf ignores whatever TextEncoding the Handle is currently using, and reads bytes directly from the underlying IO device.

which suggests that the fact that the file was opened using openFile rather than openBinaryFile is irrelevant: the data will be read without transforming newlines, notwithstanding the claim in the documentation referred to at the beginning of the question.

So, the nub of the question: am I missing something? Is there some sense in which the statement that 'Data.ByteString.Lazy.Char8.readFile uses text mode on Windows to interpret newlines' is true? Or is the documentation just misleading?

P.S. Testing also indicates that this function, at least when used naively as I was using it, does no newline conversion on Windows.
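A platform-independent way to reproduce this observation is to force newline translation on explicitly and compare a String read with a ByteString read of the same file. This is only a sketch under stated assumptions: the helper names writeSample, readAsString and readAsBytes are hypothetical, strict ByteStrings are used to keep handle lifetimes simple, and universalNewlineMode stands in for what text mode would do on Windows.

```haskell
import qualified Data.ByteString.Char8 as B
import Control.Exception (evaluate)
import System.IO

-- Write two CRLF-terminated lines as raw bytes (binary handle, no translation).
writeSample :: FilePath -> IO ()
writeSample path =
  withBinaryFile path WriteMode $ \h -> B.hPut h (B.pack "one\r\ntwo\r\n")

-- Read through the Char-based System.IO layer, which honours the NewlineMode.
readAsString :: FilePath -> IO String
readAsString path = withFile path ReadMode $ \h -> do
  hSetEncoding h latin1                    -- fixed encoding, independent of locale
  hSetNewlineMode h universalNewlineMode   -- translate "\r\n" to "\n" on input
  s <- hGetContents h
  _ <- evaluate (length s)                 -- force the lazy read before the handle closes
  return s

-- Read through the ByteString layer on an identically configured handle.
readAsBytes :: FilePath -> IO B.ByteString
readAsBytes path = do
  h <- openFile path ReadMode
  hSetNewlineMode h universalNewlineMode   -- set, but not applied to byte reads
  B.hGetContents h                         -- strict read; closes the handle when done
```

With the sample file above, readAsString should return the text with plain "\n" line endings, while readAsBytes should still contain every "\r".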

circular-ruin asked Jul 26 '11 23:07



2 Answers

FWIW, the package maintainer, Duncan Coutts, responded with some very helpful and enlightening remarks. I've asked for his permission to post them here, but in the interim here is a paraphrase.

The basic point is that the documentation was once correct, but now probably is not. In particular, when one opens a file on Windows, the operating system itself lets you open it in 'text' or 'binary' mode. The difference between openFile and openBinaryFile used to be that, on Win32, one would open the file in the OS's text mode and the other in binary mode. (They would both do the same on POSIX.) Critically, if you opened a file in the OS's text mode, there was no way you could read from the file without newline conversion: it always happened.

When things were set up like this, the documentation referred to in the question was correct: Data.ByteString.Lazy.Char8.readFile would use System.IO.openFile; this would tell the OS to open the file in text mode, and newlines would be converted, even though hGetBuf was being used.

Later, Haskell's System.IO was reworked to make its handling of newlines more flexible. The aim was to let Haskell programs running on POSIX systems, where the OS offers no newline translation of its own, still read files with Windows-style newlines; or, more accurately, to support Python-style 'universal' newline conversion on both kinds of OS. This meant that:

  1. The handling of newlines was brought into the Haskell libraries;
  2. Files are always opened in binary mode on Windows, whether you use openFile or openBinaryFile; and
  3. Instead, the choice between openFile and openBinaryFile would affect whether System.IO's library code was set up to be in nativeNewlineMode or noNewlineTranslation. This would then cause the Haskell library code to do the appropriate newline conversion for you. You could now also choose to ask for universalNewlineMode.
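The effect of these modes on the Char-based reading functions can be sketched as follows. The helpers writeRaw and readWithMode are hypothetical names introduced only for this illustration; the modes themselves (noNewlineTranslation, nativeNewlineMode, universalNewlineMode) and hSetNewlineMode come from System.IO.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Write a String through a binary handle, so no newline translation is applied.
writeRaw :: FilePath -> String -> IO ()
writeRaw path str = withBinaryFile path WriteMode $ \h -> hPutStr h str

-- Read a whole file as a String under an explicitly chosen NewlineMode.
readWithMode :: NewlineMode -> FilePath -> IO String
readWithMode mode path = withFile path ReadMode $ \h -> do
  hSetEncoding h latin1        -- fixed encoding, independent of locale
  hSetNewlineMode h mode       -- e.g. noNewlineTranslation or universalNewlineMode
  s <- hGetContents h
  _ <- evaluate (length s)     -- force the lazy read before the handle closes
  return s
```

For a file containing CRLF line endings, readWithMode noNewlineTranslation should return the "\r" characters untouched, whereas readWithMode universalNewlineMode should collapse each "\r\n" to "\n"; nativeNewlineMode behaves like one or the other depending on the platform.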

This was at about the same time as Haskell got proper encoding support built in to System.IO (rather than assuming latin-1 on input and simply truncating output Chars to their first 8 bits). Overall, it was a Good Thing.

But, critically, the new newline conversion, now built into the libraries, never affects what hGetBuf and hPutBuf do, presumably because the people building the new System.IO functionality thought that if one was reading the file in a binary way, any newline conversion interposing itself was probably not What the Programmer Wanted, i.e. was a mistake. And indeed, it probably is in 99% of cases; but in this case, it causes the problem above :-)

Duncan says that the docs will probably change to reflect this brave new world in future releases of the library. In the interim, there is a workaround listed in another answer to this question.
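The workaround from the other answer is not reproduced here. As one possible approach, and only as a sketch, the conversion can be done by hand after reading the raw bytes: the hypothetical function crlfToLf below rewrites each "\r\n" pair to "\n" in a lazy Char8 ByteString and leaves any lone "\r" alone.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical helper: replace every "\r\n" with "\n", leaving lone '\r' bytes intact.
crlfToLf :: BL.ByteString -> BL.ByteString
crlfToLf bs = BL.intercalate (BL.singleton '\n') (go (BL.split '\n' bs))
  where
    -- Every piece except the last was followed by '\n' in the input,
    -- so a trailing '\r' on it was part of a "\r\n" pair.
    go []          = []
    go [lastPiece] = [lastPiece]
    go (p : ps)    = dropTrailingCR p : go ps

    dropTrailingCR p
      | not (BL.null p) && BL.last p == '\r' = BL.init p
      | otherwise                            = p
```

It could then be applied as fmap crlfToLf (BL.readFile path), at the cost of inspecting the end of each line.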

circular-ruin answered Oct 18 '22 06:10



Digging one more layer into the source shows it does read raw bytes:

-- | 'hGetBuf' @hdl buf count@ reads data from the handle @hdl@
-- into the buffer @buf@ until either EOF is reached or
-- @count@ 8-bit bytes have been read.
-- It returns the number of bytes actually read.  This may be zero if
-- EOF was reached before any data was read (or if @count@ is zero).
--
-- 'hGetBuf' never raises an EOF exception, instead it returns a value
-- smaller than @count@.
--
-- If the handle is a pipe or socket, and the writing end
-- is closed, 'hGetBuf' will behave as if EOF was reached.
--
-- 'hGetBuf' ignores the prevailing 'TextEncoding' and 'NewlineMode'
-- on the 'Handle', and reads bytes directly.

hGetBuf :: Handle -> Ptr a -> Int -> IO Int
hGetBuf h ptr count
  | count == 0 = return 0
  | count <  0 = illegalBufferSize h "hGetBuf" count
  | otherwise = 
      wantReadableHandle_ "hGetBuf" h $ \ h_@Handle__{..} -> do
         flushCharReadBuffer h_
         buf@Buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
            <- readIORef haByteBuffer
         if isEmptyBuffer buf
            then bufReadEmpty    h_ buf (castPtr ptr) 0 count
            else bufReadNonEmpty h_ buf (castPtr ptr) 0 count
Chris Kuklewicz answered Oct 18 '22 06:10
