I have a question about the Data.ByteString.Lazy.Char8 module in the bytestring package. Specifically, my question concerns the readFile function, which is documented as follows:
Read an entire file lazily into a ByteString. Use 'text mode' on Windows to interpret newlines
I'm interested in the claim that this function will 'use text mode on Windows to interpret newlines'. The source code for the function is as follows:
-- | Read an entire file /lazily/ into a 'ByteString'. Use 'text mode'
-- on Windows to interpret newlines
readFile :: FilePath -> IO ByteString
readFile f = openFile f ReadMode >>= hGetContents
and we see that, in one sense, the claim in the documentation is perfectly true: the openFile function (as opposed to openBinaryFile) has been used, and so newline conversion will be enabled for the file.
But the file handle will then be passed to hGetContents. This will call Data.ByteString.hGetNonBlocking (see the source code), which is meant to be a non-blocking version of Data.ByteString.hGet (see the documentation); and (finally) Data.ByteString.hGet calls GHC.IO.Handle.hGetBuf (see the documentation or the source code). This function's documentation says that
hGetBuf ignores whatever TextEncoding the Handle is currently using, and reads bytes directly from the underlying IO device.
which suggests that the fact that we opened the file using openFile rather than openBinaryFile is irrelevant: the data will be read without transforming newlines, notwithstanding the claim in the documentation referred to at the beginning of the question.
So, the nub of the question: am I missing something? Is there some sense in which the statement that 'Data.ByteString.Lazy.Char8.readFile uses text mode on Windows to interpret newlines' is true? Or is the documentation just misleading?
P.S. Testing also indicates that this function, at least when used naively as I was using it, does no newline conversion on Windows.
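A check along these lines can be sketched as follows. This is only a minimal sketch: the helper name readLazyThroughTextHandle is made up for illustration, and it requests universalNewlineMode explicitly so that the outcome does not depend on the platform's native newline mode.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO

-- Hypothetical helper (not part of bytestring): open the file with openFile,
-- i.e. a text-mode handle, ask explicitly for CRLF -> LF translation, and
-- then read through the lazy ByteString API.
readLazyThroughTextHandle :: FilePath -> IO L.ByteString
readLazyThroughTextHandle path = do
  h <- openFile path ReadMode
  hSetNewlineMode h universalNewlineMode
  L.hGetContents h  -- the handle is closed once the contents are consumed
```

If the file holds the bytes of "a\r\nb\r\n", the ByteString returned here still contains both '\r' characters, even though the handle was asked to translate newlines.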
FWIW, the package maintainer, Duncan Coutts, responded with some very helpful and enlightening remarks. I've asked for his permission to post them here, but in the interim here is a paraphrase.
The basic point is that the documentation was once correct, but now probably is not. In particular, when one opens a file on Windows, the operating system itself lets you open it in 'text' or 'binary' mode. The difference between openFile and openBinaryFile used to be that one would open the file in the OS's text mode and the other in binary mode on Win32. (They would both do the same on POSIX.) Critically, if you opened a file in the OS's text mode, there was no way you could read from the file without newline conversion: it always happened.
When things were set up like this, the documentation referred to in the question was correct: Data.ByteString.Lazy.Char8.readFile would use System.IO.openFile; this would tell the OS to open the file in text mode, and newlines would be converted, even though hGetBuf was being used.
Then, later, Haskell's System.IO was souped up to make its handling of newlines more flexible. Specifically, the aim was to allow Haskell on POSIX OSs, where the OS has no built-in functionality for reading files with newline mangling, nonetheless to support reading files with Windows-style newlines; or, more accurately, to support Python-style 'universal' newline conversion on both OSs. This meant that:
- files are now always opened in the OS's binary mode, regardless of whether you use openFile or openBinaryFile; and
- the choice between openFile and openBinaryFile instead affects whether System.IO's library code is set up to be in nativeNewlineMode or noNewlineTranslation. The Haskell library code then does the appropriate newline conversion for you. You can now also choose to ask for universalNewlineMode.
This was at about the same time as Haskell got proper encoding support built in to System.IO (rather than assuming latin-1 on input and simply truncating output Chars to their first 8 bits). Overall, it was a Good Thing.
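To make the library-level newline modes concrete, here is a minimal sketch that reads a file as a String under a chosen NewlineMode. The helper name readWithNewlineMode is hypothetical; hSetNewlineMode, universalNewlineMode and noNewlineTranslation are the real System.IO names.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as a String under a given NewlineMode.
readWithNewlineMode :: NewlineMode -> FilePath -> IO String
readWithNewlineMode mode path =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h mode
    s <- hGetContents h
    -- Force the lazy String before withFile closes the handle.
    length s `seq` return s
```

For a file containing "a\r\nb\r\n", reading with universalNewlineMode gives "a\nb\n", while noNewlineTranslation gives the text back with its '\r' characters intact. This conversion applies to the Char-based functions of System.IO, on any OS.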
But, critically, the new newline conversion, now built in to the libraries, never affects what hGetBuf does, presumably because the people building the new System.IO functionality thought that if one was reading the file in a binary way, any newline conversion interposing itself was probably not What the Programmer Wanted, i.e. was a mistake. And indeed, it probably is in 99% of cases: but in this case, it causes the problem above :-)
Duncan says that the docs will probably change to reflect this brave new world in future releases of the library. In the interim, there is a workaround listed in another answer to this question.
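One possible workaround, not necessarily the one given in the other answer, is to do the conversion yourself after reading. The sketch below assumes that only "\r\n" pairs should be collapsed and that a lone '\r' should be kept; both function names are made up for illustration.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical helper: replace every "\r\n" with "\n", leaving a lone '\r' alone.
crlfToLf :: L.ByteString -> L.ByteString
crlfToLf bs =
  case L.break (== '\r') bs of
    (before, rest)
      | L.null rest -> before
      -- Drop the '\r'; the '\n' that follows is kept by the recursive call.
      | L.take 2 rest == L.pack "\r\n" ->
          L.append before (crlfToLf (L.tail rest))
      -- A '\r' not followed by '\n' is passed through unchanged.
      | otherwise ->
          L.append before (L.cons '\r' (crlfToLf (L.tail rest)))

-- Hypothetical wrapper: read lazily, then convert newlines in the result.
readFileConvertingNewlines :: FilePath -> IO L.ByteString
readFileConvertingNewlines path = fmap crlfToLf (L.readFile path)
```

Because the conversion is a pure function over the lazy ByteString, it still processes the file incrementally and behaves the same on Windows and POSIX.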
Digging one more layer into the GHC source shows that hGetBuf does read raw bytes:
-- | 'hGetBuf' @hdl buf count@ reads data from the handle @hdl@
-- into the buffer @buf@ until either EOF is reached or
-- @count@ 8-bit bytes have been read.
-- It returns the number of bytes actually read. This may be zero if
-- EOF was reached before any data was read (or if @count@ is zero).
--
-- 'hGetBuf' never raises an EOF exception, instead it returns a value
-- smaller than @count@.
--
-- If the handle is a pipe or socket, and the writing end
-- is closed, 'hGetBuf' will behave as if EOF was reached.
--
-- 'hGetBuf' ignores the prevailing 'TextEncoding' and 'NewlineMode'
-- on the 'Handle', and reads bytes directly.
hGetBuf :: Handle -> Ptr a -> Int -> IO Int
hGetBuf h ptr count
  | count == 0 = return 0
  | count <  0 = illegalBufferSize h "hGetBuf" count
  | otherwise =
      wantReadableHandle_ "hGetBuf" h $ \ h_@Handle__{..} -> do
         flushCharReadBuffer h_
         buf@Buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
            <- readIORef haByteBuffer
         if isEmptyBuffer buf
            then bufReadEmpty    h_ buf (castPtr ptr) 0 count
            else bufReadNonEmpty h_ buf (castPtr ptr) 0 count
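The behaviour described in that comment can be observed directly. The following is a minimal sketch, assuming a small file and a caller-chosen buffer size; the helper name rawBytesViaHGetBuf is hypothetical, while hGetBuf, allocaBytes and peekArray are the real base functions.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import System.IO

-- Hypothetical helper: read up to n bytes with hGetBuf from a handle that has
-- been asked to translate newlines, and return exactly the bytes delivered.
rawBytesViaHGetBuf :: FilePath -> Int -> IO [Word8]
rawBytesViaHGetBuf path n =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h universalNewlineMode  -- has no effect on hGetBuf
    allocaBytes n $ \ptr -> do
      got <- hGetBuf h ptr n                -- number of bytes actually read
      peekArray got ptr
```

For a file containing the four bytes of "a\r\nb", this yields [97,13,10,98]: the carriage return (13) is still there despite the NewlineMode set on the handle.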