It seems that reading a file behaves differently depending on whether I'm pulling from :stdio
or opening the file. Why? I would like to be able to read a binary file either from STDIN or by opening it (File.open)
and use the same code to extract the bytes.
In short:
As a simple test case, I have a binary file containing three bytes:
06 8C 7D
My desired result is that reading this file, from either source, should yield a binary of the form:
<<6, 140, 125>>
However, things seem to differ based on whether I'm reading from STDIO or opening the file.
Here are a series of test cases demonstrating the behavior.
IO.binread of stdio yields this error
IO.inspect IO.binread(:stdio, 3)
$ elixir repro.exs < repro.bin
{:error, :collect_chars}
IO.read of stdio yields desired result
IO.inspect IO.read(:stdio, 3)
$ elixir repro.exs < repro.bin
<<6, 140, 125>>
IO.binread of a file produces desired result
{:ok, file} = File.open("repro.bin")
IO.inspect IO.binread(file, 3)
$ elixir repro.exs
<<6, 140, 125>>
IO.read of a file adds an extra byte (194), which I don't understand. My best guess is that this has something to do with UTF-8?
{:ok, file} = File.open("repro.bin")
IO.inspect IO.read(file, 3)
$ elixir repro.exs
<<6, 194, 140, 125>>
What I'm looking for is a way to accept either a file or stdio and treat both devices the same. Right now, it seems I can't. Despite my best googling, I find myself stuck.
Any insights?
An answer from José Valim on the Elixir Google group:
The answer to your question is in which encoding the source is. STDIO is by default in unicode, which means it is not suitable for binread. This is documented in the binread function and is currently an Erlang bug/limitation. To find out the encoding, use getopts:
iex> :io.getopts :standard_io
[expand_fun: &IEx.Autocomplete.expand/1, echo: true, binary: true, encoding: :unicode]
On the other hand, File is in latin, which means read will attempt to convert and binread will return the raw bytes. You can try to use :io.setopts and see if you get the desired result:
iex> :io.setopts :standard_io, encoding: :latin1
I am aware the situation is not ideal. It would be nice if binread could always read bytes regardless of the encoding of the file. I have written a report here: http://erlang.org/pipermail/erlang-bugs/2014-July/004498.html
To sum up:
- read will always attempt to do a conversion to the device encoding
- binread should always return raw binaries but there is a bug when it comes to unicode (which is the default for IO devices)
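The odd "injection" of an extra byte (194) that I saw seems to be Elixir/Erlang treating the file's bytes as latin1 characters and converting them to UTF-8: byte 140 (0x8C) is above 127, so it comes out as the two-byte sequence 194, 140 (0xC2 0x8C).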
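To see where the 194 comes from, here is a small sketch that reproduces the conversion by hand. It only assumes that the read path does a latin1-to-UTF-8 conversion, as the answer above describes; the variable names are mine.

```elixir
# 140 is above 127, so as a code point (U+008C) it needs two bytes in UTF-8.
two_bytes = <<140::utf8>>
IO.inspect(two_bytes, binaries: :as_binaries)
# prints <<194, 140>>

# Treat the three raw bytes as latin1 characters and re-encode them as UTF-8.
converted = :unicode.characters_to_binary(<<6, 140, 125>>, :latin1, :utf8)
IO.inspect(converted, binaries: :as_binaries)
# prints <<6, 194, 140, 125>>
```

The second result matches what IO.read returned for the file, which is consistent with the "read converts, binread does not" summary.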
Per his suggestion, setting the encoding of stdio directly seems to do the trick:
test_read = fn device ->
  IO.binread(device, 3)
end

# Set stdio's encoding to latin1
:io.setopts(:standard_io, encoding: :latin1)

# Test the read against stdio
IO.inspect test_read.(:stdio)

# Open the file to get an IO device
{:ok, fd} = File.open("repro.bin")

# Test the same read against a file
IO.inspect test_read.(fd)
Output:
$ elixir repro.exs < repro.bin
<<6, 140, 125>>
<<6, 140, 125>>
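If you want to hide the setopts call behind one entry point, something like the following might work. This is only a sketch: ByteSource, open/1 and read_bytes/2 are names I made up, and treating "-" as "read from stdin" is my own convention, not anything Elixir defines.

```elixir
defmodule ByteSource do
  # Hypothetical helper, not part of Elixir's standard library.

  # "-" is an assumed convention meaning "read from standard input".
  def open("-") do
    # Switch stdin to latin1 so IO.binread/2 returns the bytes unchanged.
    :ok = :io.setopts(:standard_io, encoding: :latin1)
    {:ok, :stdio}
  end

  def open(path) when is_binary(path) do
    # Without an :encoding option the file device is latin1,
    # so no conversion is applied to the bytes.
    File.open(path)
  end

  # Same call for both kinds of device.
  def read_bytes(device, count) do
    IO.binread(device, count)
  end
end
```

With that in place, {:ok, dev} = ByteSource.open("repro.bin") followed by ByteSource.read_bytes(dev, 3) should give <<6, 140, 125>>, and passing "-" instead of a path should give the same bytes when the script is run with < repro.bin. I have only reasoned about the stdin branch from the output above, so treat it as untested on other Erlang/OTP versions.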