Why do file reads differ depending on STDIO or File.open?

Tags:

elixir

Question

It seems that reading a file behaves differently based on whether I'm pulling from :stdio or opening the file. Why? I would like to be able to read a binary file from either STDIN or by opening the file (File.open) and use the same code to extract the bytes.

In short:

  • Why does the behavior I've outlined below differ between stdio and files?
  • How might I accomplish my goal?

Test case

As a simple test case, I have a binary file containing three bytes:

06 8C 7D

My desired result is that reading this file, from either source, should yield a binary of the form:

<<6, 140, 125>>

However, things seem to differ based on whether I'm reading from STDIO or opening the file.

Here are a series of test cases demonstrating the behavior.

Example 1

IO.binread of stdio yields this error

IO.inspect IO.binread(:stdio, 3)
$ elixir repro.exs < repro.bin
{:error, :collect_chars}

Example 2

IO.read of stdio yields desired result

IO.inspect IO.read(:stdio, 3)
$ elixir repro.exs < repro.bin
<<6, 140, 125>>

Example 3

IO.binread of a file produces desired result

{:ok, file} = File.open("repro.bin")
IO.inspect IO.binread(file, 3)
$ elixir repro.exs
<<6, 140, 125>>

Example 4

IO.read of a file adds an extra byte (194), which I don't understand. My best guess is that this has something to do with UTF-8?

{:ok, file} = File.open("repro.bin")
IO.inspect IO.read(file, 3)
$ elixir repro.exs
<<6, 194, 140, 125>>

What I would like:

A way to accept either a file or stdio and treat either device the same. Right now, it seems I can't. Despite my best googling, I find myself stuck.

Any insights?

Matt Baker asked Dec 24 '15 23:12


1 Answer

An answer from José Valim on the Elixir Google group:

The answer to your question is in which encoding the source is. STDIO is by default in unicode, which means it is not suitable for binread. This is documented in the binread function and is currently an Erlang bug/limitation. To find out the encoding, use getopts:

iex> :io.getopts :standard_io
[expand_fun: &IEx.Autocomplete.expand/1, echo: true, binary: true, encoding: :unicode]

On the other hand, File is in latin, which means read will attempt to convert and binread will return the raw bytes. You can try to use :io.setopts and see if you get the desired result:

iex> :io.setopts :standard_io, encoding: :latin1

I am aware the situation is not ideal. It would be nice if binread could always read bytes regardless of the encoding of the file. I have written a report here: http://erlang.org/pipermail/erlang-bugs/2014-July/004498.html

To sum up:

  • read will always attempt to do a conversion to the device encoding
  • binread should always return raw binaries but there is a bug when it comes to unicode (which is the default for IO devices)
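To see which encoding a given device reports, :io.getopts can be called on a file handle as well as on :standard_io. The following is a minimal sketch rather than part of the original answer: the scratch file path and the variable names are my own, and the file is created on the fly only so the snippet runs on its own. Per the answer above, a file opened with the default modes should report :latin1.

```elixir
# Hypothetical scratch file holding the three test bytes.
path = Path.join(System.tmp_dir!(), "repro.bin")
File.write!(path, <<6, 140, 125>>)

{:ok, file} = File.open(path)

# :io.getopts/1 returns the device's options as a keyword list.
file_opts = :io.getopts(file)
file_encoding = Keyword.get(file_opts, :encoding)
IO.inspect(file_encoding, label: "file encoding")

# Clean up the handle and the scratch file.
File.close(file)
File.rm!(path)
```

Running the same Keyword.get against :io.getopts(:standard_io) shows the encoding stdio is currently using, which is the value :io.setopts changes.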

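The odd "injection" of an extra byte (194) that I saw seems to be Elixir/Erlang treating the file's bytes as latin1 characters and converting them to UTF-8: 140 (0x8C) is above 127, so it is encoded as the two-byte sequence 194, 140.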
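That explanation can be checked outside of any IO device. As a small sketch (the variable names are mine), Erlang's :unicode.characters_to_binary/3 converts a binary from one encoding to another, and treating the three test bytes as latin1 while encoding them as UTF-8 reproduces the four bytes from Example 4:

```elixir
# The three raw bytes from repro.bin.
raw = <<6, 140, 125>>

# Treat each byte as a latin1 code point and re-encode it as UTF-8.
# 6 and 125 are below 128 and stay single bytes; 140 (0x8C) needs
# two bytes in UTF-8: 194, 140.
as_utf8 = :unicode.characters_to_binary(raw, :latin1, :utf8)

IO.inspect(as_utf8, binaries: :as_binaries)
# prints <<6, 194, 140, 125>>
```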

Per his suggestion, setting the encoding of stdio directly seems to do the trick:

test_read = fn(device) ->
  IO.binread(device, 3)
end

# Set stdio's encoding to latin1 so binread returns raw bytes
:io.setopts(:standard_io, encoding: :latin1)

# Test the read against stdio
IO.inspect test_read.(:stdio)

# Open the file to get an IO device
{:ok, fd} = File.open("repro.bin")

# Test the same read against a file
IO.inspect test_read.(fd)

Output:

$ elixir repro.exs < repro.bin
<<6, 140, 125>>
<<6, 140, 125>>

Matt Baker answered Sep 18 '22 13:09