Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comma separated binary arguments? - elixir

Tags:

binary

elixir

I've been learning elixir this month, and was in a situation where I wanted to convert a binary object into a list of bits, for pattern matching.

My research led me here, to an article showing a method for doing so. However, I don't fully understand one of the arguments passed to the extract function.

I could just copy and paste the code, but I'd like to understand what's going on under the hood here.

The argument is this: <<b :: size(1), bits :: bitstring>>.

What I understand

I understand that << x >> denotes a binary object x. Logically to me, it looks as though this is similar to performing: [head | tail] = list on a List, to get the first element, and then the remaining ones as a new list called tail.

What I don't understand

However, I'm not familiar with the syntax, and I have never seen :: in elixir, nor have I ever seen a binary object separated by a comma: ,. I also, haven't seen size(x) used in Elixir, and have never encountered a bitstring.

The Bottom Line


If someone, could explain exactly how the syntax for this argument breaks down, or point me towards a resource I would highly appreciate it.

For your convenience, the code from that article:

defmodule Bits do
  # this is the public api which allows you to pass any binary representation
  def extract(str) when is_binary(str) do
    extract(str, [])
  end

  # this function does the heavy lifting by matching the input binary to
  # a single bit and sends the rest of the bits recursively back to itself
  defp extract(<<b :: size(1), bits :: bitstring>>, acc) when is_bitstring(bits) do
    extract(bits, [b | acc])
  end

  # this is the terminal condition when we don't have anything more to extract
  defp extract(<<>>, acc), do: acc |> Enum.reverse
end

IO.inspect Bits.extract("!!") # => [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
IO.inspect Bits.extract(<< 99 >>) #=> [0, 1, 1, 0, 0, 0, 1, 1]
like image 352
Dylan Avatar asked Jul 22 '19 17:07

Dylan


2 Answers

Elixir pattern matching seems mind blowingly easy to use for structured binary data.

Yep. You can thank the erlang inventors.

According to the documentation, <<x :: size(y)>> denotes a bitstring, whos decimal value is x and is represented by a string of bits that is y in length.

Let's dumb it down a bit: <<x :: size(y)>> is the integer x inserted into y bits. Examples:

<<1 :: size(1)>>  => 1
<<1 :: size(2)>>  => 01
<<1 :: size(3)>>  => 001
<<2 :: size(3)>>  => 010
<<2 :: size(4)>>  => 0010

The number of bits in the binary type is divisible by 8, so a binary type has a whole number of bytes (1 byte = 8 bits). The number of bits in a bitstring is not divisible by 8. That's the difference between the binary type and the bitstring type.

I understand that << x >> denotes a binary object x. Logically to me, it looks as though this is similar to performing: [head | tail] = list on a List, to get the first element, and then the remaining ones as a new list called tail.

Yes:

defmodule A do

  def show_list([]), do: :ok
  def show_list([head|tail]) do
    IO.puts head
    show_list(tail)
  end

  def show_binary(<<>>), do: :ok
  def show_binary(<<char::binary-size(1), rest::binary>>) do
    IO.puts char
    show_binary(rest)
  end

end

In iex:

iex(6)> A.show_list(["a", "b", "c"])    
a
b
c
:ok

iex(7)> "abc" = <<"abc">> = <<"a", "b", "c">> = <<97, 98, 99>>
"abc"

iex(9)> A.show_binary(<<97, 98, 99>>)   
a
b
c
:ok

Or you can interpret the integers in the binary as plain old integers:

  def show(<<>>), do: :ok

  def show(<<ascii_code::integer-size(8), rest::binary>>) do
    IO.puts ascii_code
    show(rest)
  end

In iex:

iex(6)> A.show(<<97, 98, 99>>)            
97
98
99
:ok

The utf8 type is super useful because it will grab as many bytes as required to get a whole utf8 character:

  def show(<<>>), do: :ok

  def show(<<char::utf8, rest::binary>>) do
    IO.puts char
    show(rest)
  end

In iex:

iex(8)> A.show("ۑ")
8364
235
:ok

As you can see, the uft8 type returns the unicode codepoint of the character. To get the character as a string/binary:

  def show(<<>>), do: :ok
  def show(<<codepoint::utf8, rest::binary>>) do
    IO.puts <<codepoint::utf8>>
    show(rest)
  end

You take the codepoint(an integer) and use it to create the binary/string <<codepoint::utf8>>.

In iex:

iex(1)> A.show("ۑ")
€                                                          
ë
:ok

You can't specify a size for the utf8 type, though, so if you want to read multiple utf8 characters, you have to specify multiple segments.

And of course, the segment rest::binary, i.e. a binary type with no size specified, is super useful. It can only appear at the end of a pattern, and rest::binary is like the greedy regex: (.*). The same goes for rest::bitstring.

Although the elixir docs don't mention it anywhere, the total number of bits in a segment, where a segment is one of those things:

     |              |          |    
     v              v          v
<< 1::size(8), 1::size(16), 1::size(1) >>

is actually unit * size, where each type has a default unit. The default type for a segment is integer, so the type for each segment above defaults to integer. An integer has a default unit of 1 bit, so the total number of bits in the first segment is: 8 * 1 bit = 8 bits. The default unit for the binary type is 8 bits, so a segment like:

<< char::binary-size(6)>>

has a total size of 6 * 8 bits = 48 bits. Equivalently, size(6) is just the number of bytes. You can specify the unit just like you can the size, e.g. <<1::integer-size(2)-unit(3)>>. The total bit size of that segment is: 2 * 3 bits = 6 bits.

However, I'm not familiar with the syntax

Check this out:

  def bitstr2bits(bitstr) do
    for <<bit::integer-size(1) <- bitstr>>, do: bit
  end

In iex:

iex(17)> A.bitstr2bits <<1::integer-size(2), 2::integer-size(2)>>   
[0, 1, 1, 0]

Equivalently:

iex(3)> A.bitstr2bits(<<0b01::integer-size(2), 0b10::integer-size(2)>>)
[0, 1, 1, 0]

Elixir tends to abstract away recursion with library functions, so usually you don't have to come up with your own recursive definitions like at your link. However, that link shows one of the standard, basic recursion tricks: adding an accumulator to the function call to gather results that you want the function to return. That function could also be written like this:

  def bitstr2bits(<<>>), do: [] 
  def bitstr2bits(<<bit::integer-size(1), rest::bitstring>>) do
    [bit | bitstr2bits(rest)]
  end

The accumulator function at the link is tail recursive, which means it takes up a constant (small) amount of memory--no matter how many recursive function calls are needed to step through the bitstring. A bitstring with 10 million bits? Requiring 10 million recursive function calls? That would only require a small amount of memory. In the old days, the alternate definition I posted could potentially crash your program because it would take up more and more memory for each recursive function call, and if the bitstring were long enough the amount of memory needed would be too large, and you would get stackoverflow and your program would crash. However, erlang has optimized away the disadvantages of recursive functions that are not tail recursive.

like image 74
7stud Avatar answered Nov 18 '22 12:11

7stud


You can read about all these here, short answer:

  • :: is similar as guard, like a when is_integer(a), in you case size(1) expect a 1 bit binary
  • , is a separator between matching binaries, like | in [x | []] or like comma in [a, b]
  • bitstring is a superset over binaries, you can read about it here and here, any binary can be respresented as bitstring
iex> ?h
104
iex> ?e
101
iex> ?l
108
iex> ?o
111
iex> <<104, 101, 108, 108, 111>>
"hello"
iex> [104, 101, 108, 108, 111]
'hello'

but not vice versa

iex> <<1, 2, 3>>
<<1, 2, 3>>
like image 1
apelsinka223 Avatar answered Nov 18 '22 11:11

apelsinka223