 

Large file and hashing - performance concern

Tags:

elixir

I'm trying to hash a file (16 MB) line by line with the following code:

 def hash(data, protocol) do
   :crypto.hash(protocol, data)
   |> Base.encode16()
 end

 File.stream!(path)
 |> Stream.map(&hash(&1, :md5) <> "h")
 |> Enum.to_list()
 |> hd()
 |> IO.puts()

According to the time command, this takes between 10 and 12 seconds, which seems huge to me, considering that the following Python code:

import md5

with open('a', 'r') as f:
    content = f.readlines()
    l = []
    for _, val in enumerate(content):
        m = md5.new()
        m.update(val)
        l.append(m.hexdigest() + "h")

    print l[0]

runs (still according to time) in about 2.3 seconds.

Where would I start to improve the performance of my Elixir code? I tried to split the initial stream into 10 chunks and fire an asynchronous task for each of them:

File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)
|> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, &hash(&1, :md5) <> "h") end) end)
|> Enum.flat_map(&Task.await/1)
|> hd()
|> IO.puts()

but it yields even worse results, about 11+ seconds to run. Why is that?

Kernael asked Dec 18 '15


1 Answer

One thing to take into account is that using time to measure Elixir code always includes the start-up time of the BEAM virtual machine. Depending on your application, it may or may not make sense to include this in any comparison benchmarks against other languages. If you just want to maximize the performance of Elixir code, it's best to use a benchmarking tool like Benchfella or even just :timer.tc from Erlang.

https://hex.pm/packages/benchfella
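
For example, a minimal sketch of timing just the pipeline with :timer.tc, reusing the hash/2 function and path from the question (since the VM is already running, BEAM start-up is excluded from the measurement):

    # :timer.tc/1 runs the function and returns {microseconds, result}.
    {micros, _hashes} =
      :timer.tc(fn ->
        File.stream!(path)
        |> Enum.map(fn line -> hash(line, :md5) <> "h" end)
      end)

    IO.puts("hashing took #{micros / 1_000_000} s")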

My guess is that your performance problems are all I/O related. File.stream! is not particularly efficient for line processing of large files.
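
One cheap experiment (a sketch, on the assumption that the many small per-line reads are the bottleneck) is to open the stream with a larger read-ahead buffer, which cuts down the number of underlying I/O calls:

    # {:read_ahead, size} is a standard File.open mode that buffers reads.
    File.stream!(path, [{:read_ahead, 65_536}])
    |> Enum.map(fn line -> hash(line, :md5) <> "h" end)

How much this helps depends on your Erlang/OTP version and the line lengths in the file.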

I wrote a blog post on a similar problem of hashing the entire file.

http://www.cursingthedarkness.com/2015/06/micro-benchmarking-in-elixir-using.html

And there is a slide deck about fast line-based processing here.

http://bbense.github.io/beatwc/

I think if you slurp the whole file in you'll get better performance. I would not hesitate at all to just use

File.stream!(path) |> Enum.map(fn line -> hash(line, :md5) <> "h" end)

for a 16 MB file. Using a Stream in a pipeline almost always trades speed for memory use. Since data is immutable in Elixir, large lists generally have less overhead than you would initially expect.
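
For example, a minimal sketch of the slurp-then-split variant (note the caveat in the comments: the hashed strings differ slightly from the streamed version):

    # Read the whole 16 MB file into memory, then split into lines.
    # Caveat: String.split/2 strips the "\n" delimiter, while File.stream!
    # keeps the trailing newline on each line, so the per-line hashes
    # will differ between the two versions.
    File.read!(path)
    |> String.split("\n")
    |> Enum.map(fn line -> hash(line, :md5) <> "h" end)
    |> hd()
    |> IO.puts()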

Your task-based code won't help much, since I suspect the majority of the time is spent chunking the lines in these two lines:

File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)

That chunking is going to be really slow. Here is another code example you might find useful: https://github.com/dimroc/etl-language-comparison/tree/master/elixir
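
As an aside, if you do want parallelism without materializing chunks up front, here is a sketch using Task.async_stream (added in Elixir 1.4, so after this question was asked; note that for work as cheap as one MD5 per line, the per-message overhead may still eat any gain):

    # Task.async_stream hashes lines concurrently and yields
    # {:ok, result} tuples in the original order.
    File.stream!(path)
    |> Task.async_stream(fn line -> hash(line, :md5) <> "h" end, max_concurrency: 10)
    |> Enum.map(fn {:ok, hashed} -> hashed end)
    |> hd()
    |> IO.puts()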

There are a lot of tricks you can use to get fast file processing in Elixir. You can often improve the speed from the naive File.stream! version by multiple orders of magnitude.

Fred the Magic Wonder Dog answered Sep 29 '22