 

Large file and hashing - performance concern

Tags:

elixir

I'm trying to hash a file (16 MB) line by line with the following code:

 def hash(data, protocol) do
   :crypto.hash(protocol, data)
   |> Base.encode16()
 end

 File.stream!(path)
 |> Stream.map(&hash(&1, :md5) <> "h")
 |> Enum.to_list()
 |> hd()
 |> IO.puts()

According to the time command, this takes between 10 and 12 seconds, which seems huge to me, considering that the following Python code:

import md5

with open('a', 'r') as f:
    content = f.readlines()
    l = []
    for _, val in enumerate(content):
        m = md5.new()
        m.update(val)
        l.append(m.hexdigest() + "h")

    print l[0]

runs (still according to time) in about 2.3 seconds.

Where would I start to improve the performance of my Elixir code? I tried to split the initial stream into 10 chunks and fire an asynchronous task for each of them:

File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)
|> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, &hash(&1, :md5) <> "h") end) end)
|> Enum.flat_map(&Task.await/1)
|> hd()
|> IO.puts()

but it yields even worse results, about 11+ seconds to run. Why is that?

Kernael asked Dec 18 '15


1 Answer

One thing to take into account is that using time to measure Elixir code always includes the start-up time of the BEAM virtual machine. Depending on your application, it may or may not make sense to include this in any comparison benchmarks against other languages. If you just want to maximize the performance of Elixir code, it's best to use a benchmarking tool like Benchfella or even just :timer.tc from Erlang.

https://hex.pm/packages/benchfella
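
For example, a minimal sketch of timing just the pipeline with :timer.tc, reusing the hash/2 function and path from the question (since the VM is already running, BEAM start-up is excluded from the measurement):

    # :timer.tc/1 runs the function and returns {microseconds, result}.
    {micros, _hashes} =
      :timer.tc(fn ->
        File.stream!(path)
        |> Enum.map(fn line -> hash(line, :md5) <> "h" end)
      end)

    IO.puts("hashing took #{micros / 1_000_000} s")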

My guess is that your performance problems are all I/O related. File.stream! is not particularly efficient for line processing of large files.
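
One cheap experiment (a sketch, on the assumption that the many small per-line reads are the bottleneck) is to open the stream with a larger read-ahead buffer, which cuts down the number of underlying I/O calls:

    # {:read_ahead, size} is a standard File.open mode that buffers reads.
    File.stream!(path, [{:read_ahead, 65_536}])
    |> Enum.map(fn line -> hash(line, :md5) <> "h" end)

How much this helps depends on your Erlang/OTP version and the line lengths in the file.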

I wrote a blog post on a similar problem of hashing the entire file.

http://www.cursingthedarkness.com/2015/06/micro-benchmarking-in-elixir-using.html

And there is a slide deck about fast line-based processing here.

http://bbense.github.io/beatwc/

I think if you slurp the whole file in you'll get better performance. I would not hesitate at all to just use

File.stream!(path) |> Enum.map(fn line -> hash(line, :md5) <> "h" end)

for a 16 MB file. Using a Stream in a pipeline almost always trades speed for memory use. Since data is immutable in Elixir, large lists generally have less overhead than you would initially expect.
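
For example, a minimal sketch of the slurp-then-split variant (note the caveat in the comments: the hashed strings differ slightly from the streamed version):

    # Read the whole 16 MB file into memory, then split into lines.
    # Caveat: String.split/2 strips the "\n" delimiter, while File.stream!
    # keeps the trailing newline on each line, so the per-line hashes
    # will differ between the two versions.
    File.read!(path)
    |> String.split("\n")
    |> Enum.map(fn line -> hash(line, :md5) <> "h" end)
    |> hd()
    |> IO.puts()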

Your task-based code won't help much, since I suspect the majority of the time is spent chunking the lines in these two lines:

File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)

That chunking is going to be really slow. Here is another code example you might find useful: https://github.com/dimroc/etl-language-comparison/tree/master/elixir
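
As an aside, if you do want parallelism without materializing chunks up front, here is a sketch using Task.async_stream (added in Elixir 1.4, so after this question was asked; note that for work as cheap as one MD5 per line, the per-message overhead may still eat any gain):

    # Task.async_stream hashes lines concurrently and yields
    # {:ok, result} tuples in the original order.
    File.stream!(path)
    |> Task.async_stream(fn line -> hash(line, :md5) <> "h" end, max_concurrency: 10)
    |> Enum.map(fn {:ok, hashed} -> hashed end)
    |> hd()
    |> IO.puts()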

There are a lot of tricks you can use to get fast file processing in Elixir. You can often improve the speed from the naive File.stream! version by multiple orders of magnitude.

Fred the Magic Wonder Dog answered Sep 29 '22