I'm trying to hash a file (16 MB) line by line with the following code:
def hash(data, protocol) do
  :crypto.hash(protocol, data)
  |> Base.encode16()
end
File.stream!(path)
|> Stream.map(&(hash(&1, :md5) <> "h"))
|> Enum.to_list()
|> hd()
|> IO.puts()
According to the time command, this takes between 10 and 12 seconds, which seems like a huge number to me, considering that the following Python code:
import md5

with open('a', 'r') as f:
    content = f.readlines()

l = []
for _, val in enumerate(content):
    m = md5.new()
    m.update(val)
    l.append(m.hexdigest() + "h")
print l[0]
runs (still according to time) in about 2.3 seconds.
Where would I start to improve the performance of my Elixir code? I tried to split the initial stream into 10 chunks and fire an asynchronous task for each of them:
File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)
|> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, &(hash(&1, :md5) <> "h")) end) end)
|> Enum.flat_map(&Task.await/1)
|> hd()
|> IO.puts()
but it yields even worse results, taking 11+ seconds to run. Why is that?
One thing to take into account is that measuring Elixir code with time always includes the start-up time of the BEAM virtual machine. Depending on your application, it may or may not make sense to include this in any comparison benchmarks against other languages. If you just want to maximize the performance of the Elixir code itself, it's best to use a benchmarking tool like Benchfella, or even just :timer.tc from Erlang.
https://hex.pm/packages/benchfella
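For example, here is a minimal sketch of timing just the hashing work with :timer.tc, assuming the hash/2 helper and the path variable from the question:

# Measure only the work done inside the already-running VM,
# excluding BEAM start-up time. :timer.tc returns {microseconds, result}.
{micros, _digests} =
  :timer.tc(fn ->
    File.stream!(path)
    |> Enum.map(fn line -> hash(line, :md5) <> "h" end)
  end)

IO.puts("hashing took #{micros / 1_000_000} seconds")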
My guess is that your performance problems are all I/O related. File.stream! is not particularly efficient for line processing of large files.
I wrote a blog post on a similar problem of hashing the entire file.
http://www.cursingthedarkness.com/2015/06/micro-benchmarking-in-elixir-using.html
And there is a slide deck about doing fast line-based processing here.
http://bbense.github.io/beatwc/
I think if you slurp the whole file in, you'll get better performance. For a 16 MB file, I would not hesitate at all to just use

File.stream!(path) |> Enum.map(fn line -> hash(line, :md5) <> "h" end)

Using a Stream in a pipeline almost always trades speed for memory use. Since data is immutable in Elixir, large lists generally have less overhead than you would initially expect.
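If you want to slurp the file literally, here is a sketch of that approach (File.read! plus String.split is my reading of "slurp", not code from the original answer; note that String.split drops the newline separators that File.stream! keeps, so the per-line digests will differ between the two versions):

# Read the whole 16 MB file into one binary, then split it into lines.
path
|> File.read!()
|> String.split("\n")
|> Enum.map(fn line -> hash(line, :md5) <> "h" end)
|> hd()
|> IO.puts()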
Your task-based code won't help much, since I suspect the majority of the time is spent chunking the lines in these two lines:

File.stream!(path)
|> Stream.chunk(chunk_size) # with chunk_size being (nb_of_lines_in_file / 10)

That's going to be really slow. Here is another code example you might find useful: https://github.com/dimroc/etl-language-comparison/tree/master/elixir
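If you do still want parallelism, one way to avoid materializing the chunks at all is Task.async_stream/3. This is my suggestion rather than part of the original answer (it arrived later, in Elixir 1.4), and for per-line work as cheap as MD5 the message-passing overhead may well outweigh the gain, so benchmark it:

# Hash lines in parallel without building chunk lists first.
# Results come back in input order, wrapped in {:ok, value} tuples.
File.stream!(path)
|> Task.async_stream(fn line -> hash(line, :md5) <> "h" end)
|> Enum.map(fn {:ok, digest} -> digest end)
|> hd()
|> IO.puts()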
There are a lot of tricks you can use to get fast file processing in Elixir. You can often improve the speed over the naive File.stream! version by multiple orders of magnitude.
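One such trick (my illustration, not taken from the links above) is to open the stream with a larger read-ahead buffer, so each line is served from memory instead of triggering its own file-system read; the 64 KB size here is a guess worth benchmarking:

# Buffer 64 KB ahead of each read instead of fetching one line
# from the file system at a time.
File.stream!(path, [read_ahead: 65_536], :line)
|> Enum.map(fn line -> hash(line, :md5) <> "h" end)
|> hd()
|> IO.puts()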