I have the following code that reads a Wikipedia dump file (~50 GB) and delivers pages on request:
defmodule Pages do
  def start_link(filename) do
    pid = spawn_link(__MODULE__, :loop, [filename])
    Process.register(pid, :pages)
    pid
  end

  def next(xml_parser) do
    send(xml_parser, {:get_next, self()})

    receive do
      {:next_page, page} -> page
    end
  end

  def loop(filename) do
    :xmerl_sax_parser.file(filename,
      event_fun: &event_fun/3,
      event_state: :top
    )

    loop_done()
  end

  defp loop_done do
    receive do
      {:get_next, from} -> send(from, {:next_page, nil})
    end

    loop_done()
  end

  defp event_fun({:startElement, _, 'page', _, _}, _, :top) do
    :page
  end

  defp event_fun({:startElement, _, 'text', _, _}, _, :page) do
    :text
  end

  defp event_fun({:characters, chars}, _, :text) do
    s = List.to_string(chars)

    receive do
      {:get_next, from} -> send(from, {:next_page, s})
    end

    :text
  end

  defp event_fun({:endElement, _, 'text', _}, _, :text) do
    :page
  end

  defp event_fun({:endElement, _, 'page', _}, _, :page) do
    :top
  end

  defp event_fun({:endDocument}, _, state) do
    receive do
      {:get_next, from} -> send(from, {:done})
    end

    state
  end

  defp event_fun(_, _, state) do
    state
  end
end
Since the code uses a SAX parser, I would expect a constant memory footprint. When I try to read the first 2000 pages using
Enum.each(1..2000, fn _ -> Pages.next(Process.whereis(:pages)) end)
the :pages process uses 1.1 GB of memory according to :observer.start(). When I try to read 10000 pages, the whole thing crashes:
Crash dump is being written to: erl_crash.dump...done
eheap_alloc: Cannot allocate 5668310376 bytes of memory (of type "heap").
When I open erl_crash.dump with the crash dump viewer, I see the following (screenshot not reproduced here).
Is something wrong with the code above? Is the GC not keeping up? I can see the memory usage per process, but that alone doesn't tell me much. How can I see where this memory actually goes?
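A minimal sketch of inspecting the process directly with the standard Process.info/2 (shown for reference only, not part of the original code):

pid = Process.whereis(:pages)

# :memory is reported in bytes; :total_heap_size and :heap_size are in words
Process.info(pid, [:memory, :total_heap_size, :heap_size, :message_queue_len])

# force a collection to see how much of the heap is actually live afterwards
:erlang.garbage_collect(pid)
Process.info(pid, :memory)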
P.S. Here is a link to a crash dump from today: https://ufile.io/becba. The number of atoms is 14490; the MsgQ is 2 for :pages and 0 for all other processes.
The default maximum number of atoms is slightly over 1 million. Given that the English Wikipedia has over 5 million articles and xmerl seems to create an atom for each namespace URI, I think the atom table may be the culprit.
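One way to test that hypothesis is to watch the atom table while pages are being pulled. The sketch below reuses the 2000-page loop from the question; :erlang.system_info(:atom_count) and :erlang.system_info(:atom_limit) are standard calls available on OTP 20 and later:

IO.inspect(:erlang.system_info(:atom_limit), label: "atom limit")

before = :erlang.system_info(:atom_count)

Enum.each(1..2000, fn _ ->
  Pages.next(Process.whereis(:pages))
end)

created = :erlang.system_info(:atom_count) - before
IO.puts("atoms created while reading 2000 pages: #{created}")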
Also, running the code below in Elixir fails with just a "stack smashing" error:
Enum.each(1..2_000_000, fn x ->
  x
  |> Integer.to_string()
  |> String.to_atom()
end)
But if I raise the atom limit to something like 5 million via the environment variable ELIXIR_ERL_OPTIONS="+t 5000000", the problem vanishes.
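For example, after starting IEx with ELIXIR_ERL_OPTIONS="+t 5000000" iex (or, equivalently, passing --erl "+t 5000000"), the new limit can be confirmed with a quick check; the default it replaces is 1,048,576:

:erlang.system_info(:atom_limit)
# => should now report the raised limit instead of the default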