Reading large XML file using xmerl crashes the node

Tags: elixir

I have the following code that reads a Wikipedia dump file (~50 GB) and delivers pages on request:

defmodule Pages do
  def start_link(filename) do
    pid = spawn_link(__MODULE__, :loop, [filename])
    Process.register(pid, :pages)
    pid
  end

  def next(xml_parser) do
    send(xml_parser, {:get_next, self()})
    receive do
      {:next_page, page} -> page
    end
  end

  # Runs the SAX parse; event_fun/3 blocks inside :characters events
  # until a consumer asks for the next page.
  def loop(filename) do
    :xmerl_sax_parser.file(filename,
      event_fun: &event_fun/3,
      event_state: :top)
    loop_done()
  end

  # After the document has been fully parsed, keep answering nil.
  defp loop_done do
    receive do
      {:get_next, from} -> send(from, {:next_page, nil})
    end
    loop_done()
  end

  defp event_fun({:startElement, _, 'page', _, _}, _, :top) do
    :page
  end

  defp event_fun({:startElement, _, 'text', _, _}, _, :page) do
    :text
  end

  defp event_fun({:characters, chars}, _, :text) do
    s = List.to_string(chars)
    receive do
      {:get_next, from} -> send(from, {:next_page, s})
    end
    :text
  end

  defp event_fun({:endElement, _, 'text', _}, _, :text) do
    :page
  end

  defp event_fun({:endElement, _, 'page', _}, _, :page) do
    :top
  end

  # endDocument is delivered as a bare atom by :xmerl_sax_parser
  defp event_fun(:endDocument, _, state) do
    receive do
      {:get_next, from} -> send(from, {:done})
    end
    state
  end

  defp event_fun(_, _, state) do
    state
  end
end

Since the code uses a SAX parser, I would expect a constant memory footprint. When I try to read the first 2000 pages using

Enum.each(1..2000, fn _ -> Pages.next(Process.whereis(:pages)) end)

the :pages process uses 1.1 GB of memory according to :observer.start(). When I try to read 10,000 pages, the whole thing crashes:

Crash dump is being written to: erl_crash.dump...done
eheap_alloc: Cannot allocate 5668310376 bytes of memory (of type "heap").

When I open erl_crash.dump using the crash dump viewer I see the following: [screenshot from the crash dump viewer]

Is something wrong with the code above? Is the GC not quick enough? Although I can see the memory per process, that doesn't tell me much. How can I see where this memory actually goes?

P.S. Here is a link to a crash dump from today: https://ufile.io/becba. The number of atoms is 14490, the MsgQ is 2 for :pages and 0 for all other processes.

asked by Konstantin Milyutin

1 Answer

The default maximum number of atoms is slightly over 1 million (1,048,576). Given that the English Wikipedia has over 5 million articles and xmerl seems to create an atom for each namespace URI, I think the atom table may be the culprit.
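
If atom exhaustion is the cause, it should be visible while the document is being parsed. As a rough check (a sketch, assuming OTP 20 or newer, which exposes the :atom_count and :atom_limit keys), you can inspect the atom table from a shell on the same node:

# How many atoms currently exist, and the maximum allowed
:erlang.system_info(:atom_count)
:erlang.system_info(:atom_limit)    # 1_048_576 by default

# Memory used by the atom table, in bytes
:erlang.memory(:atom_used)

If :atom_count keeps climbing while pages are parsed, the atom table is indeed what is filling up.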

Also, running the code below in Elixir, which creates 2 million distinct atoms, fails with just a "stack smashing" error.

Enum.each(1..2_000_000, fn x ->
  x
  |> Integer.to_string()
  |> String.to_atom()
end)

But if I raise the atom limit to something like 5 million by setting the environment variable ELIXIR_ERL_OPTIONS="+t 5000000", the problem vanishes.
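
For example (a sketch; the exact limit is arbitrary, as long as it is comfortably above the number of atoms you expect), start IEx with the raised limit and verify it took effect:

$ ELIXIR_ERL_OPTIONS="+t 5000000" iex
iex> :erlang.system_info(:atom_limit)    # should now report the raised limit

Keep in mind that atoms are never garbage collected, so raising the limit only buys headroom; if something keeps creating new atoms for every page, the table will eventually fill up again.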

answered by Lauro Moura