((Please forgive me that I ask more than one question in a single thread. I think they are related.))
Hello, I wanted to know what best practices exist in Erlang with regard to per-module precompiled data.
Example: I have a module that operates heavily on very complex regular expressions that are known a priori. The documentation for re:compile/2 says: “Compiling once and executing many times is far more efficient than compiling each time one wants to match”. Since re's mp() datatype is not specified in any way, a compiled pattern cannot be embedded at compile time if you want a target-independent beam file, so the regex has to be compiled at runtime. ((Note: re:compile/2 is only an example. Any complex function to memoize would fit my question.))
An Erlang module can have an -on_load(F/A) attribute, denoting a function that should be executed once when the module is loaded. As such, I could compile my regexes in this function and save the result in a new ets table named ?MODULE.
My questions are:
Working example:
-module(memoization).
-export([is_ipv4/1, fillCacheLoop/0]).

%% The default value of the record field compiles the regex
%% whenever a #?MODULE{} record is created.
-record(?MODULE, { re_ipv4 = re_ipv4() }).
-on_load(fillCache/0).

%% Loop of the process that owns the ets table.
fillCacheLoop() ->
    receive
        { replace, NewData, Callback, Ref } ->
            true = ets:insert(?MODULE, [{ data, {self(), NewData} }]),
            Callback ! { on_load, Ref, ok },
            %% Fully qualified call, so a reloaded module is picked up.
            ?MODULE:fillCacheLoop();
        purge ->
            ok
    end.

%% Runs once when the module is loaded.
fillCache() ->
    Callback = self(),
    Ref = make_ref(),
    process_flag(trap_exit, true),
    Pid = spawn_link(fun() ->
        case catch ets:lookup(?MODULE, data) of
            [{data, {TableOwner,_} }] ->
                %% The table already exists: ask its owner to replace the data.
                TableOwner ! { replace, #?MODULE{}, self(), Ref },
                receive
                    { on_load, Ref, Result } ->
                        Callback ! { on_load, Ref, Result }
                end,
                ok;
            _ ->
                %% First load: create the table and become its owner.
                ?MODULE = ets:new(?MODULE, [named_table, {read_concurrency,true}]),
                true = ets:insert_new(?MODULE, [{ data, {self(), #?MODULE{}} }]),
                Callback ! { on_load, Ref, ok },
                fillCacheLoop()
        end
    end),
    receive
        { on_load, Ref, Result } ->
            unlink(Pid),
            Result;
        { 'EXIT', Pid, Result } ->
            Result
    after 1000 ->
        error
    end.

is_ipv4(Addr) ->
    %% Cache the ets data in the process dictionary. The key is a tuple,
    %% because the dotted ?MODULE.data form relied on the package syntax
    %% that has since been removed from Erlang.
    Data = case get({?MODULE, data}) of
        undefined ->
            [{data, {_,Result} }] = ets:lookup(?MODULE, data),
            put({?MODULE, data}, Result),
            Result;
        SomeDatum -> SomeDatum
    end,
    re:run(Addr, Data#?MODULE.re_ipv4).

re_ipv4() ->
    {ok, Result} = re:compile("^0*"
        "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
        "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
        "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.0*"
        "([1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])$"),
    Result.
You have another option. You can precompute the regular expression's compiled form and refer to it directly. One way to do this is to use a module designed specifically for this purpose, such as ct_expand: http://dukesoferl.blogspot.com/2009/08/metaprogramming-with-ctexpand.html
You can also roll your own by generating a module on the fly with a function to return this value as a constant (taking advantage of the constant pool): http://erlang.org/pipermail/erlang-questions/2011-January/056007.html
Or you could even run re:compile in a shell and copy and paste the result into your code. Crude but effective. This wouldn't be portable in case the implementation changes.
To be clear: all of these take advantage of the constant pool to avoid recomputing every time. But of course, this is added complexity and it has a cost.
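As a rough illustration of the "generate a module on the fly" idea, here is a minimal sketch. The module name const_gen and the functions store/2 and value/0 are made up for this example; only erl_parse:abstract/1, compile:forms/2, code:purge/1 and code:load_binary/3 are standard calls. It assumes the term can be written as a source literal (no pids, ports, references or funs inside it), which you would have to check for whatever your OTP release returns from re:compile.

```erlang
-module(const_gen).
-export([store/2]).

%% Hypothetical helper: compile and load a module named Mod whose only
%% function, Mod:value/0, returns Term as a literal.
store(Mod, Term) when is_atom(Mod) ->
    Forms = [{attribute, 1, module, Mod},
             {attribute, 2, export, [{value, 0}]},
             %% erl_parse:abstract/1 turns the term into its abstract form,
             %% so it ends up as a literal in the generated module.
             {function, 3, value, 0,
              [{clause, 3, [], [], [erl_parse:abstract(Term)]}]}],
    {ok, Mod, Bin} = compile:forms(Forms, []),
    %% Remove old code (if any) so that repeated stores keep working.
    code:purge(Mod),
    {module, Mod} = code:load_binary(Mod, atom_to_list(Mod) ++ ".erl", Bin),
    ok.
```

After const_gen:store(my_consts, SomeTerm), any process can call my_consts:value() to read the term. Keep in mind that every store/2 call recompiles and reloads a module, so this only pays off for data that changes rarely.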
Coming back to your original question: the problem with the process dictionary is that, well, it can only be used by its own process. Are you certain this module's functions will only be called by the same process? Even ETS tables are tied to the process that creates them (ETS is not itself implemented using processes and message passing, though) and will die if that process dies.
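To make the process dictionary point concrete, here is a small sketch (the module name pdict_demo and the key cached are invented for the example). A value stored with put/2 in one process is simply not there when another process calls get/1.

```erlang
-module(pdict_demo).
-export([run/0]).

run() ->
    %% Store a value in the calling process's dictionary.
    put(cached, compiled_value),
    Parent = self(),
    %% A different process looks up the same key in its own dictionary.
    spawn(fun() -> Parent ! {seen_by_child, get(cached)} end),
    receive
        {seen_by_child, ChildValue} ->
            %% First element: what the caller sees. Second: what the child saw.
            {get(cached), ChildValue}
    after 1000 ->
        timeout
    end.
```

Calling pdict_demo:run() gives {compiled_value, undefined}: the child never sees the cached value, so each process calling your module would have to fill its own cache.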
ETS isn't implemented in a process and doesn't have its data in a separate process heap, but it does have its data in a separate area outside of all processes. This means that when reading/writing to ETS tables data must be copied to/from processes. How costly this is depends, of course, on the amount of data being copied. This is one reason why we have functions like ets:match_object and ets:select, which allow more complex selection rules before data is copied.
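A minimal sketch of selecting before copying, with invented table contents (the module name ets_select_demo and the row layout {Key, Kind, Payload} are assumptions for the example). The match specification picks out only the keys of the small rows, so the large payloads are never copied into the calling process.

```erlang
-module(ets_select_demo).
-export([run/0]).

run() ->
    T = ets:new(demo, [set]),
    Big = lists:seq(1, 1000),
    true = ets:insert(T, [{1, small, []},
                          {2, big, Big},
                          {3, small, []}]),
    %% Match spec: for rows of the form {Key, small, _}, return only Key.
    Keys = ets:select(T, [{{'$1', small, '_'}, [], ['$1']}]),
    true = ets:delete(T),
    %% A set table has no defined order, so sort for a stable result.
    lists:sort(Keys).
```

Here ets_select_demo:run() returns [1,3]. With ets:lookup/2 or ets:tab2list/1 you would instead get the whole rows copied, payloads included.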
One benefit of keeping your data in an ETS table is that it can be reached by all processes not just the process which owns the table. This can make it more efficient than keeping your data in a server. It also depends on what type of operations you want to do on your data. ETS is just a data store and provides limited atomicity. In your case that is probably no problem.
You should definitely keep your data in separate records, one for each compiled regular expression, as it will greatly increase the access speed. You can then directly get the re you are after; otherwise you will get them all and then have to search again for the one you want. That sort of defeats the point of putting them in ETS.
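One way this could look, as a sketch only: the module name re_cache, the functions init/0 and match/2, and the two sample patterns are all made up for illustration. Each compiled regex gets its own table entry keyed by a name, so a lookup copies just that one pattern.

```erlang
-module(re_cache).
-export([init/0, match/2]).

%% Must be called from a process that stays alive, since that process
%% becomes the owner of the table.
init() ->
    ?MODULE = ets:new(?MODULE, [named_table, set, protected,
                                {read_concurrency, true}]),
    %% One entry per regex: {Name, CompiledPattern}.
    true = ets:insert(?MODULE,
                      [{Name, compile(Src)} || {Name, Src} <- patterns()]),
    ok.

%% Fetch only the pattern asked for and run it.
match(Name, Subject) ->
    [{Name, MP}] = ets:lookup(?MODULE, Name),
    re:run(Subject, MP, [{capture, none}]).

%% Sample patterns, purely for the example.
patterns() ->
    [{digits, "^\\d+$"},
     {word,   "^\\w+$"}].

compile(Src) ->
    {ok, MP} = re:compile(Src),
    MP.
```

After re_cache:init(), a call such as re_cache:match(digits, "12345") reads a single entry instead of one big record holding every regex.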
While you can do things like building ETS tables in on_load functions, it is not a good idea for ETS tables. This is because an ETS table is owned by a process and is deleted when that process dies. You never really know in which process the on_load function is called. You should also avoid doing things which can take a long time, as the module is not considered to be loaded until the function has completed.
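The ownership problem can be shown with a short sketch (the module name ets_owner_demo and the table name short_lived are invented for the example). A table is created in a helper process, and once that process has exited the table is gone too.

```erlang
-module(ets_owner_demo).
-export([run/0]).

run() ->
    Parent = self(),
    %% The spawned process creates the table and therefore owns it.
    {Pid, MRef} = spawn_monitor(fun() ->
        short_lived = ets:new(short_lived, [named_table, public]),
        Parent ! ready,
        receive stop -> ok end
    end),
    receive ready -> ok end,
    %% While the owner is alive the table is reachable from here.
    Before = ets:info(short_lived, size),
    Pid ! stop,
    %% Wait until the owner has terminated.
    receive {'DOWN', MRef, process, Pid, _Reason} -> ok end,
    %% ets:info/1 returns undefined for a table that no longer exists.
    After = ets:info(short_lived),
    {Before, After}.
```

This is why the usual arrangement is to create the table in a long-lived process that you control, typically one started under a supervisor, rather than in whatever process happens to run on_load.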
Generating a parse transform to statically insert the result of compiling your re's directly into your code is a cool idea, especially if your re's are really that statically defined. As is the idea of dynamically generating, compiling and loading a module into your system. Again if your data is that static you could generate this module at compile time.
mochiglobal implements this by compiling a new module to store your constant(s). The advantage here is that the memory is shared across processes, whereas in ets it's copied and in the process dictionary it's just local to that one process.
https://github.com/mochi/mochiweb/blob/master/src/mochiglobal.erl