Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily

I'm parsing a fairly simple file format consisting of a series of lines, each line having some space separated fields, that looks like this:

l 0x9823 1
s 0x1111 3
l 0x1111 12
⋮

I'm using SWI-Prolog. This is the DCG I have so far:

:- consult(library(pure_input)).

load_trace(Filename, Traces) :-
    phrase_from_file(trace_file_phrase(Traces), Filename).

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) --> trace_phrase(T), trace_file_phrase(Ts).

trace_phrase(access(Type, Address, SinceLast)) -->
    access_type(Type), space,
    address(Address),  space,
    nat(SinceLast),    newline.

access_type(load)  --> "l".
access_type(store) --> "s".

address(Number) --> "0x", hexnum(Number).

hexdigit(N)  --> digit(N).
hexdigit(10) --> "a". hexdigit(11) --> "b". hexdigit(12) --> "c".
hexdigit(13) --> "d". hexdigit(14) --> "e". hexdigit(15) --> "f".
hexnum(N) --> hexdigit(D), hexnum(D, N).
hexnum(N, N) --> [].
hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).

newline --> "\n".
space --> " ".

%% the following two productions are courtesy of Lars Mans at
%% https://stackoverflow.com/questions/3279822/parsing-numbers-with-multiple-digits-in-prolog
digit(0) --> "0". digit(1) --> "1". digit(2) --> "2".
digit(3) --> "3". digit(4) --> "4". digit(5) --> "5".
digit(6) --> "6". digit(7) --> "7". digit(8) --> "8".
digit(9) --> "9".

nat(N)   --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).

As mentioned in the comment, I cribbed the number handling from Parsing numbers with multiple digits in Prolog.

The problem I'm running into is some of these files are large, like, on the order of 5-10 MB. The default stack in SWI-Prolog is insufficient for this, and parsing these files is taking substantial time, on the order of 5-15 seconds. I have several questions about this situation:

  1. Where is the efficiency problem in this code? I think it's either in trace_file_phrase//1 or nat//1 but these are just hunches.
  2. If the problem is lists, is there a better way to handle lists with DCGs than this?
  3. How does one, in general, diagnose and treat performance problems with DCGs such as this?
like image 636
Daniel Lyons Avatar asked Oct 17 '12 17:10

Daniel Lyons


2 Answers

As a general remark you will find more on SO about it under the name library(pio). Also the way to use it cleanly is rather:

:- use_module(library(pio)).

Your example is way too complex, so I will only consider a slightly simpler case, a newline separated list of numbers:

nats([]) --> [].
nats([N|Ns]) --> nat(N), newline, nats(Ns).

So, how can you test this effectively? (That's your Question 3) The basic point of library(pio) is that you can use regular DCGs for file processing. But for testing in the small you can still use the simple phrase/2. So I do:

?- phrase(nats(Ns),"1\n").
   Ns = [1]
;  false.

Did you see the ; prompt? That means: Prolog was not able to decide whether or not further answers might be computed - so it leaves one or more choice-points open. And that only for a single digit You can imagine how things will pile up.

Let's dig deeper:

?- phrase(digit(D),"1").
   D = 1
;  false.

Again the evil ; false! In order to make this work, everything would have to be determinate. There are three ways to this:

Use cuts (and lose your soul)

I wish you luck - the best seems to be just after the repeating element:

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   trace_phrase(T),
   !, % ugly, but...
   trace_file_phrase(Ts).

(This should answer Question 1)

But, hold on a minute! What is so bad about this !? As long, as there is exactly one answer to trace_phrase//1 things are perfect. It is only, if there are more answers (or actually solutions), that the cut might remove precious answers. How do you know, if there are more solutions? Well, you don't. And you won't see them as they have been cut away already.

call_semidet/1

Here is a way to ensure that this does not happen. This works only for side-effect free goals that can be called twice without any effect:

call_semidet(Goal) :-
   (  call_nth(Goal, 2)
   -> throw(error(mode_error(semidet,Goal),_))
   ;  once(Goal)
   ).

This uses call_nth/2, as defined in another post. (As an optimization, the implementation could avoid calling Goal twice when there is no choice-point open...) Just to make clear, how it works:

?- phrase(nat(N),"1234").
   N = 1234
;  false.
?- call_semidet(phrase(nat(N),"1234")).
   N = 1234.
?- call_semidet((X=1;X=2)).
   error(mode_error(semidet, (2=1;2=2)), _).

So it makes your little grammar effectively determinate! There is thus no need to reformulate anything!

What is lacking now is some integration of this into the grammar. You can do this very low-level, or rather cleanly using library(lambda).

phrase_semidet(NT) -->
   call(S0^S^call_semidet(phrase(NT,S0,S))).

Note that in this very particular case we do not use the \ for renaming.

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   phrase_semidet(trace_phrase(T)),
   trace_file_phrase(Ts).

Exploit indexing

Finally, a very laborious but clean way would be to rewrite everything to profit better from indexing (and maybe help to improve indexing in general...) But this is a long road. Just to start with:

digit(D) --> [C],
   {c_digit(C,D)}.

c_digit(0'0,0).
c_digit(0'1,1).
c_digit(0'2,2).
c_digit(0'3,3).
c_digit(0'4,4).
c_digit(0'5,5).
c_digit(0'6,6).
c_digit(0'7,7).
c_digit(0'8,8).
c_digit(0'9,9).

This gives you now:

?- phrase(digit(D),"1").
   D = 1.

But you have another source of nondeterminism which is rather due to the way you define the grammar. In nat//2 you see it:

nat(N,N) --> [].
nat(A,N) --> digit(D), ... .

The first rule always applies, that is, "1234\n" will be parsed as "1" "12" "123" "1234" only the following newline//0 realizes that it would suffice go for the last — and then stick to it.

You can rewrite things for that, but then, the code is no longer the pure little spec you liked, isn't it? Well, maybe things might improve in the future.

E.g. indexing is much better in SWI than it used to be, maybe also things evolve here too....

The intention of library(pio) was to get this process started. Compare this to Haskell — we are far away from interact efficiency-wise! But there is no inherent cost:

... --> [] | [_], ... .

?- phrase_from_file((...,"searchstring",...),fichier).

is just as efficient as grep — space-wise. That is, it runs in constant space. So hopefully more code will run better in the future.

Edit: BTW, library(pio) did already have an impact efficiency-wise: GC phases were improved significantly, very much in the same manner as Wadler's Fixing some space leak – paper a quarter century ago. Things evolve ...

Edit almost 10 years later: a related answer.

like image 95
false Avatar answered Oct 20 '22 20:10

false


I've verified the stackoverflow on a 2Mb file. Then I rewrote the grammar using library(dcg/basics), and now is working.

:- [library(dcg/basics)].

load_trace_0(Filename, Ls) :-
    phrase_from_file(lines(Ls), Filename).

lines([s(H,I)|R]) -->
    "s 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([l(H,I)|R]) -->
    "l 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([]) --> [].

But then I tried to put the cut on your grammar, and is working as well. So the answer from @gusbro (+1) solves your problem.

like image 6
CapelliC Avatar answered Oct 20 '22 20:10

CapelliC