I always seem to struggle to write DCG's to parse input files. But it seems it should be simple? Are there any tips or tricks to think about this problem?
For a concrete example, lets say I want to parse a fasta file. (https://en.wikipedia.org/wiki/FASTA_format). I want to read each description and each sequence on back tracking.
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- portray_text(true).
:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).
fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
fasta_section(Section),
fasta_file(Sections).
fasta_section(Section) -->
fasta_description(Description),
fasta_seq(Sequence),
{Section =.. [section,Description,Sequence]}.
fasta_description(Description) -->
">",
string(Description),
{no_gt(Description),
no_nl(Description)}.
fasta_seq([]) --> [].
fasta_seq(Seq) -->
nt([S]),
fasta_seq(Ss),
{S="X"->Seq =Ss;Seq=[S|Ss]}.
nt("A") --> "A".
nt("C") --> "C".
nt("G") --> "G".
nt("T") --> "T".
nt("X") --> "\n".
no_gt([]).
no_gt([E|Es]):-
dif([E],">"),
no_gt(Es).
no_nl([]).
no_nl([E|Es]):-
dif([E],"\n"),
no_nl(Es).
Now this is clearly wrong. The behaviour I would like is
?-phrase(fasta_section(S),">frog\nACGGGGTACG\n>duck\nACGTTAG").
S = section("frog","ACGGGGTACG");
S = section("duck","ACGTTAG");
false.
But if I did phrase(fasta_file(Sections),">frog\nACGGGGTACG\n>duck\nACGTTAG).
Sections is unified with a list of sections/2s, which is what I want, but my current code seems quite hacky- how I have handled the newline character for example.
for sure, there are 'small' typing problems:
nt("A") -->"A",
nt("C") -->"C",
nt("G") -->"G",
nt("T") -->"T".
should be
nt("A") -->"A".
nt("C") -->"C".
nt("G") -->"G".
nt("T") -->"T".
anyway, I also had my problems debugging DCG, I wrote a parser to load in Prolog a MySQL dump (plain SQL, really), and was a pain when something unexpected, like escaped strings, or UTF8 (?) weird encodings were found.
I would suggest to use phrase/3, to see if there is an unparsable tail. Also, could help to place some debug output after known, well behaved sequences.
Of course, I assume you already tried to use the SWI-Prolog debugger.
Also, beware of
...
dif([E],">"),
...
did you set the appropriate flag about double quotes ? In DCG bodies, the rewrite machinery takes care of matching, but a sequence of codes in SWI-Prolog by default doesn't match double quoted strings...
edit
I think this will not solve your doubt about a general strategy... anyway, it's how I would handle the problem...
fasta_file([]) -->[].
fasta_file([Section|Sections]) -->
fasta_section(Section),
fasta_file(Sections).
fasta_section(section(Description,Sequence)) -->
fasta_description(Description),
fasta_seq(SequenceCs), {atom_codes(Sequence, SequenceCs)}, !.
fasta_description(Description) -->
">", string(DescriptionCs), "\n", {atom_codes(Description, DescriptionCs)}.
fasta_seq([S|Seq]) --> nt(S), fasta_seq(Seq).
fasta_seq([]) --> "\n" ; []. % optional \n at EOF
nt(0'A) --> "A".
nt(0'C) --> "C".
nt(0'G) --> "G".
nt(0'T) --> "T".
now
?- phrase(fasta_file(S), `>frog\nACGGGGTACG\n>duck\nACGTTAG`).
S = [section(frog, 'ACGGGGTACG'), section(duck, 'ACGTTAG')] ;
false.
note: the order of clauses fasta_seq//1 is important, since it implements an 'eager' parsing - mainly for efficiency. As I said, I had to parse SQL, several MBs was common.
edit
?- phrase((string(_),fasta_section(S)), `>frog\nACGGGGTACG\n>duck\nACGTTAG`,_).
S = section(frog, 'ACGGGGTACG') ;
S = section(duck, 'ACGTTAG') ;
false.
fasta_section//1 is mean to match a definite sequence. To get all on backtracking we must provide a backtrack point. In this case, string//1 from library(dcg/basics) does the job
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With