Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should text-processing DCGs be written to handle codes or chars? Or both?

Tags:

char

prolog

dcg

In Prolog, there are traditionally two ways of representing a sequence of characters:

  • As a list of chars, which are atoms of length 1.
  • As a list of codes, which are just integers. The integers are to be interpreted as codepoints, but the convention to be applied is left unspecified. As a (eminently sane) example, in SWI-Prolog, the space of codepoints is Unicode (thus, roughly, the codepoint-integers range from 0 and 0x10FFFF).

DCGs, a notational way of writing left-to-right list processing code, are designed to perfom parsing on "lists of exploded text". Depending on preference, the lists to-be-handled can be lists of chars or lists of codes. However, the notation for char/code processing differs when writing down the constants. Does one generally write the DCG in "char style" or "code style"? Or maybe even in char/code style for portability in case of modules exporting DCG nonterminals?

Some Research

The following notations can be used to express constants in DCGs

  • 'a': A char (as usual: single quotes indicate an atom, and they can be left out if the token starts with a lowercase letter.)
  • 0'a: the code of a .
  • ['a','b']: A list of char.
  • [ 0'a, 0'b ]: A list of codes, namely the codes for a and b (so you can avoid typing in the actual codepoint values).
  • "a" a list of codes. Traditionally, a double-quoted string is exploded into a list of codes, and this notation also works SWI-Prolog in DCG contexts, even though SWI-Prolog maps a "double-quoted string" to the special string datatype otherwise.
  • `0123`. Traditonally, text within back-quotes is mapped to an atom (I think, the 95 ISO Standard just avoids being specific regarding the meaning of a back-quoted string. "It would be a valid extension of this part of ISO/IEC 13211 to define a back quoted string as denoting a character string constant."). In SWI-Prolog, text within back-quotes is exploded into a list of codes unless the flag back_quotes has been set to demand a different behaviour.

Examples

Char style

Trying to recognize "any digit" in "char style" and make its "char representation" available in C:

zero(C) --> [C],{C = '0'}. 

nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

Code style

Trying to recognize "any digit" in "code style":

zero(C) --> [C],{C = 0'0}.

nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

Char/Code transparent style

DCGs can be written as "char/code transparent style" by duplicating the rules involving constants. In the above example:

zero(C) --> [C],{C = '0'}. 
zero(C) --> [C],{C = 0'0}.

nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

The above also accepts a sequence of alternating codes and chars (as lists of stuff cannot be typed). This is probably not a problem). When generating, one will get arbitrary char/code mixes which are unwanted, and then cuts need to be added.

Char/Code transparent style taking an additional Mode indicator

Another approach would be to explicitly indicate the mode. Looks clean:

zero(C,chars) --> [C],{C = '0'}. 
zero(C,codes) --> [C],{C = 0'0}.

nonzero(C,chars) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).

Char/Code transparent style using dialect features

Alternatively, features of the Prolog dialect can be used to achieve char/code transparency. In SWI-Prolog, there is code_type/2, which actually works on codes and chars (there is a corresponding char_type/2 but IMHO there should be only chary_type/2 working for chars and codes in any case) and for "digit-class" codes and chars yield the compound digit(X):

?- code_type(0'9,digit(X)).
X = 9.

?- code_type('9',digit(X)).
X = 9.

?- findall(W,code_type('9',W),B).
B = [alnum,csym,prolog_identifier_continue,ascii,
     digit,graph,to_lower(57),to_upper(57),
     digit(9),xdigit(9)].

And so one can write this for clean char/code transparency:

zero(C) --> [C],{code_type(C,digit(0)}. 

nonzero(C) --> [C],{code_type(C,digit(X),X>0}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

In SWI-Prolog in particular

SWI-Prolog by default prefers codes. Try this:

The flags

  • back_quotes
  • double_quotes

influence interpretation of "string" and `string` in "standard code". By default"string" is interpreted as an atomic "string" whereas `string` is interpreted as a "list of codes".

Outside of DCGs, the following holds in SWI-Prolog, with all flags at their default:

?- string("foo"),\+atom("foo"),\+is_list("foo").
true.

?- L=`foo`.
L = [102,111,111].

However, in DCGs, both "string" and `string` are interpreted as "codes" by default.

Without any settings changed, consider this DCG:

representation(double_quotes)    --> "bar".            % SWI-Prolog decomposes this into CODES 
representation(back_quotes)      --> `bar`.            % SWI-Prolog decomposes this into CODES
representation(explicit_codes_1) --> [98,97,114].      % explicit CODES (as obtained via atom_codes(bar,Codes))
representation(explicit_codes_2) --> [0'b,0'a,0'r].    % explicit CODES 
representation(explicit_chars)   --> ['b','a','r'].    % explicit CHARS

Which of the above matches codes?

?- 
findall(X,
   (atom_codes(bar,Codes),
    phrase(representation(X),Codes,[])),
   Reps).

Reps = [double_quotes,back_quotes,explicit_codes_1,explicit_codes_2].

Which of the above matches chars?

?- findall(X,
   (atom_chars(bar,Chars),phrase(representation(X),Chars,[])),
   Reps).
Reps = [explicit_chars].

When starting swipl with swipl --traditional the backquoted representation is rejected with Syntax error: Operator expected , but otherwise nothing changes.

like image 991
David Tonhofer Avatar asked Feb 21 '21 10:02

David Tonhofer


Video Answer


2 Answers

The Prolog Standard (6.3.7) says:

A double quoted list is either an atom (6.3.1.3) or a list (6.3.5).

Consequently, the following should succeed:

Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- Foo = "foo", (atom(Foo) ; Foo = [F, O, O]).
false.

So SWI-Prolog is not a Prolog by default. That's OK, but if you want to know about SWI-Prolog's non-Prolog behavior, please adjust the tags on the question.

From the definition it also follows that double quoted lists are completely useless by default even in a conforming Prolog: They might denote atoms, so regardless of the chars/codes distinction you can't even know that the double quoted list is actually a list. Even DCGs that only care about structural properties of the "text" (whether it's a palindrome, for example) are useless if the "list" is in fact an atom.

Hence a Prolog program that wants to process text with DCGs must at startup force the double_quotes flag to the value it wants. You have the choice between codes and chars. Codes have no advantages over chars, but they do have disadvantages in readability and typeability. Thus:

Answer: Use chars. Set the double_quotes flag explicity.

like image 165
Isabelle Newbie Avatar answered Sep 21 '22 23:09

Isabelle Newbie


I should start to by noting that the answer to the "Should text-processing DCGs be written to handle codes or chars? Or both?" question can be neither. DCGs work by using an implicit difference list to thread state. But the elements of that difference list can be other than chars or codes. It depends on the output of text tokenization and what exactly text processing entails. E.g. I have worked on and come across Prolog NLP applications where codes/chars were only used for the basic tokenization and the reasoning was performed (still with DCGS) using either atoms or compound terms that reified the token data (e.g. v(Verb) or n(Noun)). One of those applications (a personal assistant like it's common nowadays in phones) used atoms produced by a voice-recognition component.

But let's go back to chars vs codes. Legacy practices and failed standardization left Prolog with problematic text representation. ASCII gives us a singe quote, a back quote, and a double-quote. With single quotes being used for atoms, a choice could have been made to use e.g. back quotes for representing a list of codes and double-quotes for representing a list of chars. Or the other way around. Instead, and this is where standardization failed, we got the problematic double_quotes flag. There's no shortage of Prolog code in the wild that makes an assumption about the meaning of double-quoted terms and thus works or breaks depending on the implicit value of the double_quotes flag (if you think this is mainly an issue with legacy code, think again). Guess what happens when we try to combine code that require different values for the flag? Note that, in almost all systems (including those that support modules), the flag value is global ... As Isabelle wrote in her answer, setting the flag explicitly is good general advice. But not, as I explained, without problems.

Some systems provide additional values for the flag. E.g. SWI-Prolog allows the flag to also be set to string. GNU Prolog supports additional atom_no_escape, chars_no_escape and codes_no_escape. Some systems only support codes. Some systems also provide a back_quotes flag. This Babel tower means that portable and resilient code is often forced to use atoms to represent text. But this is may not be ideal from a performance perspective.

Back to the original question. As Isabelle mentioned, chars is usually a more readable (read, easier to debug) choice. But, depending on the Prolog system, codes may provide better performance. If application performance is critical, benchmark both solutions. Some recent Prolog systems (e.g. Scryer-Prolog or Trealla Prolog) have efficient support for chars. Older systems may trail behind.

like image 32
Paulo Moura Avatar answered Sep 19 '22 23:09

Paulo Moura