
Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code.

The rules that describe exactly what is allowed in G-type literals, and what is allowed in identifiers, are unclear.

The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.

EDIT: Revised/extended below text for clarity:

Does anyone know the exact rules of G literal formation, and how they (don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):

#token non_numeric_literal_quote_g [STRING]
  "<G><squote><ShiftOut> (  
     (<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)  
     (<NotLineOrParagraphSeparator>|<squote><squote>)

     | <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
                   <ShiftIn>|<ShiftOut>)

     | <squote><squote>

 )* <ShiftIn><squote>"

where <name> is a macro that stands for another regular expression. Presumably the macros are named well enough that you can guess what they contain.

Here is the IBM Enterprise COBOL Reference. Chapter 3 "Character Strings", subheading "DBCS literals", page 32 is the relevant reading. I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means when it says "one or more characters in the range X'00...X'FF for either byte". How can DBCS-characters be anything but pairs of 8-bit character codes? If you examine the existing RE, you'll see that it matches three kinds of character pairs.

One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that means the RE would only reject literal strings containing single <squote>s. I don't believe that's the problem we are having as we seem to trip over every instance of a G literal.

Similarly, COBOL identifiers can apparently be composed of DBCS characters. What is allowed in an identifier, exactly? Again, a regular expression would be ideal.

EDIT2: I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text, and our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. What is likely happening is that the DBCS data is getting translated as if it were Shift-JIS, and that would muck up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
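To make the EDIT2 concern concrete, here is a minimal Python sketch (not our actual lexer). It first shows a two-byte Shift-JIS pair collapsing into one character after decoding, then scans a G literal on the raw bytes by counting pairs. Everything in it is an assumption for illustration: the helper name `scan_g_literal`, SO/SI encoded as 0x0E/0x0F, ASCII quote bytes, and the rule that the first byte of a pair is never SI.

```python
from typing import Optional

SO = 0x0E              # assumed shift-out byte
SI = 0x0F              # assumed shift-in byte
QUOTES = (0x27, 0x22)  # ' and " in an ASCII-compatible file (assumption)

# Decoding merges the two bytes of a pair into a single character,
# so pair counting is no longer possible on the decoded text.
pair = b"\x83\x41"
print(len(pair), len(pair.decode("shift_jis")))


def scan_g_literal(data: bytes, pos: int = 0) -> Optional[int]:
    """Return the index just past a G literal starting at pos, or None.

    Hypothetical helper: works on undecoded bytes and assumes the first
    byte of a DBCS pair is never SI.
    """
    if len(data) < pos + 3:  # need at least G, quote, SO
        return None
    if data[pos] not in b"Gg" or data[pos + 1] not in QUOTES:
        return None
    if data[pos + 2] != SO:
        return None
    quote = data[pos + 1]
    i = pos + 3
    while i < len(data):
        if data[i] == SI:
            # SI must be followed immediately by the closing quote.
            if i + 1 < len(data) and data[i + 1] == quote:
                return i + 2
            return None
        i += 2  # skip one two-byte pair without interpreting it
    return None  # ran off the end: unterminated or odd-length content
```

Because the pairs are skipped rather than interpreted, a quote byte that happens to be the second byte of a pair does not end the literal. The cost is that the lexer has to see the bytes before any Shift-JIS decoding happens.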

Ira Baxter Avatar asked Sep 09 '09 05:09


2 Answers

Try allowing a single quote in your rule, to see if it passes, by making this change:

  <squote><squote> => <squote>{1,2}

If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.

EDIT: I thought you had all the other DBCS literals working and were only having issues with the G string, so I just pointed out the difference between N and G. Now I have taken a closer look at your RE. It has problems. In the COBOL I used, you can mix ASCII with Japanese, for example,

  G"ABC<ヲァィ>" <> are Shift-out/shift-in

Your RE assumes DBCS only. I would loosen this restriction and try again.

I don't think it's possible to handle G literals entirely with a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatched tokens manually.

You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, and treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
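As a rough illustration of the "simplify and handle it manually" idea, here is a small hand-written scanner in Python. It is only a sketch: the name `scan_mixed_literal` is made up, SO/SI are assumed to arrive as the characters U+000E/U+000F in already-decoded text, and treating a doubled quote outside a shifted run as an escaped quote is an assumption carried over from ordinary alphanumeric literals.

```python
from typing import Optional

SO = "\x0e"  # assumed shift-out character in the decoded text
SI = "\x0f"  # assumed shift-in character in the decoded text


def scan_mixed_literal(text: str, pos: int = 0) -> Optional[int]:
    """Return the index just past a G literal starting at pos, or None.

    Hypothetical helper: accepts ASCII mixed with SO...SI runs and tracks
    the shift state by hand instead of encoding it in a regular expression.
    """
    if len(text) < pos + 2 or text[pos] not in "Gg" or text[pos + 1] not in "'\"":
        return None
    quote = text[pos + 1]
    shifted = False
    i = pos + 2
    while i < len(text):
        ch = text[i]
        if shifted:
            # Inside SO...SI, only SI is special; quotes are plain data.
            if ch == SI:
                shifted = False
        elif ch == SO:
            shifted = True
        elif ch == quote:
            if text[i + 1:i + 2] == quote:
                i += 2  # doubled quote treated as an escape (assumption)
                continue
            return i + 1  # closing quote found outside a shifted run
        elif ch == "\n":
            return None  # literal not closed on this line
        i += 1
    return None  # unterminated literal or unbalanced SO
```

For a literal like G"ABC<ヲァィ>" with real SO/SI in place of < and >, the scanner returns the position just after the closing quote. Mismatched SO/SI or a missing quote shows up as None, which the caller can report however it likes.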

I am just trying to help you shoot in the dark without seeing the actual code :)

ZZ Coder Avatar answered Oct 02 '22 08:10


Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...

I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.

lcv Avatar answered Oct 02 '22 07:10