
How do I parse a complex file format in Delphi? (Not CSV, XML, etc.)

It's been a few years since I've had to parse any files which were harder than CSV or XML so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application.

The problem is I just don't know where to start: do I use a tokenizer, regex, etc.? Maybe even a tutorial is what I need at this point.

Daisetsu asked Jul 20 '10 21:07


4 Answers

Have a look at GOLD Parser. It's a meta-parsing system that allows you to define a formal grammar for a language/file format. It creates a parsing rules file which you feed into a tokenizer, together with your input file, and it creates a syntax tree in memory.

There's a Delphi implementation of the tokenizer available on the website. It makes parsing a lot easier since the lexing and tokenizing are already taken care of for you, and all you have to worry about is defining the tokens in a formal grammar and then interpreting them once they've been parsed.

Mason Wheeler answered Nov 01 '22 13:11


Check this out, it's commercial, but it looks like a fun toy:

http://dpg.zenithlab.com/

But actually, for NeXus you do not need a complicated parser.

A bit of position checking code, and some string-splitting and parenthesis counting, and you've got it written.

I would parse it using a simple token-at-a-time parser like this:

  1. Load the file into a TStringList.
  2. For each line, grab one token at a time to determine the line type.
    Have an enumerated type for this line type.
  3. The first valid non-blank line should be detected to be a valid #nexus tag.
  4. Next comes the header area (mostly skipped, it looks like).
  5. begin is the first keyword on the line.
  6. The following lines inside the begin block appear to be almost like a DOS command and its command-line parameters. They are separated by spaces and end with semicolons, pretty much like Pascal, but with parentheses.
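The steps above can be sketched roughly as follows. This is a minimal sketch, not a NeXus parser: the TLineType values, the FirstWord and ClassifyLine helpers, and the sample lines are all made up for illustration, and a real program would call LoadFromFile instead of adding lines by hand.

```pascal
program ClassifyLinesDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical line categories; the names are illustrative only.
  TLineType = (ltBlank, ltHeaderTag, ltBegin, ltEnd, ltCommand);

const
  LineTypeNames: array[TLineType] of string =
    ('blank', 'header', 'begin', 'end', 'command');

// Returns the first word of a line, lower-cased. The word stops at
// whitespace or at a semicolon.
function FirstWord(const Line: string): string;
var
  S: string;
  P: Integer;
begin
  S := Trim(Line);
  P := 1;
  while (P <= Length(S)) and (S[P] > ' ') and (S[P] <> ';') do
    Inc(P);
  Result := LowerCase(Copy(S, 1, P - 1));
end;

// Decides the line type from the first word only.
function ClassifyLine(const Line: string): TLineType;
var
  W: string;
begin
  W := FirstWord(Line);
  if W = '' then
    Result := ltBlank
  else if W = '#nexus' then
    Result := ltHeaderTag
  else if W = 'begin' then
    Result := ltBegin
  else if W = 'end' then
    Result := ltEnd
  else
    Result := ltCommand;
end;

var
  Lines: TStringList;
  I: Integer;
begin
  Lines := TStringList.Create;
  try
    // Made-up sample input; a real program would use Lines.LoadFromFile.
    Lines.Add('#NEXUS');
    Lines.Add('');
    Lines.Add('begin taxa;');
    Lines.Add('  dimensions ntax=3;');
    Lines.Add('end;');
    for I := 0 to Lines.Count - 1 do
      Writeln(LineTypeNames[ClassifyLine(Lines[I])], ': ', Lines[I]);
  finally
    Lines.Free;
  end;
end.
```

Classifying on the first word keeps the per-line logic small; the contents of each command line can then be handed to a separate token splitter.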

For the above I would code for myself a little set of helpers, and eventually one of the things I might need to write is a little token splitting function like this:

function GetToken(var inputString: String; var outputToken: String; const Separators: TStrings; Keywords: TStrings; ParenFlag: Boolean): Boolean;

GetToken returns True when it is able to find and return a token string from inputString. It skips any leading whitespace and terminates when it finds a separator. Separators are items like space or comma.
When ParenFlag is True, the next token returned should be an entire parenthesized list of items. Once I have the whole parenthesized list, such as (((a,b),(c,d),(e,f))), I would call another function that unpacks the content of that list into some data structure for the lists/arrays.
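Here is one way such a GetToken could look. It is a simplified sketch under stated assumptions: separators are passed as a plain string of characters rather than a TStrings, the Keywords parameter is left out, and the sample input line is invented. Unbalanced parentheses simply yield the rest of the string.

```pascal
program GetTokenDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Simplified variant of the GetToken idea. Removes the next token from the
// front of Source and returns it in Token. Returns False when nothing is left.
function GetToken(var Source: string; out Token: string;
  const Separators: string; ParenFlag: Boolean): Boolean;
var
  P, Start, Depth: Integer;
begin
  P := 1;
  // Skip leading whitespace and separator characters.
  while (P <= Length(Source)) and
        ((Source[P] <= ' ') or (Pos(Source[P], Separators) > 0)) do
    Inc(P);
  Start := P;
  if ParenFlag and (P <= Length(Source)) and (Source[P] = '(') then
  begin
    // Count parentheses until the opening one is balanced again.
    Depth := 0;
    while P <= Length(Source) do
    begin
      if Source[P] = '(' then
        Inc(Depth)
      else if Source[P] = ')' then
        Dec(Depth);
      Inc(P);
      if Depth = 0 then
        Break;
    end;
  end
  else
    // Ordinary token: read up to the next whitespace or separator.
    while (P <= Length(Source)) and (Source[P] > ' ') and
          (Pos(Source[P], Separators) = 0) do
      Inc(P);
  Token := Copy(Source, Start, P - Start);
  Delete(Source, 1, P - 1);
  Result := Token <> '';
end;

var
  Line, Tok: string;
begin
  // Invented sample line, for illustration only.
  Line := 'tree t1 = ((a,b),(c,d),(e,f));';
  GetToken(Line, Tok, ',;=', False);
  Writeln(Tok);  // tree
  GetToken(Line, Tok, ',;=', False);
  Writeln(Tok);  // t1
  GetToken(Line, Tok, ',;=', True);
  Writeln(Tok);  // ((a,b),(c,d),(e,f))
end.
```

Because the token is cut off the front of the input string on each call, the caller can loop until GetToken returns False and switch ParenFlag on only where a list is expected.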

I do not recommend the big parser engine, though writing a BNF grammar first, before you write the parser, will help you write the code. But there's nothing so brutal here that you can't parse it.

Are you going to be expected to do queries/transforms on this? Do you think you need to convert it into JSON or XML in order to work further with it?

Warren P answered Nov 01 '22 13:11


In addition to Mason's very nice answer: there is a great little class in Delphi that is often underappreciated, and one that you can learn a really nice technique from, and that's the PageProducer class.

Have a look at the way that it parses HTML and surfaces events on things like finding tags, attributes, etc. I'm not saying use the PageProducer (because you won't be able to for NeXus), but it's a very simple, elegant and powerful technique.
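To illustrate the event-surfacing idea in isolation, here is a small sketch. It does not use or reproduce PageProducer: TWordScanner, TWordEvent and TCollector are hypothetical names invented for this example, and the scanner only splits on whitespace.

```pascal
program EventScannerDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical event signature: fired once per word found.
  TWordEvent = procedure(Sender: TObject; const AWord: string) of object;

  // Hypothetical scanner that reports each word through an event
  // instead of building a result itself.
  TWordScanner = class
  private
    FOnWord: TWordEvent;
  public
    procedure Scan(const Source: string);
    property OnWord: TWordEvent read FOnWord write FOnWord;
  end;

  // Consumer that collects whatever the scanner reports.
  TCollector = class
  public
    Words: TStringList;
    constructor Create;
    destructor Destroy; override;
    procedure HandleWord(Sender: TObject; const AWord: string);
  end;

procedure TWordScanner.Scan(const Source: string);
var
  I, Start: Integer;
begin
  I := 1;
  while I <= Length(Source) do
  begin
    // Skip whitespace between words.
    while (I <= Length(Source)) and (Source[I] <= ' ') do
      Inc(I);
    Start := I;
    while (I <= Length(Source)) and (Source[I] > ' ') do
      Inc(I);
    // Fire the event only if a word was found and a handler is attached.
    if (I > Start) and Assigned(FOnWord) then
      FOnWord(Self, Copy(Source, Start, I - Start));
  end;
end;

constructor TCollector.Create;
begin
  inherited Create;
  Words := TStringList.Create;
end;

destructor TCollector.Destroy;
begin
  Words.Free;
  inherited Destroy;
end;

procedure TCollector.HandleWord(Sender: TObject; const AWord: string);
begin
  Words.Add(AWord);
end;

var
  Scanner: TWordScanner;
  Collector: TCollector;
  I: Integer;
begin
  Scanner := TWordScanner.Create;
  Collector := TCollector.Create;
  try
    Scanner.OnWord := Collector.HandleWord;
    Scanner.Scan('begin taxa; end;');
    for I := 0 to Collector.Words.Count - 1 do
      Writeln(Collector.Words[I]);
  finally
    Collector.Free;
    Scanner.Free;
  end;
end.
```

The scanner knows nothing about what the words mean; all interpretation lives in the handler, which is what makes the technique easy to reuse.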

Tim Jarvis answered Nov 01 '22 14:11


I haven't yet found a text format that a state machine won't parse. Add in recursion to run down nests in trees. State machines are easily written, relatively quick parsing engines that can be built for virtually any patterned text file, and they are often easier than using a scripting language to boot. I have custom ones written for HTML, XML, HL7 and a variety of medical EDI formats.
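As a rough illustration of the state-machine approach, here is a minimal sketch of a three-state scanner. The states, the Scan procedure and the sample input are assumptions made for this example: whitespace separates words, a single-quoted run becomes one token, and an unterminated quote is silently dropped. Recursion for nested structures is not shown.

```pascal
program StateMachineDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Scanner states; the names are illustrative only.
  TScanState = (ssIdle, ssWord, ssQuoted);

// Splits Source into tokens and appends them to Tokens.
procedure Scan(const Source: string; Tokens: TStrings);
var
  State: TScanState;
  Current: string;
  I: Integer;
  Ch: Char;
begin
  State := ssIdle;
  Current := '';
  for I := 1 to Length(Source) do
  begin
    Ch := Source[I];
    case State of
      ssIdle:
        // Between tokens: a quote or any visible character starts one.
        if Ch = '''' then
          State := ssQuoted
        else if Ch > ' ' then
        begin
          Current := Ch;
          State := ssWord;
        end;
      ssWord:
        // Inside a plain word: whitespace ends it.
        if Ch <= ' ' then
        begin
          Tokens.Add(Current);
          Current := '';
          State := ssIdle;
        end
        else
          Current := Current + Ch;
      ssQuoted:
        // Inside quotes: only the closing quote ends the token.
        if Ch = '''' then
        begin
          Tokens.Add(Current);
          Current := '';
          State := ssIdle;
        end
        else
          Current := Current + Ch;
    end;
  end;
  // Flush a word that runs to the end of the input.
  if State = ssWord then
    Tokens.Add(Current);
end;

var
  Tokens: TStringList;
  I: Integer;
begin
  Tokens := TStringList.Create;
  try
    // Invented sample input, for illustration only.
    Scan('title ''my data set'' rows 3', Tokens);
    for I := 0 to Tokens.Count - 1 do
      Writeln(I, ': ', Tokens[I]);
  finally
    Tokens.Free;
  end;
end.
```

Each new construct in the format (comments, escapes, nested lists) becomes one more state or one more transition, which is why this style scales to most patterned text files.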

Cameron answered Nov 01 '22 12:11