It's been a few years since I've had to parse any files harder than CSV or XML, so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application.
The problem is I just don't know where to start. Do I use a tokenizer, regex, etc.? Maybe even a tutorial is what I need at this point.
Have a look at GOLD Parser. It's a meta-parsing system that allows you to define a formal grammar for a language/file format. It creates a parsing rules file which you feed into a tokenizer, together with your input file, and it creates a syntax tree in memory.
There's a Delphi implementation of the tokenizer available on the website. It makes parsing a lot easier since the lexing and tokenizing are already taken care of for you, and all you have to worry about is defining the tokens in a formal grammar and then interpreting them once they've been parsed.
Check this out, it's commercial, but it looks like a fun toy:
http://dpg.zenithlab.com/
But actually, for NeXus you do not need a complicated parser.
A bit of position checking code, and some string-splitting and parenthesis counting, and you've got it written.
I would parse it using a simple token-at-a-time parser like this:
For the above, I would write myself a small set of helpers, and one of the things I would eventually need is a little token-splitting function like this:
function GetToken(var inputString: String; out outputToken: String; const Separators: TStrings; Keywords: TStrings; ParenFlag: Boolean): Boolean;
GetToken would return True when it was able to find and return a token string from inputString. It would skip any leading whitespace and stop when it finds a separator. Separators are items like space or comma.
ParenFlag = True would mean that the next token I get should be an entire parenthesized list of items. Once I have the whole parenthesized list, e.g. (((a,b),(c,d),(e,f))), I would call another function that unpacks the content of that list into some data structure for the lists/arrays.
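To make the idea concrete, here is a rough sketch of such a token function. It is not the GetToken above: NextToken is a hypothetical, cut-down variant that takes the separators as a plain string of characters, ignores keywords, and treats any character at or below space as whitespace. The input line is made up for the demo and is not taken from a real file.

```delphi
program TokenSketch;

{$APPTYPE CONSOLE}

// Hypothetical, simplified variant of the GetToken idea: separators are a
// string of characters instead of a TStrings, and there is no keyword list.
// Returns False when nothing but whitespace/separators is left in Input.
function NextToken(var Input: string; out Token: string;
  const Separators: string; ParenFlag: Boolean): Boolean;
var
  I, Depth: Integer;
begin
  Token := '';

  // Skip leading whitespace and separator characters.
  I := 1;
  while (I <= Length(Input)) and
        ((Input[I] <= ' ') or (Pos(Input[I], Separators) > 0)) do
    Inc(I);
  Delete(Input, 1, I - 1);

  Result := Input <> '';
  if not Result then
    Exit;

  I := 1;
  if ParenFlag and (Input[1] = '(') then
  begin
    // Parenthesis counting: consume up to and including the matching ')'.
    // An unbalanced list simply runs to the end of the input.
    Depth := 0;
    repeat
      if Input[I] = '(' then
        Inc(Depth)
      else if Input[I] = ')' then
        Dec(Depth);
      Inc(I);
    until (Depth = 0) or (I > Length(Input));
  end
  else
    // Ordinary token: stop at the first separator character.
    while (I <= Length(Input)) and (Pos(Input[I], Separators) = 0) do
      Inc(I);

  // Hand back the token and remove it from the front of the input.
  Token := Copy(Input, 1, I - 1);
  Delete(Input, 1, I - 1);
end;

var
  Line, Tok: string;
begin
  Line := 'name = (((a,b),(c,d),(e,f)));';  // made-up input line
  if NextToken(Line, Tok, ' =;', False) then
    Writeln('word: ', Tok);
  if NextToken(Line, Tok, ' =;', True) then
    Writeln('list: ', Tok);
end.
```

The caller decides when to set ParenFlag, which keeps the function itself free of any knowledge about the file format. Error handling for unbalanced parentheses is deliberately left out of the sketch.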
I do not recommend the big parser engine, but writing a BNF grammar first, before you write the parser, will help you write the code. There's nothing so brutal here that you can't parse it.
Are you going to be expected to do queries/transforms on this? Do you think you need to convert it into json or xml in order to work further with it?
In addition to Mason's very nice answer: there is a great little class in Delphi that is often underappreciated, and one that you can learn a really nice technique from, and that's the PageProducer class.
Have a look at the way that it parses HTML and surfaces events on things like finding tags, attributes etc. I'm not saying use the PageProducer (because you won't be able to for NeXus), but it's a very simple, elegant and powerful technique.
I haven't yet found a text format that a state machine won't parse. Add in recursion to run down nests in trees. State machines are easy to write and make a relatively quick parsing engine that can be built for virtually any patterned text file. Often that is easier than using a scripting language, to boot. I have custom ones written for HTML, XML, HL7 and a variety of medical EDI formats.
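As a rough illustration of the state machine plus recursion idea (a sketch, not code from any of those parsers), the following walks a nested, comma-separated list one character at a time. Everything in it is hypothetical: the TState values, ParseLevel, the sample input, and in particular the single-quoted names, which are included only to show why tracking a state matters. The unit name Classes assumes classic naming; newer Delphi versions may want System.Classes.

```delphi
program StateMachineSketch;

{$APPTYPE CONSOLE}

uses
  Classes;

type
  TState = (stBetween, stInName, stInQuoted);

// Reads one nesting level of S starting at Index and adds each name to
// Names, indented two spaces per level. On '(' it recurses for the nested
// list; on ')' it returns with Index just past that ')'.
procedure ParseLevel(const S: string; var Index: Integer; Depth: Integer;
  Names: TStrings);
var
  State: TState;
  Name: string;
  Ch: Char;

  // Stores the name collected so far (if any) and resets the machine.
  procedure Flush;
  begin
    if State <> stBetween then
      Names.Add(StringOfChar(' ', Depth * 2) + Name);
    Name := '';
    State := stBetween;
  end;

begin
  State := stBetween;
  Name := '';
  while Index <= Length(S) do
  begin
    Ch := S[Index];
    Inc(Index);
    case State of
      stInQuoted:
        // Inside quotes every character is literal until the closing quote.
        if Ch = '''' then
          Flush
        else
          Name := Name + Ch;
      stBetween, stInName:
        case Ch of
          '''': begin Flush; State := stInQuoted; end;
          '(':  begin Flush; ParseLevel(S, Index, Depth + 1, Names); end;
          ')':  begin Flush; Exit; end;
          ',', ' ': Flush;
        else
          begin
            Name := Name + Ch;
            State := stInName;
          end;
        end;
    end;
  end;
  Flush;  // end of input: store whatever is still pending
end;

var
  Names: TStringList;
  Index, I: Integer;
begin
  Names := TStringList.Create;
  try
    Index := 1;
    ParseLevel('x,(a,(b,''c d'')),y', Index, 0, Names);  // made-up input
    for I := 0 to Names.Count - 1 do
      Writeln(Names[I]);
  finally
    Names.Free;
  end;
end.
```

The current state decides what a character means (a comma inside quotes is just text), and the recursive call keeps the nesting depth on the call stack, so no separate parenthesis counter is needed.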