I’m recently writing a small programming language, and have finished writing its parser. I want to write an automated test for the parser (that its result is an abstract syntax tree), but I’m not sure which way is better.
First what I tried is just to serialize AST to S-expression text and compare it to the expected output text I wrote by hand, but it has some problems:
There are trivial meaningless differences between a serialized text and the expected output like whitespaces. For example, there is no difference between:
(attribute (symbol str) (symbol length))
(that is serialized) and:
(attribute (symbol str)
(symbol length))
(that is handwritten by me) in their meanings, but string comparison distincts them of course. Okay, I could resolve it by normalization.
When a test fails, it doesn’t show the difference between actual tree and expected tree concisely. I want to show only a difference node, not whole tree.
Second what I tried is to write S-expression parser and compare AST that parser (to be tested) generates to AST that S-expression parser (that I just implemented) generates from the handwritten expected output. However I realized that S-expression have to be tested also and it could be really nonsense.
I wonder what is the typical and easy way to test the parser.
PS. I am using Java, and dont’t want any dependencies to third-party libraries.
Strictly speaking, unit tests should not use the file system because it is slow. However, readability is more important. XML in a file is easier to read, and can be loaded in an XML friendly editor.
So, what is text parsing? In simple terms, it is a common programming task that separates the given series of text into smaller components based on some rules. Its application ranges from document parsing to deep learning NLP.
A parser is a compiler or interpreter component that breaks data into smaller elements for easy translation into another language. A parser takes input in the form of a sequence of tokens, interactive commands, or program instructions and breaks them up into parts that can be used by other components in programming.
Providing you are looking for a completely automated and extensible unit testing framework for your parser I'd recommend the following approach:
Create a set of samples of incorrect inputs. Then feed the parse with each of them making sure the parser rejects them. I's a good idea to provide metadata for each test case that defines the expected output — the specific error code / message the parser is supposed to produce.
As in the previous case, create a set of samples representing various correct inputs. Besides the simple validation that the parser accepts all inputs, there's still the problem of validating that the actual Abstract Syntax Tree makes sense.
To address this problem I'd do the following: Describing the expected AST for each test case in some well-known format that can be safely parsed — deserialized into the actual in-memory AST structures — by a 3rd party parser considered bug-free (for your case). The natural choice is XML since most languages / programming frameworks cover XML support and provide the respective (de)serialization facilities. The best solution would be to deserialize right into the AST node types. Since convenient visual editing tools for XML exist it's feasible to construct even large test cases.
Then I'd construct an AST comparer using the visitor pattern which pair-up the two ASTs and compare both nodes in each pair for equality. However, equality is a per-AST-node-type specific operation.
Notes:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With