Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
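A minimal sketch of the kind of depth-counting heuristic described above. It is an illustrative reconstruction, not the project's actual code: the function name `naive_block_spans` and the choice of `{`/`}` as delimiters are assumptions made for the example, and it deliberately knows nothing about comments or unusual string syntaxes.

```python
def naive_block_spans(source, open_ch="{", close_ch="}"):
    """Return (start_line, end_line, enclosing_depth) for each bracketed block.

    Heuristic only: it understands single- and double-quoted strings with
    backslash escapes, and nothing else (no comments, macros, heredocs).
    """
    spans = []
    stack = []       # line numbers where the currently open blocks started
    quote = None     # the quote character we are inside, if any
    escaped = False  # True right after a backslash inside a string
    line = 1
    for ch in source:
        if ch == "\n":
            line += 1
        if quote is not None:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
            continue
        if ch in "\"'":
            quote = ch
        elif ch == open_ch:
            stack.append(line)
        elif ch == close_ch and stack:
            start = stack.pop()
            spans.append((start, line, len(stack)))
    return spans
```

On well-behaved input the spans come out right, but a closing bracket inside a comment is enough to end a block early, which is the sort of problem mentioned above.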

To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
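For one language, at least, the parse-tree route is easy to illustrate. The sketch below uses Python's standard `ast` module to measure Python classes and functions; it covers Python source only, the helper name `definition_lengths` is made up for the example, and it assumes Python 3.8 or later (for `end_lineno`).

```python
import ast

def definition_lengths(source):
    """Return (kind, name, start_line, end_line, length) per class/function."""
    tree = ast.parse(source)
    kinds = {
        ast.ClassDef: "class",
        ast.FunctionDef: "function",
        ast.AsyncFunctionDef: "function",
    }
    results = []
    for node in ast.walk(tree):
        kind = kinds.get(type(node))
        if kind is not None:
            # lineno/end_lineno come from the real parser, so brackets inside
            # strings or comments cannot shift the boundaries.
            length = node.end_lineno - node.lineno + 1
            results.append((kind, node.name, node.lineno, node.end_lineno, length))
    return sorted(results, key=lambda r: r[2])
```

The equivalent for other languages is exactly what is missing: each one needs its own parser that exposes definitions and their extents.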

Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)

asked Apr 02 '10 03:04

Arkaaito

2 Answers

If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.

Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to be collected somewhere, but I've never seen one.

I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production-quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine, but these are used daily with DMS to carry out mass change tasks on large bodies of code.

I don't know of any others where the set of language definitions is mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.

If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to get right in most cases (check out Python's, Perl's, and PHP's crazy string syntaxes). When all is said and done, even C takes a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
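To illustrate lexer-based nest-counting on one language, here is a rough sketch using Python's standard `tokenize` module. It handles Python source only and has nothing to do with DMS; the function name `max_bracket_depth` is invented for the example. Because a real lexer decides what is a string or a comment, brackets inside them never reach the counter.

```python
import io
import tokenize

OPENERS = ("(", "[", "{")
CLOSERS = (")", "]", "}")

def max_bracket_depth(source):
    """Return the deepest bracket nesting, counting only operator tokens."""
    depth = 0
    deepest = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Strings and comments arrive as STRING/COMMENT tokens, so any
        # brackets they contain are never seen as OP tokens here.
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth += 1
                deepest = max(deepest, depth)
            elif tok.string in CLOSERS:
                depth -= 1
    return deepest
```

Getting the same reliability for C, Perl, or PHP means having an equally accurate lexer for each of them, which is where the real effort goes.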

Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.
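The general idea of indexing lexemes can be sketched in a few lines. This is only a toy illustration of the concept, not how the SCSE is implemented: it assumes Python source, uses Python's standard `tokenize` module as the lexer, and the name `index_lexemes` is hypothetical.

```python
import io
import tokenize
from collections import defaultdict

def index_lexemes(files):
    """Map each identifier/keyword to the (filename, line) pairs where it occurs.

    `files` is a dict of filename -> source text.
    """
    index = defaultdict(set)
    for name, source in files.items():
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            # Only NAME tokens are indexed; the same word inside a string
            # or a comment is a different kind of lexeme and is skipped.
            if tok.type == tokenize.NAME:
                index[tok.string].add((name, tok.start[0]))
    return index
```

A lookup such as `index["total"]` then returns only the places where `total` is used as a name, which a plain text search cannot distinguish.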

answered Nov 01 '22 12:11

Ira Baxter

You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

answered Nov 01 '22 11:11

Michael Aaron Safyan

Tags: parsing, code-metrics, antlr, parser-generator