Poor man's "lexer" for C#

Tags:

I'm trying to write a very simple parser in C#.

I need a lexer -- something that lets me associate regular expressions with tokens, so it reads in regexs and gives me back symbols.

It seems like I ought to be able to use Regex to do the actual heavy lifting, but I can't see an easy way to do it. For one thing, Regex only seems to work on strings, not streams (why is that!?!?).

Basically, I want an implementation of the following interface:

interface ILexer : IDisposable {     /// <summary>     /// Return true if there are more tokens to read     /// </summary>     bool HasMoreTokens { get; }     /// <summary>     /// The actual contents that matched the token     /// </summary>     string TokenContents { get; }     /// <summary>     /// The particular token in "tokenDefinitions" that was matched (e.g. "STRING", "NUMBER", "OPEN PARENS", "CLOSE PARENS"     /// </summary>     object Token { get; }     /// <summary>     /// Move to the next token     /// </summary>     void Next(); }  interface ILexerFactory {     /// <summary>     /// Create a Lexer for converting a stream of characters into tokens     /// </summary>     /// <param name="reader">TextReader that supplies the underlying stream</param>     /// <param name="tokenDefinitions">A dictionary from regular expressions to their "token identifers"</param>     /// <returns>The lexer</returns>     ILexer CreateLexer(TextReader reader, IDictionary<string, object> tokenDefinitions); }

So, pluz send the codz...
No, seriously, I am about to start writing an implementation of the above interface yet I find it hard to believe that there isn't some simple way of doing this in .NET (2.0) already.

So, any suggestions for a simple way to do the above? (Also, I don't want any "code generators". Performance is not important for this thing and I don't want to introduce any complexity into the build process.)

990

asked Mar 23 '09 11:03

Paul Hollingsworth

2 Answers

The original version I posted here as an answer had a problem in that it only worked while there was more than one "Regex" that matched the current expression. That is, as soon as only one Regex matched, it would return a token - whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".

The only solution that sits on top of Regex is to read the input line-by-line (which means you cannot have tokens that span multiple lines). I can live with this - it is, after all, a poor man's lexer! Besides, it's usually useful to get line number information out of the Lexer in any case.

So, here's a new version that addresses these issues. Credit also goes to this

public interface IMatcher {     /// <summary>     /// Return the number of characters that this "regex" or equivalent     /// matches.     /// </summary>     /// <param name="text">The text to be matched</param>     /// <returns>The number of characters that matched</returns>     int Match(string text); }  sealed class RegexMatcher : IMatcher {     private readonly Regex regex;     public RegexMatcher(string regex) => this.regex = new Regex(string.Format("^{0}", regex));      public int Match(string text)     {         var m = regex.Match(text);         return m.Success ? m.Length : 0;     }     public override string ToString() => regex.ToString(); }  public sealed class TokenDefinition {     public readonly IMatcher Matcher;     public readonly object Token;      public TokenDefinition(string regex, object token)     {         this.Matcher = new RegexMatcher(regex);         this.Token = token;     } }  public sealed class Lexer : IDisposable {     private readonly TextReader reader;     private readonly TokenDefinition[] tokenDefinitions;      private string lineRemaining;      public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)     {         this.reader = reader;         this.tokenDefinitions = tokenDefinitions;         nextLine();     }      private void nextLine()     {         do         {             lineRemaining = reader.ReadLine();             ++LineNumber;             Position = 0;         } while (lineRemaining != null && lineRemaining.Length == 0);     }      public bool Next()     {         if (lineRemaining == null)             return false;         foreach (var def in tokenDefinitions)         {             var matched = def.Matcher.Match(lineRemaining);             if (matched > 0)             {                 Position += matched;                 Token = def.Token;                 TokenContents = lineRemaining.Substring(0, matched);                 lineRemaining = lineRemaining.Substring(matched);                 if (lineRemaining.Length == 0)                     nextLine();                  return true;             }         }         throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",                                           LineNumber, Position, lineRemaining));     }      public string TokenContents { get; private set; }     public object Token   { get; private set; }     public int LineNumber { get; private set; }     public int Position   { get; private set; }      public void Dispose() => reader.Dispose(); }

Example program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";  var defs = new TokenDefinition[] {     // Thanks to [steven levithan][2] for this great quoted string             // regex     new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),     // Thanks to http://www.regular-expressions.info/floatingpoint.html     new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),     new TokenDefinition(@"[-+]?\d+", "INT"),     new TokenDefinition(@"#t", "TRUE"),     new TokenDefinition(@"#f", "FALSE"),     new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),     new TokenDefinition(@"\.", "DOT"),     new TokenDefinition(@"\(", "LEFT"),     new TokenDefinition(@"\)", "RIGHT"),     new TokenDefinition(@"\s", "SPACE") };  TextReader r = new StringReader(sample); Lexer l = new Lexer(r, defs); while (l.Next())     Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);

Output:

Token: LEFT Contents: ( Token: SPACE Contents: Token: SYMBOL Contents: one Token: SPACE Contents: Token: LEFT Contents: ( Token: SYMBOL Contents: two Token: SPACE Contents: Token: INT Contents: 456 Token: SPACE Contents: Token: FLOAT Contents: -43.2 Token: SPACE Contents: Token: QUOTED-STRING Contents: " \" quoted" Token: SPACE Contents: Token: RIGHT Contents: ) Token: RIGHT Contents: )

108

answered Sep 21 '22 15:09

Paul Hollingsworth

It may be overkill, but have a look at Irony on CodePlex.

Irony is a development kit for implementing languages on .NET platform. It uses the flexibility and power of c# language and .NET Framework 3.5 to implement a completely new and streamlined technology of compiler construction. Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. Irony's scanner and parser modules use the grammar encoded as c# class to control the parsing process. See the expression grammar sample for an example of grammar definition in c# class, and using it in a working parser.

answered Sep 21 '22 15:09

Andy Dent

Related questions
                            
                                How do I get the taskbar's position and size?
                            
                                Is it possible insert image to a code comment?
                            
                                Unit testing HTTP requests in c#
                            
                                MVVM: Modified model, how to correctly update ViewModel and View?
                            
                                c# Abstract Class implementing an Interface
                            
                                Getting started with protobuf-net
                            
                                None of the constructors found with 'Autofac.Core.Activators.Reflection.DefaultConstructorFinder'
                            
                                Change the JSON serialization settings of a single ASP.NET Core controller
                            
                                How to set the output path of several visual C# projects
                            
                                Howto write a function taking variable number of arguments in F#
                            
                                Is there a way to get the stacktraces for all threads in c#, like java.lang.Thread.getAllStackTraces()?
                            
                                Stubbing / mocking a database in .Net
                            
                                what is the difference between these two methods?
                            
                                RoutePrefix vs Route
                            
                                Why does C# require parentheses when using nullables in an expression?
                            
                                Throwing exceptions in callback method for Timers
                            
                                How to find one image inside of another?
                            
                                How would I return an object or multiple values from PowerShell to executing C# code
                            
                                How different async programming is from Threads?
                            
                                Enabling Microsoft's Code Analysis on .NET Core Projects

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Poor man's "lexer" for C#

Tags:

c#

regex

lexer

Paul Hollingsworth

People also ask

2 Answers

Paul Hollingsworth

Andy Dent

Recent Activity

Donate For Us