
Parsing VBA Const declarations... with regex

Tags:

c#

regex

parsing

I'm trying to write a VBA parser; in order to create a ConstantNode, I need to be able to match all possible variations of a Const declaration.

These work beautifully:

  • Const foo = 123
  • Const foo$ = "123"
  • Const foo As String = "123"
  • Private Const foo = 123
  • Public Const foo As Integer = 123
  • Global Const foo% = 123

But I have 2 problems:

  1. If there's a comment at the end of the declaration, I'm picking it up as part of the value:

    Const foo = 123 'this comment is included as part of the value
    
  2. If there are two or more constants declared in the same instruction, I fail to match the entire instruction:

    Const foo = 123, bar = 456 
    

Here is the regular expression I'm using:

    /// <summary>
    /// Gets a regular expression pattern for matching a constant declaration.
    /// </summary>
    /// <remarks>
    /// Constants declared in class modules may only be <c>Private</c>.
    /// Constants declared at procedure scope cannot have an access modifier.
    /// </remarks>
    public static string GetConstantDeclarationSyntax()
    {
        return @"^((Private|Public|Global)\s)?Const\s(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?<as>\sAs\s(?<reference>(((?<library>[a-zA-Z][a-zA-Z0-9_]*))\.)?(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)))?\s\=\s(?<value>.*)$";
    }

Obviously both issues are caused by the (?<value>.*)$ part, which matches anything up until the end of the line. I got VariableNode to support multiple declarations in one instruction by enclosing the whole pattern in a capture group and adding an optional comma, but because constants have this value group, doing that resulted in the first constant having all following declarations captured as part of its value... which brings me back to problem #1.

I wonder if it's at all possible to solve problem #1 with a regular expression, given that the value may be a string that contains an apostrophe, and possibly some escaped (doubled-up) double quotes.

I think I can solve it in the ConstantNode class itself, in the getter for Value:

/// <summary>
/// Gets the constant's value. Strings include delimiting quotes.
/// </summary>
public string Value
{
    get
    {
        return RegexMatch.Groups["value"].Value;
    }
}

I mean, I could implement some additional logic in here, to do what I can't do with a regex.


If problem #1 can be solved with a regex, then I believe problem #2 can be as well. Am I on the right track here, or should I ditch the [pretty complex] regex patterns and think of another way? I'm not too familiar with greedy subexpressions, backreferences and other more advanced regex features. Is that what's limiting me, or am I just using the wrong hammer for this nail?

Note: it doesn't matter that the patterns potentially match illegal syntax - this code will only run against compilable VBA code.

Mathieu Guindon asked Nov 07 '14 06:11



1 Answer

Let me start with a disclaimer: this is absolutely not a good idea (but it was a fun challenge). The regexes I'm about to present will parse the test cases in the question, but they are obviously not bulletproof. Using a parser will save you a lot of headaches later. I did try to find a parser for VBA, but came up empty-handed (and I'm assuming everyone else has too).

Regex

For this to work nicely, you need to have some control over the VBA code coming in. If you can't do this, then you truly need to be looking at writing a parser instead of using Regexes. However, judging from what you already said, you may have a little bit of control. So maybe this will help out.

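For this, I had to split the regex into two distinct regexes. The reason is that named groups nested inside a repeating group are awkward to work with in .NET: every capture is kept in the group's Captures collection, but an optional group such as the specifier is not tied to the repetition it came from.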

The first regex captures the whole line and places each declaration (the variable together with its value) into a repeated group; the second regex then parses each declaration. Just FYI, the regexes make use of negative lookaheads.

^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!").)+")?(?:,\s)?){1,}(?:'(?<comment>.+))?$

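The `variable` group above sits inside a `{1,}` repetition, so it has to be read back through its `Captures` collection rather than through `Value`. As a small illustration, here is a sketch with a deliberately simplified stand-in pattern (integer values only, exactly one space around `=`); `CapturesDemo` and the `decl` group name are made up for this example and are not part of the regexes in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CapturesDemo
{
    // Simplified stand-in pattern (an assumption for illustration):
    // integer values only, one space on each side of "=".
    public static readonly Regex Declarations = new Regex(
        @"^Const\s(?<decl>\w+\s=\s\d+(?:,\s)?)+$");

    public static void Main()
    {
        var match = Declarations.Match("Const foo = 123, bar = 456");

        // Groups["decl"].Value only holds the last repetition;
        // Captures keeps every repetition, in order.
        foreach (Capture capture in match.Groups["decl"].Captures)
        {
            // Trim the separator that the group swallowed.
            Console.WriteLine(capture.Value.TrimEnd(',', ' '));
        }
    }
}
```

Each capture still contains the raw declaration text, which is why a second regex is then run over every capture.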

Here's the second regex, which parses each individual declaration:

(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!").)+")?),?


And here's some C# code you can toss in to test everything out. This should make it easy to test any edge cases you have.

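using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The members below are meant to live inside a class, e.g. a console app's Program class.
static void Main(string[] args)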
{
    List<String> test = new List<string> {
        "Const foo = 123",
        "Const foo$ = \"123\"",
        "Const foo As String = \"1'2'3\"",
        "Const foo As String = \"123\"",
        "Private Const foo = 123",
        "Public Const foo As Integer = 123",
        "Global Const foo% = 123",
        "Const foo = 123 'this comment is included as part of the value",
        "Const foo = 123, bar = 456",
        "'Const foo As String = \"123\"",
    };


    foreach (var str in test)
        Parse(str);

    Console.Read();
}

private static Regex parse = new Regex(@"^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!"").)+"")?(?:,\s)?){1,}(?:'(?<comment>.+))?$", RegexOptions.Compiled | RegexOptions.Singleline, new TimeSpan(0, 0, 20));
private static Regex variableRegex = new Regex(@"(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!"").)+"")?),?", RegexOptions.Compiled | RegexOptions.Singleline, new TimeSpan(0, 0, 20));

public static void Parse(String str)
{
    Console.WriteLine(String.Format("Parsing: {0}", str));

    var match = parse.Match(str);

    if (match.Success)
    {
        //Private/Public/Global
        var accessibility = match.Groups["Accessibility"].Value;
        //Since we defined this with at least one capture, there should always be something here.
        foreach (Capture variable in match.Groups["variable"].Captures)
        {
            //Console.WriteLine(variable);
            var variableMatch = variableRegex.Match(variable.Value);
            if (variableMatch.Success) 
            {
                Console.WriteLine(String.Format("Identifier: {0}", variableMatch.Groups["identifier"].Value));

                if (variableMatch.Groups["specifier"].Success)
                    Console.WriteLine(String.Format("specifier: {0}", variableMatch.Groups["specifier"].Value));

                if (variableMatch.Groups["reference"].Success)
                    Console.WriteLine(String.Format("reference: {0}", variableMatch.Groups["reference"].Value));

                Console.WriteLine(String.Format("value: {0}", variableMatch.Groups["value"].Value));

                Console.WriteLine("");
            }
            else
            {
                Console.WriteLine(String.Format("FAILED VARIABLE: {0}", variable.Value));
            }

        }

        if (match.Groups["comment"].Success)
        {
            Console.WriteLine(String.Format("Comment: {0}", match.Groups["comment"].Value));
        }
    }
    else
    {
        Console.WriteLine(String.Format("FAILED: {0}", str));
    }

    Console.WriteLine("+++++++++++++++++++++++++++++++++++++++++++++");
    Console.WriteLine("");
}

The C# code was just what I was using to test my theory, so I apologize for the craziness in it.

For completeness, here's a small sample of the output. If you run the code you'll get more output, but this shows directly that it can handle the situations you were asking about.

Parsing: Const foo = 123 'this comment is included as part of the value
Identifier: foo
value: 123
Comment: this comment is included as part of the value


Parsing: Const foo = 123, bar = 456
Identifier: foo
value: 123

Identifier: bar
value: 456

What it handles

Here are the major cases I can think of that you're probably interested in. It should still handle everything you had before as I just added to the regex you provided.

  • Comments
  • Multiple variable declarations on a single line
  • The apostrophe (comment character) within a string value, e.g. foo = "She's awesome"
  • If the line starts with a comment, the line should be ignored

What it doesn't handle

The one thing I didn't really handle was spacing, but it shouldn't be hard to add that in yourself if you need it. For instance, if you declare multiple variables there MUST be a space after the comma, i.e. (VALID: foo = 123, foobar = 124) (INVALID: foo = 123,foobar = 124).
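One way the comma spacing could be relaxed is to turn the separator `(?:,\s)?` into `(?:,\s*)?` and leave everything else alone. The sketch below does only that; `SpacingDemo` and `Relaxed` are names invented for this example, and other spacing (such as around `=`) is still as strict as before.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpacingDemo
{
    // Same as the first regex in this answer, except that the separator
    // (?:,\s)? has been changed to (?:,\s*)? so the space is optional.
    public static readonly Regex Relaxed = new Regex(
        @"^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!"").)+"")?(?:,\s*)?){1,}(?:'(?<comment>.+))?$",
        RegexOptions.Singleline);

    public static void Main()
    {
        var lines = new[]
        {
            "Const foo = 123,foobar = 124",  // no space after the comma
            "Const foo = 123, foobar = 124", // original spacing
        };

        foreach (var line in lines)
        {
            var match = Relaxed.Match(line);
            Console.WriteLine("{0} -> {1} declaration(s)",
                line, match.Groups["variable"].Captures.Count);
        }
    }
}
```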

You won't get much leniency on the format from it, but there's not a whole lot you can do with that when using regexes.
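A value that combines an apostrophe with doubled-up quotes, as the question mentions, is worth testing separately. As a sketch of one possible alternative for the value part, a string literal can be consumed as a single unit (an opening quote, then any run of non-quote characters or doubled quotes, then a closing quote), so that apostrophes and commas inside it are never inspected. The pattern below is a cut-down, single-declaration version with no specifier or `As` clause; `QuotedValueDemo` and `ReadValue` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedValueDemo
{
    // Value = one or more of: a complete string literal (doubled quotes allowed
    // inside it), or any single character that is not ', " or a comma.
    public static readonly Regex Declaration = new Regex(
        @"^(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)\s=\s(?<value>(?:""(?:[^""]|"""")*""|[^'"",])+)");

    // Returns the value without surrounding whitespace, or null if nothing matched.
    public static string ReadValue(string declaration)
    {
        var match = Declaration.Match(declaration);
        return match.Success ? match.Groups["value"].Value.Trim() : null;
    }

    public static void Main()
    {
        // The VBA text here is: foo = "it's ""x""" 'note
        Console.WriteLine(ReadValue("foo = \"it's \"\"x\"\"\" 'note"));
        Console.WriteLine(ReadValue("bar = 123 'note"));
    }
}
```

The value stops at the first apostrophe or comma that is outside a string literal, so the same sub-pattern could in principle be dropped into the second regex in place of its current `value` group.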


Hope this helps you out, and if you need any more explanation of how any of this works, just let me know. Just know this is a bad idea: you'll run into situations that the regex can't handle. If I were in your position, I'd consider writing a simple parser, which would give you greater flexibility in the long run. Good luck.
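As a rough idea of what the first step of such a simple parser might look like, here is a minimal sketch of a hand-written scanner. It assumes it is given the text that follows `Const `, tracks whether it is inside a string literal, splits on commas outside strings, and stops at the first apostrophe outside a string. `ConstScanner` and `SplitDeclarations` are hypothetical names, and things like `Rem` comments and line continuations are not considered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ConstScanner
{
    // Splits the text after "Const " into "name = value" chunks.
    // The trailing comment, if any, is returned through 'comment' (null otherwise).
    public static List<string> SplitDeclarations(string body, out string comment)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        bool inString = false;
        comment = null;

        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c == '"')
            {
                // A doubled quote toggles the flag twice, so it ends up unchanged.
                inString = !inString;
                current.Append(c);
            }
            else if (!inString && c == '\'')
            {
                // Apostrophe outside a string: the rest of the line is a comment.
                comment = body.Substring(i + 1);
                break;
            }
            else if (!inString && c == ',')
            {
                // Comma outside a string: end of one declaration.
                parts.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        string last = current.ToString().Trim();
        if (last.Length > 0)
            parts.Add(last);
        return parts;
    }

    public static void Main()
    {
        string comment;
        var parts = SplitDeclarations("foo = \"She's, awesome\", bar = 456 'note", out comment);
        foreach (var part in parts)
            Console.WriteLine(part);
        Console.WriteLine("Comment: " + comment);
    }
}
```

Each chunk can then be split at its first `=` (or handed to a much smaller regex), which keeps the string and comment handling out of the pattern entirely.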

Nathan answered Oct 13 '22 14:10
