
How to match a method block using regex?

Tags:

java

regex

Take an example.

 public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}

I wrote a regex, "\\s*public\\s*static.*getFieldsConfig\\(.*\\)\\s*\\{"

It matches only the first line. How can I match all the way to the last "}" of the method?

Help me. Thanks.

Edit: The content inside the method's {} is not fixed, but the pattern always looks like this:

  public static xxx theKnownMethodName(xxxx) {
    xxxxxxx
  }
Victor Choy asked Mar 10 '16 09:03




5 Answers

I decided to take it one step further ;)

Here's a regex that'll give you the modifiers, type, name and body of a function in different capture groups:

((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\s+)*)
\s*(\w+)\s*(\w+)\(.*?\)\s*({(?:{[^{}]*}|.)*?})

It handles nested braces (@callOfCode it is (semi-)possible with regex ;) and a fixed set of modifiers.

It doesn't handle complicated stuff like braces inside comments and stuff like that, but it'll work for the simplest ones.

Regards

Regex101 sample here

Edit: And to answer your question ;), what you're interested in is capture group 4.

Edit 2: As I said - simple ones. But you could make it more complicated to handle more complicated methods. Here's an updated version that handles one more level of nesting.

((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\s+)*)
\s*(\w+)\s*(\w+)\(.*?\)\s*({(?:{[^{}]*(?:{[^{}]*}|.)*?[^{}]*}|.)*?})

And you could add another level... and another... But as someone commented, this shouldn't be done with regex. It does, however, handle simple methods.
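For anyone trying this from Java, here is a minimal sketch of the first pattern above (not part of the original answer). It assumes Java's regex engine, which rejects an unescaped { outside a character class, so the braces are escaped, and it assumes Pattern.DOTALL so that . can cross line breaks. The class name MethodBodyGroups and the helper firstMethod are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MethodBodyGroups {
    // Java port of the first pattern: literal braces are escaped outside character classes.
    static final Pattern METHOD = Pattern.compile(
        "((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*)"
        + "\\s*(\\w+)\\s*(\\w+)\\(.*?\\)\\s*"
        + "(\\{(?:\\{[^{}]*\\}|.)*?\\})",
        Pattern.DOTALL); // assumption: . must match newlines for multi-line bodies

    /** Returns {modifiers, type, name, body} of the first match, or null if nothing matches. */
    static String[] firstMethod(String source) {
        Matcher m = METHOD.matcher(source);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String source =
            " public static FieldsConfig getFieldsConfig(){\n"
            + "    if(xxx) {\n"
            + "      sssss;\n"
            + "    }\n"
            + "   return;\n"
            + "}\n";
        String[] parts = firstMethod(source);
        System.out.println("name: " + parts[2]);
        System.out.println("body: " + parts[3]); // capture group 4
    }
}
```

As in the answer, this only copes with one level of braces inside the body, and braces in comments or string literals will confuse it.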

SamWhan answered Sep 30 '22 17:09


Regex is definitely not the best tool for that, but if you want regex, and your code is well indented, you can try with:

^(?<indent>\s*)(?<mod1>\w+)\s(?<mod2>\w+)?\s*(?<mod3>\w+)?\s*(?<return>\b\w+)\s(?<name>\w+)\((?<arg>.*?)\)\s*\{(?<body>.+?)^\k<indent>\}

DEMO

It has additional named groups, which you can delete. It uses the indentation level to find the last }.
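A rough sketch of how this pattern might be used from Java (my own illustration, not from the answer). It assumes the MULTILINE flag, so that ^ matches at every line start, and the DOTALL flag, so that the body can span lines. The class name IndentMatch and the sample input are invented, and the sample is deliberately well indented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndentMatch {
    // Same pattern as above, written as a Java string literal.
    static final Pattern METHOD = Pattern.compile(
        "^(?<indent>\\s*)(?<mod1>\\w+)\\s(?<mod2>\\w+)?\\s*(?<mod3>\\w+)?\\s*"
        + "(?<return>\\b\\w+)\\s(?<name>\\w+)\\((?<arg>.*?)\\)\\s*\\{(?<body>.+?)^\\k<indent>\\}",
        Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns a matcher positioned on the first method found, or null. */
    static Matcher find(String source) {
        Matcher m = METHOD.matcher(source);
        return m.find() ? m : null;
    }

    public static void main(String[] args) {
        String source =
            "  public static FieldsConfig getFieldsConfig(){\n"
            + "    if(xxx) {\n"
            + "      sssss;\n"
            + "    }\n"
            + "    return;\n"
            + "  }\n";
        Matcher m = find(source);
        System.out.println("name: " + m.group("name"));
        System.out.println("body: " + m.group("body"));
    }
}
```

The closing brace is found only because it sits at exactly the same indentation as the first line of the method, so a sample like the one in the question (signature indented, closing brace at column 0) would not match.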

m.cekiera answered Sep 30 '22 17:09


Victor, you asked me to look at your answer, so I decided to take the time to write a full review of it and give some hints. I'm not a regex professional, nor do I like regex very much. I'm currently working on a project that uses regex heavily, so I've seen and written enough of it to answer your question pretty reliably, as well as to get sick of regexes. So let's start analyzing your regex:

String regex ="\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";

String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";

regex = "(" + regex +")|("+ regex2 + "){1}?";

I see you've built it from three parts for readability. That's a good idea. I'll start with the first part:

  • \\s*public\\s*static.*getFieldsConfig You allow any number of whitespace characters, including zero, between public and static, so it could match publicstatic. Always use \\s+ between words that must be separated by whitespace.
  • \\(.*?\\)\\s*\\{.*\\} You allow anything to appear between the first parentheses; it matches any character up to ). Now we reach the part that makes your regex behave differently from what you wanted. \\{.*\\} is a major mistake. Because .* is greedy, it matches everything up to the last } in the file that is followed by one of public, private, protected or static. I pasted your getFieldsConfig method into a Java file and tested it. Using only the first part of your regex ("\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))") matched everything from your method up to the last method in the file.

There is no point in analyzing the other parts step by step, because \\{.*\\} ruins everything. In the second part (regex2) you match everything from your method to the last } in the file. Have you tried printing what your regex matches? Try it:

package com.tryRegex;

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TryRegex{

    public static void main(String[] args) throws IOException{
        File yourFile = new File("tryFile.java");
        Scanner scanner = new Scanner(yourFile, "UTF-8");
        String text = scanner.useDelimiter("\\A").next();  // `\\A` matches the beginning of the input. Since the input has only one beginning, next() returns the whole file as a single token.

        String regex ="\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";
        String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";
        regex = "(?s)(" + regex +")|("+ regex2 + "){1}?";     // (?s) lets . match newline characters, which the text read from the file contains. Without (?s) nothing would match unless your method were written on a single line.

        Matcher m = Pattern.compile(regex).matcher(text);

        System.out.println(m.find() ? m.group() : "No Match found");
    }
}

A short and simple piece of code to show how your regex works. Handle the exception if you want. Just put tryFile.java in your project folder and run it.

Now I will show you how messy regexes actually are:

String methodSignature = "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*[\\w<>\\[\\]\\.]+\\s+\\w+\\s*\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";
String regex = "(?s)" + methodSignature + ".*?(?="+ methodSignature + ")";

Basically, this regex matches every method. But it also has flaws. I will explore it as well as its flaws.

  • \\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)* Matches any of the specified modifiers (each followed by at least one whitespace character) any number of times, including zero, since a method may have no modifier. (I've left the number of modifiers unlimited for the sake of simplicity. In a real parser I wouldn't allow this, nor would I use regex for such a task.)
  • [\\w<>\\[\\]\\.]+ This is the method's return type. It can contain word characters, <> for generic types, [] for arrays, and . for nested class notation.
  • \\s+\\w+\\s* Name of the method.
  • \\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\) An especially tricky part: the method parameters. At first you might think that this part could easily be replaced with a catch-all such as \\(.*?\\). I thought so too. But then I noticed that it would match not only methods but also anonymous classes such as new Anonymous(someVariable){....}. The simplest and most efficient way to avoid this is to specify the structure of the method parameters. [\\w<>\\[\\]\\.] lists the characters a parameter type can be made of. \\s+\\w+\\s*,? says that the parameter type is followed by at least one whitespace character and the parameter name, and that the name may be followed by , if the method has more than one parameter.
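The pieces described in the list above can be tried out with a short sketch like the following. It only lists what the methodSignature part matches in a piece of source text; the class name SignatureScan, the helper signatures and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignatureScan {
    // The methodSignature regex from above, split over lines for readability.
    static final String METHOD_SIGNATURE =
        "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*"
        + "[\\w<>\\[\\]\\.]+\\s+\\w+\\s*"
        + "\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";

    /** Collects every substring that looks like a method signature, trimmed. */
    static List<String> signatures(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(METHOD_SIGNATURE).matcher(source);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String source =
            "public static FieldsConfig getFieldsConfig(){\n"
            + "  return null;\n"
            + "}\n"
            + "int add(int a, int b){\n"
            + "  return a + b;\n"
            + "}\n";
        for (String signature : signatures(source)) {
            System.out.println(signature);
        }
    }
}
```

Because the parameter part requires a type followed by a name, new Anonymous(someVariable) is skipped, but an expression with empty parentheses such as new SomeClass() is still reported as if it were a signature.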

So what about the flaws? The major flaw is classes defined inside methods. A method can contain class definitions. Consider this situation:

public void regexIsAGoodThing(){
  //some code
  new RegexIsNotSoGoodActually(){
    void dissapontingMethod(){
       //The effort put into writing this regex was pointless because of this disappointing method.
    }
  };
}

This explains very well why regex is not the proper tool for this job. It is not possible to parse a method out of a Java file reliably with regex, because a method may be a nested structure: it may contain class definitions, those classes can contain methods that contain further class definitions, and so on. A regex cannot follow this unbounded nesting and fails.
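To illustrate what handling arbitrary nesting takes, here is a minimal sketch that is not regex-only: it uses a small regex just to find the signature of a known public static method and then counts braces by hand. This is my own illustration rather than something proposed in this answer. The names BraceCounter and extractMethod are hypothetical, and the sketch assumes there are no braces inside comments, strings or character literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceCounter {
    /**
     * Returns the source text of the named public static method, from its
     * signature to the matching closing brace, or null if it is not found.
     */
    static String extractMethod(String source, String methodName) {
        Pattern signature = Pattern.compile(
            "public\\s+static\\s+[\\w<>\\[\\].]+\\s+" + Pattern.quote(methodName)
            + "\\s*\\([^)]*\\)\\s*\\{");
        Matcher m = signature.matcher(source);
        if (!m.find()) {
            return null;
        }
        int depth = 0;
        // m.end() - 1 is the opening brace that ends the signature match.
        for (int i = m.end() - 1; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return source.substring(m.start(), i + 1);
            }
        }
        return null; // braces are unbalanced
    }

    public static void main(String[] args) {
        String source =
            "public static void outer(){\n"
            + "  new Runnable(){\n"
            + "    public void run(){ }\n"
            + "  };\n"
            + "}\n"
            + "public static void next(){ }\n";
        System.out.println(extractMethod(source, "outer"));
    }
}
```

The counter copes with any nesting depth, which is exactly the part a fixed regex cannot express. Comments and string literals containing braces would still need real tokenizing.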

Another case where regex would fail is comments. You can type anything in a comment.

void happyRegexing(){
     return;
     // public void happyRegexingIsOver(){....}
}

One more thing that we must not forget is annotations. What if the next method is annotated? The regex will match almost fine, except that it will match the annotation too. This can be avoided, but then the regex becomes even larger.

public void goodDay(){

}

@Zzzzz //This annotation can be handled by making our regex even larger
public void goodNight(){

}

Another case is initializer blocks. What if a static or instance initializer block sits between two methods?

public void iWillNotDoThisAnyMore(){

}

static{
    //some code
}

public void iWillNotParseCodeWithRegex(){
    //end of story
}

P.S. It has another flaw: it matches new SomeClass() and everything up to the next method signature. You can work around this, but again, it would be a workaround rather than elegant code. And I haven't included end-of-file matching. Maybe I will add an edit tomorrow if you're interested. Going to sleep now, it's close to morning in Europe. As you can see, regex is an almost good tool for most tasks. But we programmers hate the word almost. We don't even have it in our vocabularies, do we?

callOfCode answered Sep 30 '22 18:09


Try this

((?<space>\h+)public\s+static\s+[^(]+\([^)]*?\)\s*\{.*?\k<space>\})|(public\s+static\s+[^(]+\([^)]*?\)\s*\{.*?\n\})

Explanation:
We capture the method block from the keyword public to the closing }. public and } must be preceded by the same whitespace, so your code must be well formatted : ) https://en.wikipedia.org/wiki/Indent_style

\h: matches horizontal whitespace, but not newlines
(?<space>\h+): captures all whitespace before public into the group named space
public\s+static\s+: public static
[^(]: any character except (
[^)]: any character except )
\k<space>\}: the same whitespace as captured in space, followed by the closing }.

Demo

Input:

public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}

NO CAPTURE

public static FieldsConfig getFieldsConfig2(){
    if(xxx) {
      sssss;
    }
   return;
}

NO CAPTURE

    public static FieldsConfig getFieldsConfig3(){
        if(xxx) {
          sssss;
        }
       return;
    }

NO CAPTURE

        public static FieldsConfig getFieldsConfig4(){
            if(xxx) {
              sssss;
            }
           return;
        }

Output:

MATCH 1
3.  [0-91]  `public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}`

MATCH 2
3.  [105-197]   `public static FieldsConfig getFieldsConfig2(){
    if(xxx) {
      sssss;
    }
   return;
}`

MATCH 3
1.  [211-309]   `   public static FieldsConfig getFieldsConfig3(){
        if(xxx) {
          sssss;
        }
       return;
    }`

MATCH 4
1.  [324-428]   `       public static FieldsConfig getFieldsConfig4(){
            if(xxx) {
              sssss;
            }
           return;
        }`
Tim007 answered Sep 30 '22 19:09


Thank you all. After some consideration, I worked out an approach that is reliable to some degree in my situation. I'm sharing it here.

String regex ="\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}(?=\\s*(public|private|protected|static))";

String regex2 = "\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}(?=(\\s*}\\s*$))";

regex = "(" + regex +")|("+ regex2 + "){1}?";

Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);

It matches my method body well.
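A minimal sketch of how the pattern above might be exercised. Every regex backslash is doubled, as a Java string literal requires. The class name FinalRegexDemo, the helper match and the sample class are assumptions for illustration, and the shared prefix is pulled into a constant purely to keep the sketch short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalRegexDemo {
    // Common prefix of both alternatives: signature, opening brace, lazy body, closing brace.
    static final String HEAD =
        "\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}";

    /** Returns the matched getFieldsConfig method, trimmed, or null if nothing matches. */
    static String match(String source) {
        // Alternative 1: the method is followed by another member.
        String regex = HEAD + "(?=\\s*(public|private|protected|static))";
        // Alternative 2: the method is the last member before the class's closing brace
        // (that brace is escaped here, which is equivalent to the unescaped form).
        String regex2 = HEAD + "(?=(\\s*\\}\\s*$))";
        Pattern pattern = Pattern.compile("(" + regex + ")|(" + regex2 + "){1}?", Pattern.DOTALL);
        Matcher m = pattern.matcher(source);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        String source =
            "class A {\n"
            + "  public static FieldsConfig getFieldsConfig(){\n"
            + "    if(xxx) {\n"
            + "      sssss;\n"
            + "    }\n"
            + "    return;\n"
            + "  }\n"
            + "\n"
            + "  public static void other(){\n"
            + "  }\n"
            + "}\n";
        System.out.println(match(source));
    }
}
```

The lookahead is what decides where the body ends, so a } followed by something other than a modifier or the end of the class (an annotation or a comment, for instance) would not be recognised as the end of the method.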

PS: Yes, regex may not be a suitable way to parse a method strictly. Generally speaking, though, regex takes less effort than writing a parser and works well in specific situations. Adjust it and it should work for you.

Victor Choy answered Sep 30 '22 19:09