
Finding tokens in a Java String

Tags: java, string

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?

For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:

"hello[world]this[[is]me"

The output should be:

token[0] = "world"

token[1] = "[is"

(Note: the second token has a 'start' string in it)

digiarnie asked Oct 26 '25 15:10


2 Answers

I think you can use StringUtils.substringsBetween from Apache Commons Lang:

substringsBetween(java.lang.String str,
                  java.lang.String open,
                  java.lang.String close)

The API docs describe it as follows:

Searches a String for substrings delimited by a start and end tag, returning all matching substrings in an array.

The Commons Lang API documentation for substringsBetween is here:

http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)
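To show what such a call amounts to, here is a minimal sketch of the same left-to-right scan using only String.indexOf. The class name TokenScan and its method are hypothetical, and this is not the Commons Lang source. In particular, it returns an empty list when nothing matches, which may differ from what the library returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons Lang: collects the text between
// each `open` and the nearest following `close`, scanning left to right.
class TokenScan {
    static List<String> substringsBetween(String str, String open, String close) {
        List<String> tokens = new ArrayList<>();
        if (str == null || open == null || close == null
                || open.isEmpty() || close.isEmpty()) {
            return tokens; // assumption: missing or empty delimiters yield no tokens
        }
        int pos = 0;
        while (pos < str.length()) {
            int start = str.indexOf(open, pos);
            if (start < 0) {
                break; // no further opening delimiter
            }
            start += open.length();
            int end = str.indexOf(close, start);
            if (end < 0) {
                break; // opening delimiter without a closing one
            }
            tokens.add(str.substring(start, end));
            pos = end + close.length(); // resume after the closing delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example string from the question.
        System.out.println(substringsBetween("hello[world]this[[is]me", "[", "]"));
        // prints [world, [is]
    }
}
```

Because the scan resumes only after a closing delimiter, a second "[" inside a token is kept as part of that token, which is the behavior the question asks for.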

Jon answered Oct 29 '25 03:10


Here is how I would do it to avoid a dependency on Commons Lang.

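import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Escapes every regex metacharacter in a literal string by prefixing it with a backslash.
public static String escapeRegexp(String literal) {
    String specChars = "\\$.*+?|()[]{}^";
    StringBuilder result = new StringBuilder();
    for (char c : literal.toCharArray()) {
        if (specChars.indexOf(c) >= 0) {
            result.append('\\');
        }
        result.append(c);
    }
    return result.toString();
}

// Returns the given capture group from every match of pattern in content.
public static List<String> findGroup(String content, String pattern, int group) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(content);
    List<String> result = new ArrayList<String>();
    while (m.find()) {
        result.add(m.group(group));
    }
    return result;
}

// Extracts the text between each firstToken and the nearest following lastToken.
public static List<String> tokenize(String content, String firstToken, String lastToken) {
    String regexp = lastToken.length() > 1
                    // multi-character end token: reluctant match up to the nearest end token
                    ? escapeRegexp(firstToken) + "(.*?)" + escapeRegexp(lastToken)
                    // single-character end token: match anything except that (escaped) character
                    : escapeRegexp(firstToken) + "([^" + escapeRegexp(lastToken) + "]*)" + escapeRegexp(lastToken);
    return findGroup(content, regexp, 1);
}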

Use it like this:

String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");
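
On Java 5 and later, the JDK's own Pattern.quote can take the place of a hand-written escaping helper. Below is a minimal self-contained sketch under that assumption. The class name RegexTokens is hypothetical, and the (?s) flag is an added choice so that a token may span line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the regex approach using Pattern.quote.
class RegexTokens {
    static List<String> tokenize(String content, String firstToken, String lastToken) {
        // Pattern.quote turns each delimiter into a literal, so "[" and "]" need no manual escaping.
        // (?s) lets "." match line terminators; (.*?) is reluctant, so it stops at the nearest end delimiter.
        Pattern p = Pattern.compile(
                "(?s)" + Pattern.quote(firstToken) + "(.*?)" + Pattern.quote(lastToken));
        Matcher m = p.matcher(content);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example string from the question.
        System.out.println(tokenize("hello[world]this[[is]me", "[", "]"));
        // prints [world, [is]
    }
}
```

Using one reluctant pattern for every delimiter length avoids building a character class from the end token, at the cost of slightly more backtracking on long inputs.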
glmxndr answered Oct 29 '25 04:10