
Equivalent to StringTokenizer with multi-character delimiters

Tags:

java

I am trying to split a String into tokens.

The token delimiters are not single characters, some delimiters are contained in others (for example, & and &&), and I need the delimiters returned as tokens.
StringTokenizer cannot handle multi-character delimiters. I presume it is possible with String.split, but I cannot work out the regular expression that suits my needs.

Any ideas?

Example:

Token delimiters: "&", "&&", "=", "=>", " "  
String to tokenize: a & b&&c=>d  
Expected result: a String array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"

--- Edit ---
Thanks to all for your help; Dasblinkenlight gave me the solution. Here is the "ready to use" code I wrote with his help:

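import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String[] wonderfulTokenizer(String string, String[] delimiters) {
  // First, create a regular expression that matches the union of the delimiters.
  // Be aware that when one delimiter contains another (for example && and &),
  // the longer one must come before the shorter one (&& before &), or the regex
  // engine will recognize && as two &.
  // Reverse lexicographic order guarantees this whenever one delimiter is a
  // prefix of another. Note that this sorts the caller's array in place.
  Arrays.sort(delimiters, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
      return o2.compareTo(o1);
    }
  });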
  // Build a string that will contain the regular expression
  StringBuilder regexpr = new StringBuilder();
  regexpr.append('(');
  for (String delim : delimiters) { // For each delimiter
    if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
    // Quote the delimiter so that any regex metacharacters in it match literally.
    // (Prefixing every character with a backslash would break on letters and
    // digits: \d, for example, means "any digit".)
    regexpr.append(Pattern.quote(delim));
  }
  regexpr.append(')'); // Close the union
  Pattern p = Pattern.compile(regexpr.toString());

  // Now, search for the tokens
  List<String> res = new ArrayList<String>();
  Matcher m = p.matcher(string);
  int pos = 0;
  while (m.find()) { // While there's a delimiter in the string
    if (pos != m.start()) {
      // If there's something between the current and the previous delimiter
      // Add it to the tokens list
      res.add(string.substring(pos, m.start()));
    }
    res.add(m.group()); // add the delimiter
    pos = m.end(); // Remember end of delimiter
  }
  if (pos != string.length()) {
    // If some characters remain in the string after the last delimiter,
    // add them to the token list
    res.add(string.substring(pos));
  }
  // Return the result
  return res.toArray(new String[res.size()]);
}

If you have many strings to tokenize, this can be optimized by creating the Pattern only once.
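A minimal sketch of that optimization, assuming the delimiter set is fixed up front. The class name DelimiterTokenizer and its methods are hypothetical and not part of the original code; the sketch also sorts a copy of the delimiters by length (rather than lexicographically) and relies on Pattern.quote for escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compiles the delimiter pattern once and reuses it. */
final class DelimiterTokenizer {
  private final Pattern pattern;

  DelimiterTokenizer(String... delimiters) {
    if (delimiters.length == 0) {
      throw new IllegalArgumentException("at least one delimiter is required");
    }
    // Work on a copy so the caller's array is not reordered.
    String[] sorted = delimiters.clone();
    // Longest first, so that "&&" is tried before "&".
    Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
    StringBuilder regex = new StringBuilder();
    for (String d : sorted) {
      if (regex.length() > 0) regex.append('|');
      regex.append(Pattern.quote(d)); // match the delimiter literally
    }
    pattern = Pattern.compile(regex.toString());
  }

  /** Returns the tokens of input, delimiters included, in order of appearance. */
  List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<>();
    Matcher m = pattern.matcher(input);
    int pos = 0;
    while (m.find()) {
      if (pos != m.start()) tokens.add(input.substring(pos, m.start()));
      tokens.add(m.group());
      pos = m.end();
    }
    if (pos != input.length()) tokens.add(input.substring(pos));
    return tokens;
  }
}
```

An instance would be created once, for example new DelimiterTokenizer("&", "&&", "=", "=>", " "), and its tokenize method called for each string. The sketch assumes every delimiter is a non-empty string.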

Jean-Marc Astesana asked Aug 31 '12 12:08



2 Answers

You can use a Pattern and a simple loop to achieve the result you are looking for:

List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
    if (pos != m.start()) {
        res.add(s.substring(pos, m.start()));
    }
    res.add(m.group());
    pos = m.end();
}
if (pos != s.length()) {
    res.add(s.substring(pos));
}
for (String t : res) {
    System.out.println("'"+t+"'");
}

This produces the result below:

's'
'='
'a'
'&'
'=>'
'b'
Sergey Kalinichenko answered Nov 03 '22 07:11


Split won't do it for you, as it removes the delimiters. You probably need to tokenize the string on your own (e.g. with a for-loop) or use a framework like http://www.antlr.org/
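A rough sketch of what such a hand-written loop could look like, with no regular expressions involved. The class name LoopTokenizer is hypothetical. At each position the loop picks the longest delimiter that matches there, so "&&" wins over "&"; empty delimiters are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-rolled tokenizer that keeps delimiters as tokens. */
final class LoopTokenizer {
  static List<String> tokenize(String input, String... delimiters) {
    List<String> tokens = new ArrayList<>();
    int start = 0; // start of the pending non-delimiter text
    int i = 0;
    while (i < input.length()) {
      // Find the longest delimiter that begins at position i, if any.
      String best = null;
      for (String d : delimiters) {
        if (!d.isEmpty() && input.startsWith(d, i)
            && (best == null || d.length() > best.length())) {
          best = d;
        }
      }
      if (best == null) { // no delimiter here, keep scanning
        i++;
        continue;
      }
      if (start != i) tokens.add(input.substring(start, i)); // text before the delimiter
      tokens.add(best); // the delimiter itself
      i += best.length();
      start = i;
    }
    if (start != input.length()) tokens.add(input.substring(start)); // trailing text
    return tokens;
  }
}
```

This scans every delimiter at every position, which is fine for a handful of short delimiters but would need a smarter lookup for large delimiter sets.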

mwikblom answered Nov 03 '22 08:11