
Tokenizing a String but ignoring delimiters within quotes

Tags:

java

I wish to have the following String

!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"

to become an array of the following

{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }

I tried

new StringTokenizer(cmd, "\"")

but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.

Thanks.

EDIT: I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.

Ploo asked Sep 05 '25 17:09


2 Answers

It's much easier to use a java.util.regex.Matcher and call find() than to do any kind of split in this kind of scenario.

That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.

Here's an example:

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }

The above prints (as seen on ideone.com):

Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]

The pattern is essentially:

"([^"]*)"|(\S+)
 \_____/  \___/
    1       2

There are 2 alternates:

  • The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
  • The second alternate matches any sequence of non-whitespace characters, captured in group 2
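  • The order of the alternates matters in this pattern: if \S+ were tried first, it would consume the opening double quote as an ordinary non-whitespace character, and the quoted alternate would never match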
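
Applied to the command string from the question, the same pattern can collect the tokens into an array. The following is a minimal sketch rather than a definitive implementation; the class name QuotedTokenizer and the method name tokenize are illustrative choices, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this example.
class QuotedTokenizer {
    // Same pattern as above: a quoted run (group 1) or a run of non-whitespace (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 1 is non-null only when the quoted alternate matched.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        String[] parts = tokenize(cmd).toArray(new String[0]);
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Because group 1 excludes the surrounding quotes, "An argument" arrives in the array as a single entry without them, which is the shape the question asks for.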

Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
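
One possible extension for escaped quotes, offered as a sketch and assuming a backslash escapes the character that follows it (a convention the question does not specify): the quoted alternate accepts either an ordinary character or a backslash followed by any character, and the backslashes are stripped afterwards. The class name EscapedQuoteTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; assumes backslash is the escape character.
class EscapedQuoteTokenizer {
    // Regex: "((?:[^"\\]|\\.)*)"|(\S+)
    // Inside quotes, match any char except quote/backslash, or a backslash plus one char.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop each escaping backslash, keeping the character it protects.
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input text: say "he said \"hi\"" done
        for (String t : tokenize("say \"he said \\\"hi\\\"\" done")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

An unterminated quote falls through to the second alternate, so a token such as "abc (with no closing quote) is returned with its leading quote intact; whether that is acceptable depends on the input you expect.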

References

  • regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus

See also

  • regular-expressions.info/Examples - Programmer - Strings - for pattern with escaped quotes

Appendix

Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.

Related questions

  • Difference between a Deprecated and Legacy API?
  • Scanner vs. StringTokenizer vs. String.Split
  • Validating input using java.util.Scanner - has many examples
polygenelubricants answered Sep 07 '25 07:09


Do it the old-fashioned way. Write a function that examines each character in a for loop. When it sees a space, it takes everything since the previous split point (excluding the space) and adds it as an entry to the array, then notes the new position and repeats for the next part. When it encounters a double quote, it sets a boolean named 'inQuote' to true and ignores spaces while inQuote is true. When it hits another quote while inQuote is true, it sets the flag back to false and resumes splitting on spaces. You can then extend this as necessary to support escape characters, etc.

Could this be done with a regex? I don't know; I guess so. But the whole function would take less time to write than this reply did.
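
A minimal sketch of the loop described above might look like the following. It is one interpretation, not the answerer's own code: it accumulates characters in a StringBuilder instead of tracking substring positions, and the class name ManualTokenizer and the hasToken flag are assumptions added for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating the character-by-character approach.
class ManualTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        boolean hasToken = false; // true once the current token has started

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // Toggle quote mode; the quote character itself is not kept.
                inQuote = !inQuote;
                hasToken = true;
            } else if (c == ' ' && !inQuote) {
                // A space outside quotes ends the current token, if there is one.
                if (hasToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        // Flush the final token, including one left open by an unterminated quote.
        if (hasToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(tokenize(cmd));
    }
}
```

This version only splits on the space character and treats a quote in the middle of a word as part of that word's token, so ab"c d"e becomes the single token abc de. That differs from the regex approach and may or may not be what you want.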

GrandmasterB answered Sep 07 '25 07:09