Regular expression troubles, escaped quotes

Question

Basically, I'm being passed a string and I need to tokenise it in much the same way a *nix shell tokenises command-line options.

Say I have the following string:

"Hello\" World" "Hello Universe" Hi

How could I turn it into a 3-element list?

Hello" World
Hello Universe
Hi

The following is my first attempt, but it has a number of problems:

It leaves the quote characters in the tokens
It doesn't handle the escaped quote

Code:

public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token*/
        "|'[^']*'" + /*single quoted token*/
        "|[A-Za-z']+" /*everything else*/
    );

    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);

    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}

The code above produces this incorrect output:

"Hello\"
World
" "
Hello
Universe
Hi

EDIT: I'm totally open to a non-regex solution. Regex was just the first approach that came to mind.

Paul W · Accepted Answer

If you decide to forgo regex and do parsing instead, there are a couple of options. If you are willing to accept a single quote character (either the double quote or the single quote, but not both), then you can use StreamTokenizer to solve this easily:

public static List<String> tokenize(String s) throws IOException {
    List<String> opts = new ArrayList<String>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(s));
    st.quoteChar('\"');
    while (st.nextToken() != StreamTokenizer.TT_EOF) {
        opts.add(st.sval);
    }

    return opts;
}
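One thing to watch for with the default StreamTokenizer setup is that it also parses numbers (leaving sval null for them) and treats '/' as a comment character, which is rarely what you want for command-line style input. The following is a minimal sketch of one way around that, assuming you are happy to reset the syntax table and treat every non-whitespace character as part of a word. The class name ShellWords is made up for this example, and the IOException is wrapped only to keep the call site simple.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ShellWords {  // hypothetical name, not from the original answer
    /** Splits on whitespace, honouring "..." and '...' with backslash escapes inside quotes. */
    public static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();            // drop the defaults: number parsing, '/' comments
        st.wordChars(33, 255);       // every non-whitespace character is part of a word...
        st.whitespaceChars(0, ' ');  // ...and whitespace/control characters separate tokens
        st.quoteChar('"');           // quoteChar may be called once per quote character
        st.quoteChar('\'');

        List<String> opts = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                opts.add(st.sval);   // sval is set for both words and quoted strings
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader
            throw new UncheckedIOException(e);
        }
        return opts;
    }
}
```

Called as ShellWords.tokenize("\"Hello\\\" World\" \"Hello Universe\" Hi"), this should give the three tokens from the question. Note that a backslash outside quotes is just an ordinary word character here, and an apostrophe in a bare word such as don't starts a quoted section.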

If you must support both quote characters, here is a naive implementation that should work. One caveat: a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes:

    public static List<String> splitSSV(String in) throws IOException {
        ArrayList<String> out = new ArrayList<String>();

        StringReader r = new StringReader(in);
        StringBuilder b = new StringBuilder();
        int inQuote = -1;
        boolean escape = false;
        int c;
        // read each character
        while ((c = r.read()) != -1) {
            if (escape) {  // if the previous char is escape, add the current char
                b.append((char)c);
                escape = false;
                continue;
            }
            switch (c) {
            case '\\':  // deal with escape char
                escape = true;
                break;
            case '\"':
            case '\'':  // deal with quote chars
                if (inQuote == -1) {  // not in a quote
                    inQuote = c;  // now we are
                } else {
                    inQuote = -1;  // we were in a quote and now we aren't
                }
                break;
            case ' ':
                if (inQuote == -1) {  // if we aren't in a quote, then add token to list
                    out.add(b.toString());
                    b.setLength(0);
                } else {
                    b.append((char)c); // else append space to current token
                }
                break;
            default:
                b.append((char)c);  // append all other chars to current token
            }
        }
        if (b.length() > 0) {
            out.add(b.toString()); // add final token to list
        }
        return out;
    }
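If you need stricter behaviour than the naive version, here is a sketch of one possible variation. It is not the code above, just an illustration under a few assumptions of mine: a quoted section is closed only by the same quote character that opened it, runs of whitespace do not produce empty tokens, and an explicit "" still yields an empty token. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {  // hypothetical name
    public static List<String> split(String in) {
        List<String> out = new ArrayList<>();
        StringBuilder b = new StringBuilder();
        char quote = 0;           // 0 means "not inside a quoted section"
        boolean escape = false;   // true when the previous char was a backslash
        boolean started = false;  // true once the current token has begun (covers "")

        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (escape) {                    // an escaped char is always taken literally
                b.append(c);
                escape = false;
            } else if (c == '\\') {
                escape = true;
                started = true;
            } else if (quote != 0) {         // inside a quoted section
                if (c == quote) {
                    quote = 0;               // only the matching quote closes it
                } else {
                    b.append(c);             // the other quote char is plain text here
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
                started = true;
            } else if (Character.isWhitespace(c)) {
                if (started) {               // whitespace ends a token; runs are skipped
                    out.add(b.toString());
                    b.setLength(0);
                    started = false;
                }
            } else {
                b.append(c);
                started = true;
            }
        }
        if (started) {
            out.add(b.toString());           // add the final token
        }
        return out;
    }
}
```

Like the naive version, this sketch silently accepts an unterminated quote and treats the rest of the input as part of the last token; throwing an exception in that case may be more appropriate for your use.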

OpenSauce · Answer

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There are probably open-source parsers that can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.

David Schmitt · Answer

To recap, you want to split on whitespace, except inside double quotes, where a double quote preceded by a backslash does not end the quoted section.

Step 1: tokenize the input: `/([ \t]+)|(\\")|(")|([^ \t"]+)/`

This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.

Step 2: build a finite state machine matching and reacting to the tokens:

State: START

SPACE -> return empty string
ESCAPED_QUOTE -> Error (?)
QUOTE -> State := WITHIN_QUOTES
TEXT -> return text

State: WITHIN_QUOTES

SPACE -> add value to accumulator
ESCAPED_QUOTE -> add quote to accumulator
QUOTE -> return and clear accumulator; State := START
TEXT -> add text to accumulator
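The two steps above can be sketched in Java roughly as follows. This is only one reading of the state machine, with some assumptions of mine: SPACE in the START state emits nothing (rather than an empty string) so that the example input gives three tokens, the "Error (?)" case and an unterminated quote both throw IllegalArgumentException, and the TEXT alternative is tightened so that a backslash directly before a quote is left for ESCAPED_QUOTE (otherwise `Hello\` would be consumed as TEXT before `\"` could match). The class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenStateMachine {  // hypothetical name
    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    // Assumption: TEXT may contain a backslash only when it is not followed by a quote.
    private static final Pattern TOKEN = Pattern.compile(
        "([ \\t]+)|(\\\\\")|(\")|((?:[^ \\t\"\\\\]|\\\\(?!\"))+)");

    private enum State { START, WITHIN_QUOTES }

    public static List<String> split(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder acc = new StringBuilder();
        State state = State.START;
        Matcher m = TOKEN.matcher(input);

        while (m.find()) {
            if (state == State.START) {
                if (m.group(1) != null) {
                    // SPACE between tokens: nothing to emit in this sketch
                } else if (m.group(2) != null) {
                    throw new IllegalArgumentException(
                        "escaped quote outside quotes at index " + m.start());
                } else if (m.group(3) != null) {
                    state = State.WITHIN_QUOTES;
                } else {
                    out.add(m.group(4));       // TEXT -> return text
                }
            } else {
                if (m.group(3) != null) {      // QUOTE -> return and clear accumulator
                    out.add(acc.toString());
                    acc.setLength(0);
                    state = State.START;
                } else if (m.group(2) != null) {
                    acc.append('"');           // ESCAPED_QUOTE -> add quote
                } else {
                    acc.append(m.group());     // SPACE or TEXT -> add as-is
                }
            }
        }
        if (state == State.WITHIN_QUOTES) {
            throw new IllegalArgumentException("unterminated quote");
        }
        return out;
    }
}
```

Because the machine returns a token on every TEXT and every closing QUOTE, input such as foo"bar" comes out as two tokens (foo and bar) rather than one, which differs from what a shell would do.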

Tags: java, regex, unix, command-line

Glen

3 Answers
