Regex for splitting a string using space when not surrounded by single or double quotes

Tags:

java

regex

split


I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:

[^\s"']+|"([^"]*)"|'([^']*)'

I added the capturing groups because you don't want the quotes in the list.

This Java code builds the list. When one of the capturing groups matched, it adds that group's text, which excludes the quotes; otherwise an unquoted word was matched, so it adds the overall regex match.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString is the input text to split
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
}

If you don't mind having the quotes in the returned list, you can use much simpler code:

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
} 
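To see the difference between the two variants side by side, here is a minimal self-contained sketch that wraps both loops in methods and runs them on a sample input. The class name `QuoteSplitDemo`, the method names, and the sample string are made up for illustration; the two patterns are the ones shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two regexes come from the answer above.
class QuoteSplitDemo {
    // Groups 1 and 2 capture the text inside double and single quotes.
    private static final Pattern STRIPPING =
            Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
    // Same alternation without capturing groups; quotes stay in the match.
    private static final Pattern KEEPING =
            Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");

    static List<String> splitStrippingQuotes(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = STRIPPING.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                result.add(m.group(1));      // double-quoted, quotes removed
            } else if (m.group(2) != null) {
                result.add(m.group(2));      // single-quoted, quotes removed
            } else {
                result.add(m.group());       // unquoted word
            }
        }
        return result;
    }

    static List<String> splitKeepingQuotes(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = KEEPING.matcher(input);
        while (m.find()) {
            result.add(m.group());           // whole match, quotes included
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "one \"two three\" 'four five' six";
        System.out.println(splitStrippingQuotes(sample)); // [one, two three, four five, six]
        System.out.println(splitKeepingQuotes(sample));   // [one, "two three", 'four five', six]
    }
}
```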

There are several questions on Stack Overflow that cover this same question in various contexts using regular expressions. For instance:

  • parsings strings: extracting words and phrases
  • Best way to parse Space Separated Text

UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?

m/('.*?'|".*?"|\S+)/g 

Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).

This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.

Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
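For anyone who wants to pick up that exercise in Java (the language the question is tagged with), one possible sketch is to give each alternative of the Perl pattern its own capturing group, so the quotes fall outside the captured text. The class name `PerlStyleTokenizer` and the input sentence are assumptions; the sentence is reconstructed from the output listed above, since the original test input was not posted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an adaptation of the Perl regex above.
class PerlStyleTokenizer {
    // 1 = single-quoted content, 2 = double-quoted content, 3 = unquoted word.
    private static final Pattern TOKEN =
            Pattern.compile("'(.*?)'|\"(.*?)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    tokens.add(m.group(g));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed input, reconstructed from the output shown in the answer.
        String input = "This is a string that \"will be\" highlighted when your "
                + "'regular expression' matches something.";
        tokenize(input).forEach(System.out::println);
    }
}
```

As with the Perl version, a quoted empty string still produces a token; here it is an empty string rather than a bare pair of quotes.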


If you want to allow escaped quotes inside the string, you can use something like this:

(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))

Quoted strings will be in group 2 (group 1 holds the quote character that opened them), and single unquoted words will be in group 3.

You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
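If you would rather try it from Java than in an online tester, here is a rough sketch of how the pattern could be used. The class name `EscapedQuoteTokenizer` and the sample input are made up for illustration; the regex is the one above, with the backslashes doubled for the Java string literal. Note that the backslash before an escaped quote is kept in the captured text; unescaping it would be a separate step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex comes from the answer above.
class EscapedQuoteTokenizer {
    // Regex: (?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
    private static final Pattern TOKEN =
            Pattern.compile("(?:(['\"])(.*?)(?<!\\\\)(?>\\\\\\\\)*\\1|([^\\s]+))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is set for quoted strings, group 3 for unquoted words.
            tokens.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The actual characters of the input are: say 'it\'s fine' now
        String input = "say 'it\\'s fine' now";
        System.out.println(tokenize(input)); // [say, it\'s fine, now]
    }
}
```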


The regex from Jan Goyvaerts is the best solution I have found so far, but it also produces empty (null) group matches, which he filters out in his code. These empty matches also show up in regex testers (e.g. rubular.com). If you turn the search around (first look for the quoted parts, then for the space-separated words), you can do it in one pass with:

("[^"]*"|'[^']*'|[\S]+)+
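A quick way to check this pattern from Java is sketched below; the class name `QuotedFirstTokenizer` and the sample input are assumptions. It takes the whole match for each token, so the quotes stay in the results. One thing to be aware of: `\S` also matches quote characters, so a quote that starts in the middle of a word (as in `a"b c"`) is not treated as opening a quoted section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex comes from the answer above.
class QuotedFirstTokenizer {
    // Quoted alternatives come first, then runs of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("(\"[^\"]*\"|'[^']*'|[\\S]+)+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // whole match, quotes included
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "This is \"a test\" of 'the regex'";
        System.out.println(tokenize(input)); // [This, is, "a test", of, 'the regex']
    }
}
```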