
Split a string in Java on whitespace, ignoring spaces inside double or single quotes and spaces preceded by \

Tags: java, string, regex

I am totally new to regular expressions. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes and are not preceded by a '\'

For example:

He is a "man of his" words\ always

must be split as

He
is
a
"man of his"
words\ always

I understand that the following

List<String> matchList = new ArrayList<String>();
// Either a run of non-space, non-quote characters, or a double-quoted or single-quoted run
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(StringToBeMatched);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}

will split the example string on all spaces that are not surrounded by single or double quotes.

How do I incorporate the third condition, ignoring whitespace that is preceded by a \?
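
For illustration, here is a minimal self-contained sketch of what the current pattern does with the example string. The class name `CurrentAttempt` and the helper `tokenize` are made up for this sketch; the pattern itself is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrentAttempt {
    // Pattern from the question: unquoted run, double-quoted run, or single-quoted run.
    private static final Pattern TOKEN =
        Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");

    // Hypothetical helper: collects each match of the pattern as one token.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal \\ is a single backslash in the actual string.
        String input = "He is a \"man of his\" words\\ always";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

With this input the quoted phrase stays together, but `words\` and `always` come out as two separate tokens, because nothing in the pattern treats a backslash specially. That is the gap the third condition has to close.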

Sriram Manohar asked Dec 22 '14 18:12


2 Answers

You can use this regex:

((["']).*?\2|(?:[^\\ ]+\\\s+)+[^\\ ]+|\S+)


In Java:

Pattern regex = Pattern.compile(
    "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)");

Explanation:

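This regex is an alternation of three branches, tried in order:

  1. First try (["']).*?\2 to match any quoted (double or single) string. The backreference is \2 because the whole regex is wrapped in capturing group 1.
  2. Then try (?:[^\\ ]+\\\s+)+[^\\ ]+ to match any string containing escaped spaces: one or more runs of non-space, non-backslash characters, each followed by a backslash and whitespace, then a final run.
  3. Finally, use \S+ to match any word with no spaces.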
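
Putting the pieces together, a minimal runnable sketch might look like the following. The class name `EscapedSpaceSplitter` and the helper `tokenize` are hypothetical names chosen for this example; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedSpaceSplitter {
    // Same pattern as in the answer above; compiled once and reused.
    private static final Pattern TOKEN = Pattern.compile(
        "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)");

    // Hypothetical helper: every match of the pattern becomes one token.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal \\ is a single backslash in the actual string.
        String input = "He is a \"man of his\" words\\ always";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

For the example string from the question this prints five lines: He, is, a, "man of his", and words\ always. Quotes and backslashes are kept in the tokens; stripping them, if wanted, would be a separate step.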
anubhava answered Oct 14 '22 16:10


Anubhava's solution is nice; I particularly like his use of \S+. My solution groups things similarly, except that the third alternative anchors on word boundaries at the beginning and end of each word.

RegEx

(?i)((?:(['|"]).+\2)|(?:\w+\\\s\w+)+|\b(?=\w)\w+\b(?!\w))

For Java

(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))

Example

String subject = "He is a \"man of his\" words\\ always 'and forever'";
Pattern pattern = Pattern.compile( "(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))" );
Matcher matcher = pattern.matcher( subject );
while( matcher.find() ) {
    System.out.println( matcher.group() );  // print the matched token as-is
}

Result

He
is
a
"man of his"
words\ always
'and forever'

Detailed Explanation

"(?i)" +                 // Match the remainder of the regex with the options: case insensitive (i)
"(" +                    // Match the regular expression below and capture its match into backreference number 1
                            // Match either the regular expression below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "(" +                    // Match the regular expression below and capture its match into backreference number 2
            "['|\"]" +                // Match a single character present in the list “'|"” (inside a character class, | is a literal, not an alternation)
         ")" +
         "." +                    // Match any single character that is not a line break character
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\2" +                   // Match the same text as most recently matched by capturing group number 2
      ")" +
   "|" +                    // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\\\" +                   // Match the character “\” literally
         "\\s" +                   // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      ")+" +                   // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   "|" +                    // Or match regular expression number 3 below (the entire group fails if this one fails to match)
      "\\b" +                   // Assert position at a word boundary
      "(?=" +                  // Assert that the regex below can be matched, starting at this position (positive lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
      "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
         "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      "\\b" +                   // Assert position at a word boundary
      "(?!" +                  // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
")"  
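
As a hedged aside on two details spelled out in the breakdown above: the character class lists | as a literal, and the .+ between the delimiters is greedy. The sketch below runs the same pattern on two inputs that are not in the original post to show what follows from that. The class name `GreedyQuoteCheck` and the helper `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyQuoteCheck {
    // Pattern copied unchanged from the "For Java" form above.
    private static final Pattern TOKEN = Pattern.compile(
        "(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))");

    // Hypothetical helper: every match of the pattern becomes one token.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // | is in the character class, so it acts as a delimiter like a quote:
        System.out.println(tokenize("|a b|"));          // [|a b|]
        // Greedy .+ runs to the LAST matching quote, so two quoted
        // strings on one line are swallowed as a single token:
        System.out.println(tokenize("\"a\" b \"c\""));  // ["a" b "c"]
    }
}
```

Neither case occurs in the example subject, so the result shown earlier is unaffected. If such inputs matter, a lazy quantifier and a class containing only the two quote characters would be the usual adjustments; that is a suggestion, not part of the original answer.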
Edward J Beckett answered Oct 14 '22 16:10