I am totally new to regular expressions. I'm trying to put together an expression that will split the example string on all spaces that are not surrounded by single or double quotes and are not preceded by a '\'.
E.g.:
He is a "man of his" words\ always
must be split as
He
is
a
"man of his"
words\ always
I understand that
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(StringToBeMatched);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}
will split the example string on all spaces that are not surrounded by single or double quotes.
How do I incorporate the third condition: ignoring the whitespace if it is preceded by a \?
You can use this regex:
((["']).*?\2|(?:[^\\ ]+\\\s+)+[^\\ ]+|\S+)
In Java:
Pattern regex = Pattern.compile (
"(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)" );
Explanation:
This regex works on alternation:
(["']).*?\2 to match any quoted (double or single) string.
(?:[^\\ ]+\\\s+)+[^\\ ]+ to match any string with escaped spaces.
\S+ to match any word with no spaces.

Anubhava's solution is nice. I particularly like his use of \S+. My solution is similar in its groupings, except that the third alternative anchors on the beginning and ending word boundaries:
(?i)((?:(['|"]).+\2)|(?:\w+\\\s\w+)+|\b(?=\w)\w+\b(?!\w))
(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))
String subject = "He is a \"man of his\" words\\ always 'and forever'";
Pattern pattern = Pattern.compile( "(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))" );
Matcher matcher = pattern.matcher( subject );
while( matcher.find() ) {
    // group() returns the whole token matched by the current find()
    System.out.println( matcher.group() );
}
He
is
a
"man of his"
words\ always
'and forever'
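One caveat worth checking: the quoted alternative in this pattern uses a greedy .+, and the sample input has only one phrase per quote character. The sketch below is an assumption-laden illustration, not part of the original answer: it isolates a simplified form of that alternative, (["']).+\1, and compares it with a lazy .+? on a made-up input that contains two double-quoted phrases. The class name GreedyVsLazyDemo and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Hypothetical helper (illustration only): returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input with two phrases that use the same quote character.
        String input = "say \"a b\" then \"c d\"";

        // Greedy: .+ runs to the LAST matching quote, so both phrases merge into one match.
        System.out.println(findAll("([\"']).+\\1", input));
        // prints ["a b" then "c d"]

        // Lazy: .+? stops at the NEXT matching quote, so each phrase is its own match.
        System.out.println(findAll("([\"']).+?\\1", input));
        // prints ["a b", "c d"]
    }
}
```

If inputs may contain several phrases with the same quote character, the lazy quantifier is probably the safer choice.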
"(?i)" + // Match the remainder of the regex with the options: case insensitive (i)
"(" + // Match the regular expression below and capture its match into backreference number 1
// Match either the regular expression below (attempting the next alternative only if this one fails)
"(?:" + // Match the regular expression below
"(" + // Match the regular expression below and capture its match into backreference number 2
"['|\"]" + // Match a single character present in the list “'|"”
")" +
"." + // Match any single character that is not a line break character
"+" + // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\\2" + // Match the same text as most recently matched by capturing group number 2
")" +
"|" + // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
"(?:" + // Match the regular expression below
"\\w" + // Match a single character that is a “word character” (letters, digits, etc.)
"+" + // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\\\\" + // Match the character “\” literally
"\\s" + // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
"\\w" + // Match a single character that is a “word character” (letters, digits, etc.)
"+" + // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
")+" + // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"|" + // Or match regular expression number 3 below (the entire group fails if this one fails to match)
"\\b" + // Assert position at a word boundary
"(?=" + // Assert that the regex below can be matched, starting at this position (positive lookahead)
"\\w" + // Match a single character that is a “word character” (letters, digits, etc.)
")" +
"\\w" + // Match a single character that is a “word character” (letters, digits, etc.)
"+" + // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"\\b" + // Assert position at a word boundary
"(?!" + // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
"\\w" + // Match a single character that is a “word character” (letters, digits, etc.)
")" +
")"