I am trying to create a Java regex that will replace every run of whitespace in a string with a single space, except where that whitespace occurs between quotes (single or double).
If I were just looking for double quotes, I could use a look ahead:
text.replaceAll("\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " ");
And if I were just looking for single quotes, I could use a similar pattern.
The trick is finding both.
I had the great idea to run the double quote pattern followed by the single quote pattern, but of course that ended up replacing all spaces regardless of quotes.
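To make the failure concrete, here is a minimal sketch of that attempt. The class name, method names, and constants are made up for illustration, and the single quote pattern is simply the double quote lookahead with the quote character swapped.

```java
class LookaheadAttempt {
    // Whitespace followed by an even number of double quotes up to the end of input.
    static final String OUTSIDE_DOUBLE = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    // The same idea for single quotes (assumed variant, not from the original post).
    static final String OUTSIDE_SINGLE = "\\s+(?=(?:[^']*'[^']*')*[^']*$)";

    // Works when only double quotes matter.
    static String doubleOnly(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ");
    }

    // Running both patterns in sequence: the single quote pass knows nothing
    // about double quotes, so it also collapses whitespace inside "..." sections.
    static String chained(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ").replaceAll(OUTSIDE_SINGLE, " ");
    }

    public static void main(String[] args) {
        String input = "a  b  \"c   d\"  e";
        System.out.println(doubleOnly(input)); // a b "c   d" e
        System.out.println(chained(input));    // a b "c d" e
    }
}
```

The second pass sees zero single quotes after every whitespace run, which counts as "even", so nothing is protected any more.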
So here are some tests and expected results:
a  b   c  d    e --> a b c d e
a  b  "c   d"  e --> a b "c   d" e
a  b  'c   d'  e --> a b 'c   d' e
a  b  "c   d'  e --> a b "c d' e (Can't mix and match quotes)
Is there any way to accomplish this in Java regex?
Assume invalid input was already verified separately. So none of the following will ever occur:
a "b c ' d
a 'b " c' d
a 'b c d
To check whether the string contains a double quote, you can use: text_line.contains("\""); Here \" escapes the double quote inside the string literal.
To put quotation marks inside a quoted string, you need to escape them by placing a backslash ( \ ) before the quote character: \" and \'.
The pattern below also copes with escaped quotes (\" and \') and multi-line quotes. Several optimisations to reduce the number of steps:
Word1  Word2 (two spaces in between words)
'example'  another_word (two spaces in between words)
\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
https://regex101.com/r/wT6tU2/4
Replacement: "$1 " (yes, there is a space at the end)
// Requires: import java.util.regex.PatternSyntaxException;
try {
    // Group 1 swallows unquoted text and whole quoted sections; group 2 is the whitespace run to collapse
    String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}
// \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
//
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
//
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
//    Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
//       Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
//       Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
//          Match any single character NOT present in the list below «[^\s"']+»
//             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//             A single character from the list “"'” «"'»
//       Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
//          Match the character “ ” literally « »
//          Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
//             Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//          Match the character “"” literally «"»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^"\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^"\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “"” «"»
//                The backslash character «\\»
//          Match the character “"” literally «"»
//       Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//          Match the character “'” literally «'»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^'\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^'\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “'” «'»
//                The backslash character «\\»
//          Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 2 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
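For anyone who wants to try the pattern against the tests from the question, here is a small self-contained sketch. The class name QuoteAwareCollapse, the constant, and the collapse method are hypothetical names chosen for this example; the regex and replacement are the ones given above, compiled once instead of on every call.

```java
import java.util.regex.Pattern;

class QuoteAwareCollapse {
    // Group 1: unquoted text, single spaces, and complete quoted sections (kept as-is).
    // Group 2: the whitespace run that gets replaced by one space.
    private static final Pattern UNQUOTED_WS = Pattern.compile(
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)");

    static String collapse(String text) {
        // "$1 " puts back everything captured before the whitespace, then a single space.
        return UNQUOTED_WS.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("a  b   c  d    e"));      // a b c d e
        System.out.println(collapse("a  b  \"c   d\"  e"));    // a b "c   d" e
        System.out.println(collapse("a  b  'c   d'  e"));      // a b 'c   d' e
    }
}
```

Because of \G, matching stops at the first position where no further whitespace run can be found, so input with unbalanced quotes is left untouched from the stray quote onward. That fits the stated assumption that invalid input has already been rejected.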
I would recommend standardizing your string encapsulation, then using a regex to replace the alternate quote style with the standard one. Let's say you settle on double quotes ("): you could then split your string on " so that all your odd elements are quoted contents and your even elements are unquoted. Run your regex replace on only the even elements and rebuild your string from the altered array.
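A rough sketch of the split-and-rebuild idea, assuming the quotes are balanced and have already been standardized to double quotes (that normalization step is not shown, and it would change the output if single-quoted sections must keep their original delimiters). The class and method names are made up for illustration.

```java
class SplitOnQuotes {
    static String collapse(String text) {
        // Limit -1 keeps trailing empty strings, so a quote at the very end
        // of the input is not lost when the pieces are joined again.
        String[] parts = text.split("\"", -1);
        // With 0-based indexing, even elements are outside quotes and odd elements are inside.
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll("\\s+", " ");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        System.out.println(collapse("a  b  \"c   d\"  e")); // a b "c   d" e
    }
}
```

This avoids the lookahead entirely, at the cost of not handling escaped quotes inside a quoted section.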