How can I detect space that is not quoted or double quoted

I am trying to create a Java regex that will replace every run of whitespace in a string with a single space, except when that whitespace occurs between quotes (single or double).

If I were just looking for double quotes, I could use a look ahead:

text.replaceAll("\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " ");

And if I were just looking for single quotes, I could use a similar pattern.

The trick is finding both.

I had the great idea to run the double quote pattern followed by the single quote pattern, but of course that ended up replacing all spaces regardless of quotes.

So here are some tests and expected results:

a   b   c    d   e   -->  a b c d e
a   b   "c    d"   e -->  a b "c    d" e
a   b   'c    d'   e -->  a b 'c    d' e
a   b   "c    d'   e -->  a b "c d' e    (Can't mix and match quotes)

Is there any way to accomplish this in Java regex?

Assume invalid input was already verified separately. So none of the following will ever occur:

a "b c ' d
a 'b " c' d
a 'b c d
Victor Grazi asked Dec 17 '15 20:12


2 Answers

Supports

  • escaping quotes via \" and \' and multi-line quotes.
  • unmatched quotes where quotes are terminated by the end of the string.
  • additional optimisations for large files

Optimisations

Several optimisations to reduce the number of steps:

Example 1 - for the string Word1 Word2 (two spaces in between words)

  • @sln's version here takes ~241 steps
  • this version takes just ~29 steps

Example 2 - for the string 'example' another_word (two spaces in between words)

  • @sln's version here takes ~28,714 steps
  • this version takes just ~36 steps

Example 3 - for WordPress's /wp-includes/media.php file

  • @sln's version here causes a catastrophic backtracking error
  • this version takes just ~122,701 steps

Regular Expression

\G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)

https://regex101.com/r/wT6tU2/4

Replacement

$1 followed by a single space (yes, there is a space at the end)

Code

try {
    String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}
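
As a quick check, the pattern above can be run against the valid test cases from the question. This is only a minimal sketch: the class name QuoteAwareWhitespace, the method name collapse, and the escaped-quote sample are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class QuoteAwareWhitespace {
    // Same pattern as in the answer, precompiled once.
    // Group 1: a possessive run of unquoted text, single spaces, and quoted strings.
    // Group 2: the whitespace run that should collapse to one space.
    private static final Pattern UNQUOTED_WS = Pattern.compile(
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)");

    // Hypothetical helper name: collapses whitespace outside quotes.
    public static String collapse(String input) {
        return UNQUOTED_WS.matcher(input).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        String[] samples = {
            "a   b   c    d   e",
            "a   b   \"c    d\"   e",
            "a   b   'c    d'   e",
            "\"x \\\"  y\"  z"   // escaped quote inside a quoted string
        };
        for (String sample : samples) {
            System.out.println(sample + "  -->  " + collapse(sample));
        }
    }
}
```

Precompiling the Pattern avoids recompiling it on every call, which String.replaceAll would otherwise do.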

Human Readable

// \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
// 
// Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Default line breaks; Regex syntax only
// 
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
//    Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
//       Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
//       Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
//          Match any single character NOT present in the list below «[^\s"']+»
//             Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//             A “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//             A single character from the list “"'” «"'»
//       Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
//          Match the character “ ” literally « »
//          Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
//             Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
//       Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
//          Match the character “"” literally «"»
//          Match any single character NOT present in the list below «[^"\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “"” «"»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^"\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^"\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “"” «"»
//                The backslash character «\\»
//          Match the character “"” literally «"»
//       Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
//          Match the character “'” literally «'»
//          Match any single character NOT present in the list below «[^'\\]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             The literal character “'” «'»
//             The backslash character «\\»
//          Match the regular expression below «(?:\\.[^'\\]*)*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//             Match the backslash character «\\»
//             Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
//             Match any single character NOT present in the list below «[^'\\]*»
//                Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//                The literal character “'” «'»
//                The backslash character «\\»
//          Match the character “'” literally «'»
// Match the regex below and capture its match into backreference number 2 «(\s+)»
//    Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Dean Taylor answered Oct 07 '22 04:10


I would recommend standardizing your string encapsulation, then using a regex to replace the alternate quote character with the standard one. Let's say you settle on double quotes ("): you could then split your string on " so that, counting from zero, all the odd elements are quoted contents and all the even elements are unquoted. Run your regex replace on only the even elements and rebuild your string from the altered array.
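
The steps described above could be sketched in Java roughly as follows. The class name SplitOnQuotes and the method name collapse are hypothetical, and the sketch assumes the input never contains apostrophes used as ordinary characters and never contains escaped quotes. Note that, because of the normalisation step, single-quoted sections come back wrapped in double quotes.

```java
public class SplitOnQuotes {
    // Hypothetical helper name: collapses whitespace outside quotes
    // by splitting on a single, standardised quote character.
    public static String collapse(String input) {
        // Step 1: standardise on double quotes (assumes ' is only ever a quote).
        String normalised = input.replace('\'', '"');

        // Step 2: split on the quote. The limit of -1 keeps trailing empty
        // strings, so a quote at the very end of the input is not lost.
        String[] parts = normalised.split("\"", -1);

        // Step 3: even indexes (0, 2, 4, ...) are outside quotes, odd ones inside.
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll("\\s+", " ");
        }

        // Step 4: rebuild the string with the standard quote character.
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        System.out.println(collapse("a   b   'c    d'   e"));
        System.out.println(collapse("a   b   \"c    d\"   e"));
    }
}
```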

ClanK answered Oct 07 '22 04:10