Regex split string on a char with exception for inner-string

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it on the | character, with an exception for any | contained in a quoted string.

The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']

I've already found an answer to a similar question here: https://stackoverflow.com/a/11457952/11260467

However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out here who is better than me when it comes to regular expressions?

Xiidref asked Oct 16 '21 18:10


4 Answers

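This is easily done with the (*SKIP)(*FAIL) functionality PCRE offers. The first alternative matches a whole quoted string and then discards it, so only the pipes outside quotes remain as split points: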

(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*

In PHP this could be:

<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';

$splitted = preg_split($pattern, $string);
print_r($splitted);
?>

This yields:

Array
(
    [0] => aa
    [1] => bb
    [2] => "cc | dd"
    [3] => 'ee | ff'
)

See a demo on regex101.com and on ideone.com.
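
For the multi-separator part of the question, the same (*SKIP)(*FAIL) pattern can be generalized by swapping the literal \| for a character class. The following is only a minimal sketch and not part of the original answer: the helper name splitOutsideQuotes and the ; separator are assumptions chosen for illustration, and the separator list is assumed to be non-empty.

```php
<?php
// Hypothetical helper: split on any of the given separator characters,
// ignoring separators that appear inside single- or double-quoted strings.
function splitOutsideQuotes(string $input, string $separators = '|'): array
{
    // preg_quote() escapes regex metacharacters (and the ~ delimiter),
    // so the separators are safe to place inside a character class.
    $class = preg_quote($separators, '~');

    // Same structure as the pattern above: skip quoted strings,
    // then split on a separator with optional surrounding whitespace.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*[' . $class . ']\s*~';

    return preg_split($pattern, $input);
}

// Assumed sample input using both | and ; as separators.
$parts = splitOutsideQuotes("aa | bb ; \"cc | dd\" ; 'ee ; ff'", '|;');
print_r($parts);
```

With this input the four pieces should be aa, bb, "cc | dd" and 'ee ; ff', with the quotes kept as in the original answer.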

Jan answered Oct 21 '22 17:10


This is easier if you match the parts instead of splitting. Alternatives are tried from left to right, and quantifiers are greedy by default, so they consume as many characters as possible. This makes it possible to put the more specific patterns for quoted strings before the pattern for an unquoted token:

$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';

$pattern = <<<'PATTERN'
(
    (?:[|[]|^) # after | or [ or string start
    \s*
    (?<token> # name the match
        "[^"]*" # string in double quotes
        |
        '[^']*'  # string in single quotes
        |
        [^\s|]+ # run of characters that are neither whitespace nor a pipe
    )
    \s*
)x
PATTERN;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);

Output:

array(4) {
  [0]=>
  string(2) "aa"
  [1]=>
  string(2) "bb"
  [2]=>
  string(9) ""cc | dd""
  [3]=>
  string(9) "'ee | ff'"
}

Hints:

  1. The <<<'PATTERN' is called NOWDOC syntax (a HEREDOC with a single-quoted identifier); nothing inside it is interpolated, which cuts down on escaping
  2. I use () as pattern delimiters - they are group 0
  3. Naming matches makes code a lot more readable
  4. Modifier x allows you to indent and comment the pattern

ThW answered Oct 21 '22 17:10


Use

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map('trim', $matches[1]));

See PHP proof.

Results:

Array
(
    [0] => aa
    [1] => bb
    [2] => cc | dd
    [3] => ee | ff
)

EXPLANATION

--------------------------------------------------------------------------------
  (?|                      Branch reset group, does not capture:
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^\"]*                   any character except: '\"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^|'\"]+                 any character except: '|', ''', '\"'
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \|                       '|'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of grouping
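
The effect of the branch reset group can be seen in isolation with a small comparison. This is only an illustrative sketch: the subject string and the variable names $withReset and $withoutReset are assumptions, and the patterns are cut down to the two quoted alternatives.

```php
<?php
// Illustrative subject, not taken from the answer above.
$subject = "'ee | ff'";

// With (?| ... ) every alternative numbers its capture as group 1.
preg_match('~(?|"([^"]*)"|\'([^\']*)\')~', $subject, $withReset);

// Without it, the second alternative captures to group 2
// and group 1 is reported as an empty string.
preg_match('~"([^"]*)"|\'([^\']*)\'~', $subject, $withoutReset);

var_dump($withReset[1]);    // string(7) "ee | ff"
var_dump($withoutReset[1]); // string(0) ""
var_dump($withoutReset[2]); // string(7) "ee | ff"
```

This is why the answer can read every value from $matches[1], regardless of which alternative matched.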

Ryszard Czech answered Oct 21 '22 16:10


It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.

(['"]).*?\1\K| *\| *

PCRE Demo

(['"]) # match a single or double quote and save to capture group 1
.*?    # match zero or more characters lazily
\1     # match the content of capture group 1
\K     # reset the starting point of the reported match and discard
       # any previously-consumed characters from the reported match
|      # or
\ *    # match zero or more spaces
\|     # match a pipe character
\ *    # match zero or more spaces

Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.
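
What \K does to the reported match can be checked with preg_match and PREG_OFFSET_CAPTURE. The sketch below uses a simplified pattern (quoted string, then \K, then the separator, without the alternation from the answer) and an assumed sample subject, so it illustrates the mechanism but is not the answer's expression itself.

```php
<?php
// Simplified pattern for illustration: quoted string, \K, then the separator.
$pattern = '~([\'"]).*?\1\K *\| *~';
$subject = '"cc | dd" | x';

preg_match($pattern, $subject, $match, PREG_OFFSET_CAPTURE);

// The quoted string is consumed but dropped from the reported match,
// which therefore starts right after the closing quote.
var_dump($match[0][0]); // string(3) " | "
var_dump($match[0][1]); // int(9)
```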

Cary Swoveland answered Oct 21 '22 15:10