I have a string like aa | bb | "cc | dd" | 'ee | ff'
and I'm looking for a way to split this to get all the values separated by the |
character with exeption for |
contained in strings.
The idea is to get something like this [a, b, "cc | dd", 'ee | ff']
I've already found an answer to a similar question here : https://stackoverflow.com/a/11457952/11260467
However I can't find a way to adapt it for a case with multiple separator characters, is there someone out here which is less dumb than me when it come to regular expressions ?
To split a string with specific character as delimiter in Java, call split() method on the string object, and pass the specific character as argument to the split() method. The method returns a String Array with the splits as elements in the array.
split(String regex) method splits this string around matches of the given regular expression. This method works in the same way as invoking the method i.e split(String regex, int limit) with the given expression and a limit argument of zero. Therefore, trailing empty strings are not included in the resulting array.
$ means "Match the end of the string" (the position after the last character in the string).
Introduction to the Python regex split() function The built-in re module provides you with the split() function that splits a string by the matches of a regular expression. In this syntax: pattern is a regular expression whose matches will be used as separators for splitting. string is an input string to split.
Split(String) Split(String) Split(String) Split(String) Splits an input string into an array of substrings at the positions defined by a regular expression pattern specified in the Regex constructor. Splits an input string into an array of substrings at the positions defined by a regular expression pattern.
Splits an input string into an array of substrings at the positions defined by a regular expression pattern specified in the Regex constructor. Splits an input string into an array of substrings at the positions defined by a regular expression pattern.
The Pythons re module’s re.split () method split the string by the occurrences of the regex pattern, returning a list containing the resulting substrings. After reading this article you will be able to perform the following split operations using regex in Python. Split the string by each occurrence of the pattern.
If a match is found at the beginning or the end of the input string, an empty string is included at the beginning or the end of the returned array. The following example uses the regular expression pattern \d+ to split an input string on numeric characters.
This is easily done with the (*SKIP)(*FAIL)
functionality pcre
offers:
(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*
In PHP
this could be:
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';
$splitted = preg_split($pattern, $string);
print_r($splitted);
?>
And would yield
Array
(
[0] => aa
[1] => bb
[2] => "cc | dd"
[3] => 'ee | ff'
)
See a demo on regex101.com and on ideone.com.
This is easier if you match the parts (not split). Patterns are greedy by default, they will consume as many characters as possible. This allows to define more complex patterns for the quoted string before providing a pattern for an unquoted token:
$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';
$pattern = <<<'PATTERN'
(
(?:[|[]|^) # after | or [ or string start
\s*
(?<token> # name the match
"[^"]*" # string in double quotes
|
'[^']*' # string in single quotes
|
[^\s|]+ # non-whitespace
)
\s*
)x
PATTERN;
preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);
Output:
array(4) {
[0]=>
string(2) "aa"
[1]=>
string(2) "bb"
[2]=>
string(9) ""cc | dd""
[3]=>
string(9) "'ee | ff'"
}
<<<'PATTERN'
is called HEREDOC syntax and cuts down on escaping()
as pattern delimiters - they are group 0x
allows to indent and comment the patternUse
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));
See PHP proof.
Results:
Array
(
[0] => aa
[1] => bb
[2] => cc | dd
[3] => ee | ff
)
EXPLANATION
--------------------------------------------------------------------------------
(?| Branch reset group, does not capture:
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\"]* any character except: '\"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^|'\"]+ any character except: '|', ''', '\"'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of grouping
It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.
(['"]).*?\1\K| *\| *
PCRE Demo
(['"]) # match a single or double quote and save to capture group 1
.*? # match zero or more characters lazily
\1 # match the content of capture group 1
\K # reset the starting point of the reported match and discard
# any previously-consumed characters from the reported match
| # or
\ * # match zero or more spaces
\| # match a pipe character
\ * # match zero or more spaces
Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote or a quoted substring.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With