I am trying to retrieve specific fields from a text file whose metadata looks as follows:
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
And I have the following script for retrieving the field 'cell':
while read line
do
cell="$(echo $line | cut -d";" -f2 )"
echo $cell
done < files.txt
However, this script retrieves the whole field, cell=ABC, whereas I just want the value 'ABC' from the field. How do I retrieve the value after the regex, in the same line of code?
An additional binary operator, =~, is available, with the same precedence as == and != . When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise.
* - means "0 or more instances of the preceding regex token"
$BASH_REMATCH is an array and contains the matched text snippets. ${BASH_REMATCH[0]} contains the complete match. The remaining elements, e.g. ${BASH_REMATCH[1]} , contain the portion which were matched by () subexpressions.
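To make the description of BASH_REMATCH concrete, here is a minimal bash sketch using a sample string with two capture groups. The variable names str, re, whole, first and second are illustrative only; the regex is kept in a variable and expanded unquoted so that it is treated as a regex.

```bash
#!/usr/bin/env bash
# Illustrative sample input and regex (both are assumptions for this sketch).
str='This AND that.'
re='^(.+) AND (.+)\.'

if [[ $str =~ $re ]]; then
  whole=${BASH_REMATCH[0]}   # the complete match
  first=${BASH_REMATCH[1]}   # text matched by the first (...) group
  second=${BASH_REMATCH[2]}  # text matched by the second (...) group
  printf 'whole=%s first=%s second=%s\n' "$whole" "$first" "$second"
fi
```

Element 0 always holds the overall match, so capture groups are numbered from 1.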
If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]].
Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.
Sample input string:
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.
In bash, the special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.
bash 3.2+:
[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving the word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match. Workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).
bash 3.0 and 3.1, or after setting shopt -s compat31:
Quote the regex to make it work:
[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
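Tying this back to the loop in the question, the following is a minimal bash sketch rather than a definitive implementation. The helper name get_field is hypothetical, the input is supplied through a here-document instead of files.txt so the example is self-contained, and the key is assumed to contain no regex metacharacters. The regex is stored in a variable and used unquoted, which is the form that behaves consistently on bash 3.2 and later.

```bash
#!/usr/bin/env bash
# get_field LINE KEY  (hypothetical helper for this sketch)
# Prints the value of KEY from a "key=value; key=value" line.
get_field() {
  local line=$1 key=$2
  # KEY must start the line or follow "; "; capture everything up to the next ";".
  local re="(^|; )${key}=([^;]+)"
  [[ $line =~ $re ]] && printf '%s\n' "${BASH_REMATCH[2]}"
}

# Read the metadata line by line, as in the question's loop.
while IFS= read -r line; do
  get_field "$line" cell
done <<'EOF'
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
EOF
```

Anchoring the key to the start of the line or to "; " avoids accidentally matching a longer key that merely ends in the same letters.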
The ksh syntax is the same as in bash, except that the special array variable holding the matches is named .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'
The zsh syntax is also similar to bash, except:
The regex must be quoted if it contains characters that are special to the shell, such as ;. zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
The matches are stored in different variables: $MATCH contains the entire matched string, and $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements).
[[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'
Multi-shell implementation of the =~ operator as shell function reMatch
The following shell function abstracts away the differences between bash, ksh, and zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.
As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.
Applied to our example we'd get:
# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS
reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}
Shell function:
# SYNOPSIS
# reMatch string regex
# DESCRIPTION
# Multi-shell implementation of the =~ regex-matching operator;
# works in: bash, ksh, zsh
#
# Matches STRING against REGEX and returns exit code 0 if they match.
# Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
# which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
# match is stored in the 1st element of ${reMatch[@]}, with matches for
# capture groups (parenthesized subexpressions), if any, stored in the remaining
# array elements.
# NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
# reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
  typeset ec
  unset -v reMatch # initialize output variable
  [[ $1 =~ $2 ]] # perform the regex test
  ec=$? # save exit code
  if [[ $ec -eq 0 ]]; then # copy result to output variable
    [[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
    [[ -n $KSH_VERSION ]] && reMatch=( "${.sh.match[@]}" )
    [[ -n $ZSH_VERSION ]] && reMatch=( "$MATCH" "${match[@]}" )
  fi
  return $ec
}
Note: function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.
I would not use cut, since you cannot specify more than one delimiter.
If your grep supports PCRE, then you can do:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC
You can use sed, which in simple terms can be done as follows (-r is the GNU sed flag for extended regular expressions; BSD/macOS sed uses -E instead):
$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
Another option is to use awk. With that you can do the following by specifying the list of delimiters you want to consider as field separators:
$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC
You can certainly add more checks by iterating over the fields of the line, so that you don't have to hard-code printing the 5th field.
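As a rough sketch of that idea, the awk program below loops over the fields and prints the value whose key matches, instead of relying on a fixed field number. The awk variable key and the choice of '; *' as the field separator are assumptions made for this illustration.

```bash
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Split on ";" followed by optional spaces, then look for the field
# whose name (the part before "=") equals the requested key.
cell=$(printf '%s\n' "$string" | awk -F'; *' -v key=cell '{
  for (i = 1; i <= NF; i++) {
    n = index($i, "=")                      # position of the first "="
    if (n > 0 && substr($i, 1, n - 1) == key)
      print substr($i, n + 1)               # everything after the "="
  }
}')

echo "$cell"
```

Because the key is looked up by name, the command keeps working if the order of the fields changes.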
Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.
$ echo "$string" | cmd