
retrieve a word after a regular expression in shell script

Tags:

regex

shell

I am trying to retrieve specific fields from a text file whose lines contain metadata like this:

project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN

I have the following script for retrieving the field 'cell':

while read line
do
cell="$(echo $line | cut -d";" -f2 )"
echo  $cell
done < files.txt

However, the script above retrieves the whole field, cell=ABC, whereas I just want the value 'ABC'. How do I retrieve the value after the regex in the same line of code?

AishwaryaKulkarni asked Mar 20 '14 15:03


People also ask

What is =~ in shell script?

An additional binary operator, =~, is available, with the same precedence as == and != . When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise.

Why * is used in regex?

* - means "0 or more instances of the preceding regex token"

What is BASH_REMATCH?

$BASH_REMATCH is an array that contains the matched text snippets. ${BASH_REMATCH[0]} contains the complete match. The remaining elements, e.g. ${BASH_REMATCH[1]}, contain the portions that were matched by () subexpressions.


2 Answers

If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~, in the form [[ string =~ regex ]].

Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.

Sample input string:

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.


bash

The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.

bash 3.2+:

[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'

bash 4.x:
While the specific command above works, regex literals are buggy in bash 4.x, notably with the word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match. Workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).
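Applied to the sample input, the intermediate-variable workaround might look like the sketch below. The variable name re is only an illustration; the regex is the same one used in the bash 3.2+ command above.

```shell
#!/usr/bin/env bash
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Keep the regex in a variable so bash never has to parse it as a regex literal.
re=' cell=([^;]+)'

# $re must stay unquoted here: in bash 3.2+ a quoted right-hand side
# is matched as a literal string rather than as a regex.
if [[ $string =~ $re ]]; then
  cell=${BASH_REMATCH[1]}
fi

echo "$cell"
```

Keeping the regex in a variable also avoids having to backslash-escape spaces and other shell metacharacters inside [[ ... ]].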

bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:

[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]}  # -> $cell == 'ABC'
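To tie this back to the loop in the question, here is a minimal sketch of running the bash match once per input line. The here-document stands in for files.txt, and its second and third lines are made-up sample data; the regex is a slight variation (an assumption, not part of the commands above) that also accepts cell= at the very start of a line, so the value ends up in capture group 2.

```shell
#!/usr/bin/env bash
# Group 1 anchors the key at line start or right after "; ";
# group 2 captures the value up to the next semicolon.
re='(^|; )cell=([^;]+)'

cells=()
while IFS= read -r line; do
  if [[ $line =~ $re ]]; then
    cells+=( "${BASH_REMATCH[2]}" )
  fi
done <<'EOF'
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
project=XYZ; strain=C3H; sex=M
cell=DEF; project=XYZ
EOF

# Lines without a cell field are skipped, so only two values are printed.
printf '%s\n' "${cells[@]}"
```

To process the real file, replace the here-document with done < files.txt.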

ksh

The ksh syntax is the same as in bash, except:

  • the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'

zsh

The zsh syntax is also similar to bash, except:

  • The regex literal must be quoted: either as a whole, for simplicity, or at least its shell metacharacters, such as ;.
    • You may, but needn't, double-quote a regex provided as a variable value.
    • Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
  • There are 2 variables containing the matching results:
    • $MATCH contains the entire matched string
    • array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
 [[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'

Multi-shell implementation of the =~ operator as shell function reMatch

The following shell function abstracts away the differences between bash, ksh, and zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.

As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.

Applied to our example we'd get:

# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS

reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}

Shell function:

# SYNOPSIS
#   reMatch string regex
# DESCRIPTION
#   Multi-shell implementation of the =~ regex-matching operator;
#   works in: bash, ksh, zsh
#
#   Matches STRING against REGEX and returns exit code 0 if they match.
#   Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
#   which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
#   match is stored in the 1st element of ${reMatch[@]}, with matches for
#   capture groups (parenthesized subexpressions), if any, stored in the remaining
#   array elements.
#   NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
#   reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
  typeset ec
  unset -v reMatch # initialize output variable
  [[ $1 =~ $2 ]] # perform the regex test
  ec=$? # save exit code
  if [[ $ec -eq 0 ]]; then # copy result to output variable
    [[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
    [[ -n $KSH_VERSION ]]  && reMatch=( "${.sh.match[@]}" )
    [[ -n $ZSH_VERSION ]]  && reMatch=( "$MATCH" "${match[@]}" )
  fi
  return $ec
}

Note:

  • function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.
mklement0 answered Oct 25 '22 03:10


I would not use cut, since you cannot specify more than one delimiter.

If your grep supports PCRE, then you can do:

$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC

You can also use sed; a simple version is:

$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
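One caveat with the sed command above: when a line has no cell= field, the substitution does not apply and sed prints the whole line unchanged. A possible variation, sketched here with -n and the p flag so that only successful substitutions are printed, and with -E in place of -r (both GNU and BSD sed accept -E), would be:

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# -n suppresses the default output; the trailing p prints a line
# only if the substitution actually took place.
cell=$(printf '%s\n' "$string" | sed -n -E 's/.*cell=([^;]+).*/\1/p')
echo "$cell"

# A line without a cell field (made-up sample) now yields an empty result
# instead of echoing the entire line back.
missing=$(printf '%s\n' 'project=XYZ; sex=F' | sed -n -E 's/.*cell=([^;]+).*/\1/p')
echo "missing: '$missing'"
```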

Another option is to use awk. With it, you can specify a list of delimiters to treat as field separators:

$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC

You can certainly add more checks by iterating over the fields of the line, so that you don't have to hard-code printing the 5th field.
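A minimal sketch of that iteration, assuming the fields are always separated by a semicolon plus optional spaces and have the form name=value (the awk variable key is just an illustrative name):

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Split on ";" followed by optional spaces, then look for the field
# whose name (the part before the first "=") equals the requested key.
cell=$(printf '%s\n' "$string" | awk -F'; *' -v key=cell '
  {
    for (i = 1; i <= NF; i++) {
      pos = index($i, "=")            # position of the first "=" in this field
      if (pos > 0 && substr($i, 1, pos - 1) == key) {
        print substr($i, pos + 1)     # everything after the "="
      }
    }
  }')

echo "$cell"
```

Because the key is passed in with -v, the same command can fetch any other field (for example key=strain) regardless of its position in the line.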

Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.

$ echo "$string" | cmd
jaypal singh answered Oct 25 '22 02:10