retrieve a word after a regular expression in shell script

Tags:

regex

shell

I am trying to retrieve specific fields from a text file whose metadata looks like this:

project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN

I have the following script for retrieving the field 'cell':

while read line
do
  cell="$(echo "$line" | cut -d";" -f2)"
  echo "$cell"
done < files.txt

However, this script retrieves the whole field, cell=ABC, whereas I just want the value ABC. How do I retrieve the value after the regex in the same line of code?


asked Mar 20 '14 15:03

AishwaryaKulkarni

2 Answers

If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]]:

Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.

Sample input string:

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.

bash

The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.

bash 3.2+:

[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'

bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when the word-boundary assertions \< and \> are involved on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match. Workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).

bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:

[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]}  # -> $cell == 'ABC'
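To connect this back to the loop in the question, here is a minimal sketch of applying the same match to every line of a file. The temporary file and its second record are made up for illustration and merely stand in for files.txt. The regex is kept in a variable and expanded unquoted, as suggested in the workaround above.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for files.txt (the second record is invented).
input=$(mktemp)
printf '%s\n' \
  'project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN' \
  'project=XYZ; cell=DEF; strain=C3H; sex=M; age=PQR; treatment=None; id=MLO' > "$input"

re=' cell=([^;]+)'   # regex stored in a variable; expanded UNQUOTED in [[ ... ]]
cells=()             # collects the extracted values

while IFS= read -r line; do
  # On a match, capture group 1 holds the text after "cell=" up to the next ";".
  if [[ $line =~ $re ]]; then
    cells+=("${BASH_REMATCH[1]}")
  fi
done < "$input"

rm -f "$input"
printf '%s\n' "${cells[@]}"
```

Note that this regex, like the one above, expects a space before cell=, so it would not match a line in which cell is the very first field.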

ksh

The ksh syntax is the same as in bash, except:

the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):

[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'

zsh

The zsh syntax is also similar to bash, except:

- The regex literal must be quoted: either as a whole (simplest), or at least its shell metacharacters, such as ;.
  - You may, but needn't, double-quote a regex provided as a variable value.
  - Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
- There are 2 variables containing the matching results:
  - $MATCH contains the entire matched string
  - array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)

 [[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'

Multi-shell implementation of the `=~` operator as shell function `reMatch`

The following shell function abstracts away the differences between bash, ksh, and zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.

As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.

Applied to our example we'd get:

# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS

reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}

Shell function:

# SYNOPSIS
#   reMatch string regex
# DESCRIPTION
#   Multi-shell implementation of the =~ regex-matching operator;
#   works in: bash, ksh, zsh
#
#   Matches STRING against REGEX and returns exit code 0 if they match.
#   Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
#   which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
#   match is stored in the 1st element of ${reMatch[@]}, with matches for
#   capture groups (parenthesized subexpressions), if any, stored in the remaining
#   array elements.
#   NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
#   reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
  typeset ec
  unset -v reMatch # initialize output variable
  [[ $1 =~ $2 ]] # perform the regex test
  ec=$? # save exit code
  if [[ $ec -eq 0 ]]; then # copy result to output variable
    [[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
    [[ -n $KSH_VERSION ]]  && reMatch=( "${.sh.match[@]}" )
    [[ -n $ZSH_VERSION ]]  && reMatch=( "$MATCH" "${match[@]}" )
  fi
  return $ec
}

Note:

function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.


answered Oct 25 '22 03:10

mklement0

I would not use cut, since you cannot specify more than one delimiter.

If your grep supports PCRE, then you can do:

$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC

You can also use sed (-r enables extended regular expressions in GNU sed; BSD/macOS sed spells this option -E):

$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC

Another option is to use awk. With it you can specify a list of delimiters that should all be treated as field separators:

$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC

You can certainly add more checks by iterating over the fields of the line, so that you don't have to hard-code printing the 5th field.
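As a rough sketch of that idea (not part of the original answer), the awk program below splits the line on ";" followed by optional spaces and then looks for the field whose key matches. The variable name key and the use of index/substr are illustrative choices; other approaches would work equally well.

```bash
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Split on ";" plus any following spaces, then scan the key=value pairs
# for the requested key instead of relying on a fixed field number.
value=$(printf '%s\n' "$string" | awk -F'; *' -v key=cell '
  {
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")                  # position of the first "=" in this field
      if (eq > 0 && substr($i, 1, eq - 1) == key) {
        print substr($i, eq + 1)           # everything after the "="
        next                               # stop after the first match on this line
      }
    }
  }')

echo "$value"
```

Because the lookup is by key, it keeps working if the order of the fields changes; pass a different -v key=... to extract another field.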

Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.

$ echo "$string" | cmd

answered Oct 25 '22 02:10

jaypal singh
