Capturing Groups From a Grep RegEx

Tags: grep, bash, shell

I've got this little script in sh (Mac OS X 10.6) to look through an array of files. Google has stopped being helpful at this point:

    files="*.jpg"
    for f in $files
        do
            echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
            name=$?
            echo $name
        done

So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that to a variable.

I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it – I'm new to shell and would like to attack this from the *nix purist angle.

Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?

Please explain what's going on, if you've got the time.

994

asked Dec 12 '09 00:12

Isaac

2 Answers

If you're using Bash, you don't even have to use grep:

    files="*.jpg"
    regex="[0-9]+_([a-z]+)_[0-9a-z]*"
    for f in $files    # unquoted in order to allow the glob to expand
    do
        if [[ $f =~ $regex ]]
        then
            name="${BASH_REMATCH[1]}"
            echo "${name}.jpg"    # concatenate strings
            name="${name}.jpg"    # same thing stored in a variable
        else
            echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
        fi
    done

It's better to put the regex in a variable. Some patterns won't work if included literally.
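
As a rough illustration of "won't work if included literally": in Bash 3.2 and later, quoting the right-hand side of =~ makes it a plain string comparison rather than a regex match. The sketch below assumes such a Bash version; the sample string and the variable names var_ok and quoted_ok are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical sample string, used only for illustration.
s="123_abc_x"
re='^[0-9]+_'

# Pattern held in a variable and expanded unquoted: treated as a regex.
if [[ $s =~ $re ]]; then var_ok=yes; else var_ok=no; fi

# Same pattern written inline in quotes: treated as a literal string
# (Bash 3.2+), so it only matches if $s contains the text ^[0-9]+_ itself.
if [[ $s =~ "^[0-9]+_" ]]; then quoted_ok=yes; else quoted_ok=no; fi

echo "variable: $var_ok, quoted literal: $quoted_ok"
```

Keeping the pattern in a variable sidesteps the question of which characters need escaping inside [[ ]].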

This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
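
To make the indexing concrete, here is a small sketch that prints what lands in each slot. It uses a variant of the regex with an extra capture group around the digits (my addition, so that there is a second group to show) and a made-up filename. It assumes Bash, since =~ is not available in plain sh.

```shell
#!/usr/bin/env bash
# Hypothetical filename, used only for illustration.
f="123_abc_d4e5.jpg"
# Variant of the regex above with a second group around the digits.
regex='([0-9]+)_([a-z]+)_[0-9a-z]*'

if [[ $f =~ $regex ]]; then
    whole="${BASH_REMATCH[0]}"    # the full match
    digits="${BASH_REMATCH[1]}"   # first capture group
    name="${BASH_REMATCH[2]}"     # second capture group
    echo "whole=$whole digits=$digits name=$name"
fi
# prints: whole=123_abc_d4e5 digits=123 name=abc
```

Note that the full match stops before ".jpg" because the dot is not in any of the character classes.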

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

    123_abc_d4e5
    xyz123_abc_d4e5
    123_abc_d4e5.xyz
    xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
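
If you want to check the effect of the anchors yourself, something like the following should do it. It feeds the four sample strings to grep -E (used here only because it runs in any POSIX shell; the same patterns apply to =~). The variable names are just for the demonstration.

```shell
# The four sample strings listed earlier, one per line.
samples='123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz'

# Count of lines matched by the unanchored pattern.
unanchored=$(printf '%s\n' "$samples" | grep -cE '[0-9]+_([a-z]+)_[0-9a-z]*')
# Lines that survive once the caret is added.
start_only=$(printf '%s\n' "$samples" | grep -E '^[0-9]+_([a-z]+)_[0-9a-z]*')
# Lines that survive with both the caret and the dollar sign.
both_ends=$(printf '%s\n' "$samples" | grep -E '^[0-9]+_([a-z]+)_[0-9a-z]*$')

echo "unanchored matches: $unanchored"
echo "start anchor only:"
echo "$start_only"
echo "both anchors:"
echo "$both_ends"
```

The unanchored pattern matches all four lines, the caret leaves the first and third, and adding the dollar sign leaves only the first.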

If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):

name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg

The \K operator (which works like a variable-length look-behind) causes the preceding pattern to match, but doesn't include that match in the result. The fixed-length equivalent is (?<=...), where the pattern goes before the closing parenthesis. You must use \K if the quantifiers in that pattern may match strings of different lengths (e.g. +, *, {2,4}).

The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.

In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.

The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
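
Here is a side-by-side sketch of \K and a fixed-length look-behind on a made-up filename. It assumes a grep that accepts -P (GNU grep built with PCRE support); as far as I know the BSD grep shipped with recent macOS does not. The variable names with_k and with_lb are only for illustration.

```shell
# Hypothetical filename, used only for illustration.
f="123_abc_d4e5.jpg"

# \K drops everything matched so far, so the run of digits may be any length.
with_k=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)')

# A look-behind needs a fixed length, so this one checks only for a single
# digit followed by an underscore immediately before the letters.
with_lb=$(echo "$f" | grep -Po '(?i)(?<=[0-9]_)[a-z]+(?=_)')

# Concatenation is just writing the pieces next to each other.
name="${with_k}.jpg"

echo "with_k=$with_k with_lb=$with_lb name=$name"
# prints: with_k=abc with_lb=abc name=abc.jpg
```

Both forms give the same result here; the look-behind version is simply less strict about what precedes the letters.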

155

answered Sep 19 '22 05:09

Dennis Williamson

This isn't really possible with pure grep, at least not generally.

But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).

Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:

echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'

The first grep would remove any lines that didn't match your overall pattern, the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
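
One thing to watch for when trying this on real filenames: the first grep passes the whole line through, so if anything after the second underscore contains letters, the second grep prints those too. A possible workaround, sketched below on a made-up filename, is to give the first grep -o as well so that only the matched prefix reaches the second one.

```shell
# Hypothetical filename, used only for illustration.
f="123_abc_d4e5.jpg"

# Without -o on the first grep the whole line is passed on, so the second
# grep prints every run of letters it finds, one per line.
all_runs=$(echo "$f" | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+')

# With -o on both greps, only "123_abc_" reaches the second grep.
name=$(echo "$f" | grep -oEi '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+')

echo "$all_runs"
echo "name=$name"
```

Here the first pipeline prints four lines (abc, d, e, jpg), while the second stores just abc in $name.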

(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
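
Spelled out, the grep + cut idea might look like the sketch below. The function name extract_name and the sample filenames are made up for illustration, and the pattern is the simplified one from above with a caret added.

```shell
# Hypothetical helper: prints the second underscore-separated field of
# names that match the pattern, and nothing for names that do not.
extract_name() {
    echo "$1" | grep -Ei '^[0-9]+_[a-z]+_' | cut -d _ -f 2
}

for f in "123_abc_d4e5.jpg" "holiday.jpg"; do
    name=$(extract_name "$f")
    if [ -n "$name" ]; then
        echo "$f -> ${name}.jpg"    # concatenation: just adjacent strings
    else
        echo "$f does not match" >&2
    fi
done
```

The grep acts as the filter and cut does the extraction, so a non-matching name simply yields an empty string.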

Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
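
For completeness, a sed version of the capture might look something like this. It is only a sketch: it assumes a sed that accepts -E for extended regexes (GNU sed and the BSD sed on OS X both do, as far as I know), and the filename is made up.

```shell
# Hypothetical filename, used only for illustration.
f="123_abc_d4e5.jpg"

# -n suppresses normal output; the trailing p prints the line only when the
# substitution succeeded, so non-matching names produce an empty result.
# \1 is whatever the first parenthesised group matched.
name=$(printf '%s\n' "$f" | sed -nE 's/^[0-9]+_([A-Za-z]+)_.*$/\1/p')

new_name="${name}.jpg"
echo "$new_name"
# prints: abc.jpg
```

Unlike grep, sed can refer back to a capture group directly, which is why a single command is enough here.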

answered Sep 19 '22 05:09

RobM

