Using ? with sed

Tags: linux, bash, sed

I just want to get the number from the name of a file that may or may not be gzip'd. However, it appears that regular expressions in sed do not support ?. Here's what I tried:

echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'

and nothing was returned. Then I added a ? to the string being analyzed:

echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'

and got:

1

So, it looks like the ? used in most regexes is not supported in sed, right? Well then, I would just like sed to give a 1 for both file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?

529

asked Dec 03 '10 17:12

User1

2 Answers

The equivalent to x? is \(x\|\).
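As a rough illustration of that equivalence applied to the file names from the question, here is a minimal sketch. It assumes GNU sed (the `\|` operator in basic regular expressions is a GNU extension) and assumes the wanted part consists only of digits; `number_bre` is a hypothetical helper name, not something from the question.

```shell
# Assumes GNU sed: \| inside a basic regular expression is a GNU extension.
# number_bre is a hypothetical helper name used only for this sketch.
number_bre() {
  # \(\.gz\|\) matches either ".gz" or nothing, i.e. an optional ".gz".
  # [0-9]* (an assumption about the file names) stops the capture at the digits.
  printf '%s\n' "$1" | sed -n 's/.*_\([0-9]*\)\(\.gz\|\)$/\1/p'
}

number_bre 'file_1'      # prints 1
number_bre 'file_1.gz'   # prints 1
```

Restricting the capture to digits and anchoring with `$` sidesteps the greediness problem, at the cost of only working for numeric suffixes.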

However, many versions of sed support an option that enables "extended regular expressions", which include ?. In GNU sed the flag is -r. Note that this also makes unescaped parentheses do grouping. For example:

echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'

Actually, there's another bug in your regex: the greedy .* in the parens swallows the ".gz" if there is one, so the command above prints 1.gz rather than 1. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:

echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'

This tries to match with .gz, and only tries without it if that doesn't work. Only one of groups 2 and 4 will actually match anything (since they are on opposite sides of the same |), so we just concatenate them to get the value we want.
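Since the question stresses execution time, it may be worth comparing against two simpler alternatives. The sketch below is only a suggestion, not part of the approach above: `file_number` uses shell parameter expansion and so avoids starting a sed process at all, while `file_number_sed` strips the suffix in a separate step and therefore needs neither `?` nor extended syntax. Both function names are hypothetical.

```shell
# Hypothetical helper: text after the last "_" with an optional ".gz" removed,
# using only parameter expansion (no external process is started).
file_number() {
  local name=${1%.gz}            # drop a trailing ".gz" if present
  printf '%s\n' "${name##*_}"    # drop everything up to the last "_"
}

# Hypothetical helper: the same result with two plain sed substitutions.
file_number_sed() {
  printf '%s\n' "$1" | sed 's/\.gz$//; s/.*_//'
}

file_number 'file_1.gz'       # prints 1
file_number_sed 'file_1'      # prints 1
```

In a loop over many file names, the parameter-expansion version is likely to be noticeably faster because it never forks, though that is worth measuring on the actual workload.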

199

answered Sep 25 '22 04:09

Laurence Gonsalves

If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.

If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:

By default, sed uses POSIX "basic regular expression" syntax, in which an unescaped ? matches a literal question mark. GNU sed gives the question mark its special meaning when it is escaped as \? (a GNU extension to basic regular expressions). As an alternative, you can use the -r or --regexp-extended option to use "extended regular expression" syntax, which reverses the meaning of escaped and non-escaped special characters, including ?.

In the words of the GNU sed documentation (view by running 'info sed' on Linux):

The only difference between basic and extended regular expressions is in the behavior of a few characters: '?', '+', parentheses, and braces ('{}'). While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character.
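The difference described in that passage can be checked with a small sketch like the following. It assumes GNU sed (both `\?` in basic mode and the `-r` flag are GNU features), and the three function names are hypothetical, chosen only for this illustration.

```shell
# Assumes GNU sed. Function names are hypothetical.
bre_optional() { printf '%s\n' "$1" | sed    's/ab\?c/X/'; }  # basic syntax: \? means "optional b"
ere_optional() { printf '%s\n' "$1" | sed -r 's/ab?c/X/';  }  # extended syntax: ? means "optional b"
bre_literal()  { printf '%s\n' "$1" | sed    's/ab?c/X/';  }  # basic syntax: ? is a literal "?"

bre_optional 'ac'     # prints X
ere_optional 'abc'    # prints X
bre_literal  'ab?c'   # prints X
bre_literal  'abc'    # prints abc (no match, line is unchanged)
```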

The option itself is documented as follows:

-r --regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that `egrep' accepts; they can be clearer because they usually have less backslashes, but are a GNU extension and hence scripts that use them are not portable.

Update

Newer versions of GNU sed now say this:

-E -r --regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that 'egrep' accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the '-E' extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use '-E' for portability. GNU sed has accepted '-E' as an undocumented option for years, and *BSD seds have accepted '-E' for years as well, but scripts that use '-E' might not port to other older systems.

So, if you need to preserve compatibility with ancient GNU sed, stick with -r. If you prefer better cross-platform portability on more modern systems (e.g. Linux and macOS), go with -E. Note that other quirks and differences between GNU sed and BSD sed remain, so you still have to check that your scripts are portable in any case.
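If a script has to run on systems where it is unclear which flag is available, one option is to probe at startup. The following is a minimal sketch under that assumption; `sed_ere_flag` is a hypothetical helper, and the probe expression is arbitrary.

```shell
# Hypothetical helper: print whichever extended-regex flag the local sed accepts.
sed_ere_flag() {
  if printf 'a\n' | sed -E 's/(a)?/b/' >/dev/null 2>&1; then
    printf '%s\n' '-E'
  elif printf 'a\n' | sed -r 's/(a)?/b/' >/dev/null 2>&1; then
    printf '%s\n' '-r'
  else
    return 1    # no extended-regex flag was accepted
  fi
}

# Usage: extract the number with an optional ".gz" suffix.
flag=$(sed_ere_flag) &&
  printf 'file_1.gz\n' | sed "$flag" 's/.*_([0-9]+)(\.gz)?$/\1/'
```

The probe costs one or two extra sed invocations, so it is best done once and the result reused.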

answered Sep 25 '22 04:09

amichair
