I just want to get the number from the name of a file that may or may not be gzipped. However, it appears that regular expressions in sed do not support ?. Here's what I tried:
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and nothing was returned. Then I added a ? to the string being analyzed:
echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and got:
1
So, it looks like the ? used in most regexes is not supported in sed, right? Well then, I would just like sed to give 1 for both file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?
The equivalent to x? is \(x\|\).
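As a small illustration of that equivalence, here is a sketch that makes a single character optional using only basic-regex syntax. It assumes GNU sed, because \| inside a basic regular expression is a GNU extension, and optional_u is a hypothetical helper name used only for this example.

```shell
# Assumption: GNU sed (\| in a basic regex is a GNU extension).
# optional_u is a hypothetical helper, not part of sed.
optional_u() {
  # \(u\|\) matches either "u" or the empty string, i.e. an optional "u"
  printf '%s\n' "$1" | sed -n 's/^colo\(u\|\)r$/matched/p'
}

optional_u color     # one alternative is empty, so this matches
optional_u colour    # the "u" alternative matches
optional_u colouur   # prints nothing: at most one "u" is allowed
```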
However, many versions of sed support an option to enable "extended regular expressions", which include ?. In GNU sed the flag is -r. Note that this also changes unescaped parens to do grouping, e.g.:
echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'
Actually, there's another bug in your regex: the greedy .* in the parens is going to swallow up the ".gz" if there is one. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:
echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'
This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |) so we just concatenate them to get the value we want.
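Another way to sidestep the greedy .* is to stop the capture at the first dot, or to skip sed entirely. The sketch below is not from the answer above: num_sed and num_shell are hypothetical helper names, and both assume the number itself contains no "." and that the only suffix to strip is ".gz". The sed variant also assumes a sed that accepts -E.

```shell
# Hypothetical helpers; assume the number contains no "." and the only suffix is ".gz".
num_sed() {
  # [^.]* cannot run past the dot, so the optional (\.gz)? still gets to match
  printf '%s\n' "$1" | sed -E 's/.*_([^.]*)(\.gz)?$/\1/'
}

num_shell() {
  # No external process: strip a trailing ".gz", then everything up to the last "_"
  name=${1%.gz}
  printf '%s\n' "${name##*_}"
}

num_sed file_1.gz
num_sed file_1
num_shell file_1.gz
num_shell file_1
```

Since the question mentions execution time, the parameter-expansion version may be worth measuring: it avoids starting a sed process for every file name.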
If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.
If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:
By default, sed uses the "POSIX basic regular expressions" syntax, so in GNU sed the question mark must be escaped as \? to apply its special meaning; otherwise it matches a literal question mark. As an alternative, you can use the -r or --regexp-extended option to use the "extended regular expression" syntax, which reverses the meaning of escaped and non-escaped special characters, including ?.
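A quick way to see the two syntaxes side by side is to run the same idea through both. This is only a sketch with made-up input strings; it assumes GNU sed, since \? in a basic regular expression is a GNU extension, and it uses -E where -r would behave the same in GNU sed.

```shell
# Assumption: GNU sed (\? in a basic regex is a GNU extension).

# Basic syntax: a bare ? is a literal question mark
printf '%s\n' 'ab?' | sed 's/b?/X/'

# Basic syntax: \? means "zero or one of the preceding item"
printf '%s\n' 'ac' | sed 's/ab\?c/X/'

# Extended syntax: the roles swap, so a bare ? is the quantifier...
printf '%s\n' 'ac' | sed -E 's/ab?c/X/'

# ...and \? is the literal question mark
printf '%s\n' 'ab?' | sed -E 's/b\?/X/'
```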
In the words of the GNU sed documentation (view by running 'info sed' on Linux):
The only difference between basic and extended regular expressions is in the behavior of a few characters: '?', '+', parentheses, and braces ('{}'). While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character.
and the option is explained:
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that `egrep' accepts; they can be clearer because they usually have less backslashes, but are a GNU extension and hence scripts that use them are not portable.
Update
Newer versions of GNU sed now say this:
-E
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that 'egrep' accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the '-E' extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use '-E' for portability. GNU sed has accepted '-E' as an undocumented option for years, and *BSD seds have accepted '-E' for years as well, but scripts that use '-E' might not port to other older systems.
So, if you need to preserve compatibility with ancient GNU sed, stick with -r. But if you prefer better cross-platform portability on more modern systems (e.g. Linux+Mac support), go with -E (but note that there are still some quirks and differences between GNU sed and BSD sed, so you'll have to make sure your scripts are portable in any case).
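If a script has to run on both old and new systems, one option is to probe for the flag at startup instead of hard-coding it. The following is a minimal sketch of that idea, not an established recipe: sed_ere_flag is a hypothetical variable name, and the probe only checks that -E is accepted, not that the rest of the script is portable.

```shell
# Hypothetical probe: use -E if the local sed accepts it, otherwise fall back to -r.
if printf 'a\n' | sed -E 's/(a)/\1/' >/dev/null 2>&1; then
  sed_ere_flag=-E
else
  sed_ere_flag=-r
fi

# Use the detected flag for an extended-regex substitution
printf '%s\n' 'file_1.gz' | sed "$sed_ere_flag" 's/.*_([^.]*)(\.gz)?$/\1/'
```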