
Using ? with sed

Tags: linux, bash, sed

I just want to get the number from a file name that may or may not be gzipped. However, it appears that regular expressions in sed do not support ?. Here's what I tried:

echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'

and nothing was returned. Then I added a ? to the string being analyzed:

echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'

and got:

1

So it looks like the ? used in most regexes is not supported in sed, right? In any case, I would just like sed to give 1 for both file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?

User1 asked Dec 03 '10 17:12



2 Answers

In sed's default (basic) syntax, the equivalent of x? is \(x\|\). Note that \| is itself a GNU extension there.

However, many versions of sed support an option to enable "extended regular expressions", which include ?. In GNU sed the flag is -r. Note that this also makes unescaped parens do grouping, e.g.:

echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p' 
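The basic-syntax spelling of an optional item mentioned above can be sketched on a simpler string. This is a minimal illustration, not part of the original answer: the helper names are hypothetical, the \| form assumes GNU sed, and the \{0,1\} interval is included as the POSIX basic-syntax way to say "zero or one".

```shell
# Hypothetical helper names, for illustration only.

# "u or nothing" via alternation; \| in basic syntax is a GNU sed extension.
optional_u_alternation() { sed 's/colo\(u\|\)r/X/g'; }

# "zero or one u" via an interval, which POSIX basic syntax defines.
optional_u_interval() { sed 's/colou\{0,1\}r/X/g'; }

echo 'color colour' | optional_u_alternation   # prints: X X
echo 'color colour' | optional_u_interval      # prints: X X
```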

Actually, there's another bug in your regex: the greedy .* in the parens is going to swallow up the ".gz" if there is one, so the command above prints 1.gz rather than 1. sed doesn't have a non-greedy equivalent of * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:

echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/' 

This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |) so we just concatenate them to get the value we want.
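As a possible alternative to the alternation trick, the greedy capture can be kept away from ".gz" with a negated character class, and the sed process can be skipped entirely with shell parameter expansion, which is likely cheaper when execution time matters. This is a sketch under assumptions that are not in the original answer: the function names are hypothetical, the sed variant assumes the number itself contains no ".", and -E assumes a sed that accepts that flag.

```shell
# Hypothetical helper names, for illustration only.

# sed variant: [^.]* cannot run into ".gz", so the optional group gets a chance to match.
# Assumes the number contains no "." and that this sed accepts -E.
num_with_sed() {
  printf '%s\n' "$1" | sed -E 's/.*_([^.]*)(\.gz)?$/\1/'
}

# Pure-shell variant: no external process is started.
num_with_bash() {
  local name=${1%.gz}            # drop a trailing ".gz" if present
  printf '%s\n' "${name##*_}"    # drop everything up to and including the last "_"
}

num_with_sed  'file_1.gz'   # prints: 1
num_with_bash 'file_1'      # prints: 1
```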

Laurence Gonsalves answered Sep 25 '22 04:09


If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.

If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:

By default, sed uses the "POSIX basic regular expressions" syntax, in which an unescaped question mark matches a literal question mark. GNU sed accepts the escaped form \? as "zero or one" in this mode (a GNU extension to the basic syntax). As an alternative, you can use the -r or --regexp-extended option to use the "extended regular expressions" syntax, which reverses the meaning of escaped and non-escaped special characters, including ?.
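The swap between escaped and unescaped ? can be seen on a small test string. This is a minimal sketch assuming GNU sed; the variable name sample is only illustrative.

```shell
# Assumes GNU sed; "sample" is an illustrative variable name.
sample='ac abc ab?c'

# Basic syntax (default): an unescaped ? is a literal question mark.
echo "$sample" | sed 's/ab?c/X/g'       # prints: ac abc X

# Basic syntax, GNU sed: \? means "zero or one b".
echo "$sample" | sed 's/ab\?c/X/g'      # prints: X X ab?c

# Extended syntax (-r): the roles are reversed.
echo "$sample" | sed -r 's/ab?c/X/g'    # prints: X X ab?c
echo "$sample" | sed -r 's/ab\?c/X/g'   # prints: ac abc X
```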

In the words of the GNU sed documentation (view by running 'info sed' on Linux):

The only difference between basic and extended regular expressions is in the behavior of a few characters: '?', '+', parentheses, and braces ('{}'). While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character.

and the option is explained:

-r --regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that `egrep' accepts; they can be clearer because they usually have less backslashes, but are a GNU extension and hence scripts that use them are not portable.

Update

Newer versions of GNU sed now say this:

-E -r --regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that 'egrep' accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the '-E' extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use '-E' for portability. GNU sed has accepted '-E' as an undocumented option for years, and *BSD seds have accepted '-E' for years as well, but scripts that use '-E' might not port to other older systems.

So, if you need to preserve compatibility with ancient GNU sed, stick with -r. But if you prefer better cross-platform portability on more modern systems (e.g. Linux+Mac support), go with -E (but note that there are still some quirks and differences between GNU sed and BSD sed, so you'll have to make sure your scripts are portable in any case).
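One way a script might act on this advice is to probe for -E once and fall back to -r. This is only a sketch: the variable name SED_ERE is hypothetical, and the probe assumes that a sed without -E support exits with an error on the unknown option.

```shell
# SED_ERE is a hypothetical variable name holding whichever flag this sed accepts.
if printf 'x\n' | sed -E 's/(x)?/y/' >/dev/null 2>&1; then
  SED_ERE=-E    # preferred: the more portable spelling on modern systems
else
  SED_ERE=-r    # fallback for older GNU sed
fi

# Use the detected flag; [^.]* assumes the number contains no ".".
printf 'file_1.gz\n' | sed "$SED_ERE" 's/.*_([^.]*)(\.gz)?$/\1/'
```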

amichair answered Sep 25 '22 04:09