 

How to match a "something or nothing" in a bash regex?

Tags:

regex

bash

I'm having trouble capturing the digits in a string of this format (t|b|bug_|task_|)1234 using a bash regex. The below doesn't work:

[[ $current_branch =~ ^(t|b|bug_|task_|)([0-9]+) ]]

But once I change it to something like this:

[[ $current_branch =~ ^(t|b|bug_|task_)([0-9]+) ]]

it works, but of course it's wrong because it doesn't cover the case where there is no prefix. I realize in this case I could do

[[ $current_branch =~ ^(t|b|bug_|task_)?([0-9]+) ]]

and achieve the same result, but I'd like to know why the 1st example doesn't work. That regex seems to work fine in Ruby, for example.

(This is on GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11), OSX Lion)

Suan asked May 14 '12 03:05


1 Answer

I'm certain that the difference between the working and non-working versions of the regex comes from different ways of reading regex(7). I'm going to quote the whole relevant part, because I think it goes to the heart of your issue:


Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of ed(1); POSIX.2 "basic" REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX.2 leaves some aspects of RE syntax and semantics open; "(!)" marks decisions on these aspects that may not be fully portable to other POSIX.2 implementations.

A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'. It matches anything that matches one of the branches.

A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by a single(!) '*', '+', '?', or bound. An atom followed by '*' matches a sequence of 0 or more matches of the atom. An atom followed by '+' matches a sequence of 1 or more matches of the atom. An atom followed by '?' matches a sequence of 0 or 1 matches of the atom.

A bound is '{' followed by an unsigned decimal integer, possibly followed by ',' possibly followed by another unsigned decimal integer, always followed by '}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.

An atom is a regular expression enclosed in "()" (matching a match for the regular expression), an empty set of "()" (matching the null string)(!), a bracket expression (see below), '.' (matching any single character), '^' (matching the null string at the beginning of a line), '$' (matching the null string at the end of a line), a '\' followed by one of the characters "^.[$()|*+?{\" (matching that character taken as an ordinary character), a '\' followed by any other character(!) (matching that character taken as an ordinary character, as if the '\' had not been present(!)), or a single character with no other significance (matching that character). A '{' followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with '\'.


OK, there's quite a lot here to unpack. First of all, note that the "(!)" symbol means that there is an open or non-portable issue.

The essential issue is in the second paragraph of the quote:

A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'.

In your case, the last alternative inside the first group (the one after the final '|') is an empty branch. As you can see from the "(!)", the empty branch is an open or non-portable issue. I think this is why it works on some systems but not on others. (I tested it on Cygwin 4.1.10(4)-release, and it did not work, then on Linux 3.2.25(1)-release, and it did. The two systems have equivalent, but not identical, man pages for regex(7).)
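One way to see which behavior a given system has is to run the patterns and look at the exit status of [[ ... ]]. The bash manual documents a return value of 2 when the regular expression is syntactically incorrect, so a rejected pattern can be told apart from a plain non-match. The following is only a sketch: probe_regex is a hypothetical helper name, and what the first call prints depends on the platform's regex library.

```bash
#!/usr/bin/env bash
# probe_regex is a hypothetical helper (not part of bash).
# It reports how [[ =~ ]] treats a pattern on this system.
probe_regex() {
    local re=$1 subject=$2
    # Keep the pattern in a variable and leave it unquoted on the
    # right-hand side so bash treats it as a regex, not a literal.
    [[ $subject =~ $re ]]
    case $? in
        0) echo "match" ;;
        1) echo "no match" ;;
        2) echo "rejected by regex library" ;;
    esac
}

# Empty branch after the last '|': the result varies by platform.
probe_regex '^(t|b|bug_|task_|)([0-9]+)' '1234'

# Optional group instead of an empty branch.
probe_regex '^(t|b|bug_|task_)?([0-9]+)' '1234'   # prints "match"
```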

Assuming that branches must be nonempty, a branch can be a piece, which can be an atom.

An atom can be "an empty set of "()" (matching the null string)(!)". <sarcasm>Well, that's really helpful.</sarcasm> So, POSIX specifies a regular expression for the empty string, i.e. (), but also appends a "(!)", to say that this is an open issue, or not portable.

Since what you're looking for is a branch that matches the empty string, try

[[ $current_branch =~ ^(t|b|bug_|task_|())([0-9]+) ]]

which uses the () regex to match the empty string. (This worked for me in my Cygwin 4.1.10(4)-release shell, where your original regex didn't.)
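One detail to watch with this workaround: the extra () is itself a capture group, so the digits move from BASH_REMATCH[2] to BASH_REMATCH[3]. A minimal sketch, assuming the local regex library accepts () as described above; branch_digits is a hypothetical function name and the sample branch names are made up for illustration.

```bash
#!/usr/bin/env bash
# branch_digits is a hypothetical helper name.
# Groups are numbered by their opening parenthesis:
#   1 = (t|b|bug_|task_|())   the prefix, possibly empty
#   2 = ()                    the explicit empty-string atom
#   3 = ([0-9]+)              the digits
branch_digits() {
    local re='^(t|b|bug_|task_|())([0-9]+)'
    if [[ $1 =~ $re ]]; then
        echo "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

branch_digits 'bug_1234'
branch_digits '1234'
```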

However, while (hopefully) this suggestion will work for you in your current setup, there's no guarantee that it will be portable. Sorry to disappoint.
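If portability matters more than keeping the original shape of the pattern, the '?' form from the question avoids both the empty branch and the empty () atom, since it only relies on "an atom followed by '?'" from the quoted text. A short sketch along those lines; branch_number is a hypothetical function name and the branch names in the loop are illustrative.

```bash
#!/usr/bin/env bash
# branch_number is a hypothetical helper name.
# The prefix group is made optional with '?', so no branch is empty.
branch_number() {
    local re='^(t|b|bug_|task_)?([0-9]+)'
    [[ $1 =~ $re ]] || return 1
    # Group 1 is the optional prefix, group 2 is the digits.
    echo "${BASH_REMATCH[2]}"
}

# Each of these prints 1234.
for name in 1234 t1234 bug_1234 task_1234; do
    branch_number "$name"
done
```

The function returns a non-zero status when the name does not start with an optional prefix followed by digits, so it can be used directly in an if or with ||.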

JXG answered Nov 15 '22 06:11