BASH regexp matching - including brackets in a bracketed list of characters to match against?

Tags: regex, bash

I'm trying to do a tiny bash script that'll clean up the file and folder names of downloaded episodes of some tv shows I like. They often look like "[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE", and I basically just want to strip out that speedcd advertising bit.

It's easy enough to remove www.Speed.Cd, spaces, and dashes using regexp matching in BASH, but for the life of me, I cannot figure out how to include the brackets in a list of characters to be matched against. [- [] doesn't work, neither does [- \[], [- \\[], [- \\\[], or any number of escape characters preceding the bracket I want to remove.

Here's what I've got so far:

[[ "$newfile" =~ ^(.*)([- \[]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[- \]]*)(.*)$ ]] &&
    newfile="${BASH_REMATCH[1]}${BASH_REMATCH[4]}"

But it breaks on the brackets.

Any ideas?

TIA, Daniel :)

EDIT: I should probably note that I'm using "shopt -s nocasematch" to ensure case insensitive matching, just in case you're wondering :)

EDIT 2: Thanks to all who contributed. I'm not 100% sure which answer was to be the "correct" one, as I had several problems with my statement. Actually, the most accurate answer was just a comment to my question posted by jw013, but I didn't get it at the time because I hadn't understood yet that spaces should be escaped. I've opted for aefxx's as that one basically says the same, but with explanations :) Would've liked to put a correct answer mark on ormaaj's answer, too, as he spotted more grave issues with my expression.

Anyway, the approach I was using above, trying to match and extract the parts to keep and leave behind the unwanted ones is really not very elegant, and won't catch all cases, not even something really simple like "Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]". I've instead rewritten it to match and extract just the unwanted parts and then do string replacement of those on the original string, like so (loop is in case there's multiple brandings):

# Remove common torrent site brandings, including surrounding spaces, brackets, etc.:
while [[ "$newfile" =~ ([[\ {\(-]*(www\.)?(torrentday\.com|torrenting\.com|spastikustv|speed\.cd|moviesp2p\.com|publichd\.org|publichd|scenetime\.com|kingdom-release)[]\ }\)-]*) ]]; do
    newfile=${newfile//"${BASH_REMATCH[1]}"/}
done

DanielSmedegaardBuus asked Apr 16 '12 21:04


3 Answers

Ok, this is the first time I've heard of the =~ operator but nevertheless here's what I found by trial and error:

if [[ $newfile =~ ^(.*)([-[:space:][]*(what|ever)[][:space:]-]*)(.*)$ ]]
                        ^^^^^^^^^^^^^            ^^^^^^^^^^^^^

Looks strange but actually does work (just tested it).

EDIT
Quote from the Linux man pages regex(7):

To include a literal ] in the list, make it the first character (following a possible ^). To include a literal -, make it the first or last character, or the second endpoint of a range. To use a literal '-' as the first endpoint of a range, enclose it in "[." and ".]" to make it a collating element (see below). With the exception of these and some combinations using '[' (see next paragraphs), all other special characters, including '\', lose their special significance within a bracket expression.
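
As a minimal sketch of the rules quoted above (the helper name, variable name and sample strings are made up for illustration), the pattern below puts ] first and - last inside the bracket expression. It is stored in a variable so the shell does not try to parse the brackets or the space:

```bash
#!/usr/bin/env bash
# Matches strings made up only of: ] [ space -
# "]" is first in the list so it is literal; "-" is last so it is not a range.
delims_re='^[][ -]+$'

# Hypothetical helper: succeed if the argument consists only of those characters.
is_only_delims() {
    [[ $1 =~ $delims_re ]]
}

for sample in '[ ]' ' - ' '[x]'; do
    if is_only_delims "$sample"; then
        echo "'$sample' matches"
    else
        echo "'$sample' does not match"
    fi
done
```

The first two samples should match and '[x]' should not, since x is not in the list.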

aefxx answered Sep 22 '22 19:09


Whenever you're matching a regex with =~, it's most compatible between Bash versions to put the regex in a variable and expand it unquoted, as in [[ $newfile =~ $re ]], even if you do manage to dodge all the pitfalls of writing it directly in the test expression. See http://mywiki.wooledge.org/BashPitfalls#if_.5B.5B_.24foo_.3D.2BAH4_.27some_RE.27_.5D.5D

Your current regex looks like you're trying to optionally match anything preceding the opening bracket. I'd guess you're actually trying to capture just what your groups 3 and 4 hold (the site name and the rest of the file name), as in something like this:

$ shopt -s nocasematch
$ newfile='[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
$ re='^.*[-[:space:][]*(www\.torrenting\.com|spastikustv|www\.speed\.cd|moviesp2p\.com)[][:space:]-]*(.*)$'
$ [[ $newfile =~ $re ]]
$ declare -p BASH_REMATCH
declare -ar BASH_REMATCH='([0]="[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE" [1]="www.Speed.Cd" [2]="Some.Show.S07E14.720p.HDTV.X264-SOMEONE")'
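
The same variable-stored pattern can be combined with the match-and-remove loop from the question's second edit. This is only a sketch under assumptions: the function name strip_branding and the shortened site list are made up here, and it removes the whole match (BASH_REMATCH[0]) instead of a capture group:

```bash
#!/usr/bin/env bash
shopt -s nocasematch   # case-insensitive matching, as in the question

# Hypothetical helper: remove every branding, plus surrounding brackets,
# whitespace and dashes, wherever it appears in the name.
strip_branding() {
    local name=$1
    local sites='torrenting\.com|spastikustv|speed\.cd|moviesp2p\.com'
    # Kept in a variable, the pattern needs no shell escaping at all.
    local re="[-[:space:][]*(www\.)?($sites)[][:space:]-]*"
    # Loop in case there is more than one branding in the name.
    while [[ $name =~ $re ]]; do
        # Quoted, so the matched text is removed as a literal string.
        name=${name//"${BASH_REMATCH[0]}"/}
    done
    printf '%s\n' "$name"
}

strip_branding '[ www.Speed.Cd ] - Some.Show.S07E14.720p.HDTV.X264-SOMEONE'
strip_branding 'Some.Show.S07E14.720p.HDTV.X264-SOMEONE - [ www.Speed.Cd ]'
```

Both calls should print Some.Show.S07E14.720p.HDTV.X264-SOMEONE. Note that a branding sitting between two words would be removed together with its separators, joining the words, so a real script might want to put a single separator back.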
ormaaj answered Sep 22 '22 19:09


The basic issue is quite simple, if not obvious.
A Bash regex is parsed by the shell like any other unquoted word, and it cannot be protected by double quotes: since Bash 3.2, quoted parts of the pattern are matched as literal text, not as regex. This means that every literal space (and tab, etc.) must be protected by a backslash \ ... end of story. The rest is just a case of getting your regex to suit your needs.

One other thing: use [\ [] and []\ ] to match [ and ] respectively inside a bracket expression (in this case along with a space).

example:

newfile="[ ]"
[[ "$newfile" =~ ^[\ []\ []\ ]$ ]] &&
    echo YES ||
    echo NO
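
To see the effect of quoting, here is a small sketch (the sample string and variable names are made up, and the behaviour assumed is that of Bash 3.2 or later without the compat31 option). It compares a backslash-escaped space, a quoted pattern, and a pattern kept in a variable:

```bash
#!/usr/bin/env bash
s='a b'

# Backslash-escaped space directly in the test: matches.
[[ $s =~ ^a\ b$ ]] && echo "escaped space: match"

# Quoted pattern: every character is literal, so "." no longer means
# "any character" and 'a b' does not match.
[[ $s =~ "a.b" ]] || echo "quoted pattern: no match"

# Pattern in a variable, expanded unquoted: no escaping needed, and "."
# keeps its regex meaning.
re='^a.b$'
[[ $s =~ $re ]] && echo "variable pattern: match"
```

All three echo lines should be printed. The variable form tends to be the easiest to read once the pattern contains spaces or brackets.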
Peter.O answered Sep 22 '22 19:09