
How to remove punctuation from a string with exceptions using regex in bash

Running echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d '[:punct:]' prints "Jiro Inagaki  Soul MediaBreeze", i.e. the ampersand and the underscore are removed along with everything else.

However, I want a regular expression that removes all punctuation except the underscore and the ampersand, i.e. I want "Jiro Inagaki & Soul Media_Breeze".

Following advice on character class subtraction from the sources listed at the bottom, I've tried replacing [:punct:] with the following:

  • [\p{P}\-[&_]]
  • [[:punct:]-[&_]]
  • (?![\&_])\p{P}
  • (?![\&_])[:punct:]
  • [[:punct:]&&[&_]]
  • [[:punct:]&&[^&_]]

... but I haven't gotten anything to work so far. Any help would be much appreciated!

Sources:

  • Regex: Match any punctuation character except . and _
  • https://www.rexegg.com/regex-quickstart.html
EET FEK asked Sep 14 '25 09:09


2 Answers

You can specify the punctuation marks you want removed, e.g.

$ echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d "[.,/\\-\=\+\{\[\]\}\!\@\#\$\%\^\*\'\\\(\)]"
Jiro Inagaki & Soul Media_Breeze

Or, alternatively,

$ echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -dc '[:alnum:] &_'
Jiro Inagaki & Soul Media_Breeze
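
One caveat with the complement form: -c applies to every character outside the listed set, so the trailing newline is deleted too. A minimal sketch that keeps newlines by adding [:space:] to the kept set, assuming ASCII input (strip_punct is a hypothetical helper name, not something from the answer):

```shell
# strip_punct is a hypothetical helper name used only for this sketch.
# Keep letters, digits, whitespace (including the newline), '&' and '_';
# delete every other character.
strip_punct() {
  printf '%s\n' "$1" | tr -dc '[:alnum:][:space:]&_'
}

strip_punct "Jiro. Inagaki' & Soul, Media_Breeze."
```

Because [:space:] also covers tabs and other whitespace, list only the space and newline explicitly if those should be the only whitespace kept.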
jared_mamrot answered Sep 17 '25 01:09


Posting my comment as an answer as requested by @jared_mamrot.

You can manually type out the set of punctuation you want to delete, leaving out _ and &. I took my punctuation set from the GNU docs on [:punct:]:

‘[:punct:]’ Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
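
The quoted list can be checked locally. The sketch below emits every ASCII code from 33 to 126 and keeps only what tr classifies as [:punct:] in the C locale; list_punct is a hypothetical helper name, and the result may differ in other locales:

```shell
# list_punct is a hypothetical helper name used only for this sketch.
# Emit each ASCII character from 33 to 126 via printf's %b octal escape,
# then delete everything that is NOT in [:punct:] (C locale forced).
list_punct() {
  i=33
  while [ "$i" -le 126 ]; do
    printf '%b' "\\0$(printf '%o' "$i")"
    i=$((i + 1))
  done | LC_ALL=C tr -cd '[:punct:]'
}

list_punct
echo
```

In the C locale this should print the same 32 characters that the GNU documentation lists.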

You can also look at the POSIX docs, which say that the character classes depend on the locale:

punct    <exclamation-mark>;<quotation-mark>;<number-sign>;\
         <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
         <left-parenthesis>;<right-parenthesis>;<asterisk>;\
         <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
         <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
         <greater-than-sign>;<question-mark>;<commercial-at>;\
         <left-square-bracket>;<backslash>;<right-square-bracket>;\
         <circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
         <vertical-line>;<right-curly-bracket>;<tilde>

$ echo 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"
abcd_

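The set of characters passed to tr is literal except for two cases. The backslash is written as \\ because tr itself treats a backslash as an escape character. The single quote is appended as a separate double-quoted string, "'", because a single quote cannot be escaped inside single quotes. (The sequence ,-. is technically parsed by tr as a range, but in ASCII that range contains exactly the comma, hyphen, and period, so the result is the same.)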
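
The quoting can be made a little easier to read by storing the set in a variable and appending the single quote as a backslash-escaped \' outside the quotes instead of "'". This is only a sketch of the same command; delete_set and strip_listed are hypothetical names:

```shell
# delete_set and strip_listed are hypothetical names used only for this sketch.
# Everything inside '...' is literal; the final \' sits outside the quotes and
# appends one single-quote character. tr reads \\ as one literal backslash.
delete_set='!"#$%()*+,-./:;<=>?@][\\^`{}|~'\'

strip_listed() {
  printf '%s\n' "$1" | tr -d "$delete_set"
}

strip_listed "Jiro. Inagaki' & Soul, Media_Breeze."
```

The set deliberately leaves out & and _, so both survive while the period, comma, and apostrophe are removed.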

I do prefer @jared_mamrot's complement solution where it can be used, though. It is much neater.

dosentmatter answered Sep 17 '25 01:09