
Removing punctuation and tabs with sed

Tags: bash, macos, sed, tr

I am using the following to remove punctuation and tabs and to convert uppercase text to lowercase in a text file.

sed 's/[[:punct:]]//g' $HOME/file.txt | sed $'s/\t//g' | tr '[:upper:]' '[:lower:]'

Do I need two separate sed commands to remove punctuation and tabs, or can this be done with a single sed command?

Also, could someone explain what the $ is doing in the second sed command? Without it the command doesn't remove tabs. I looked in the man page but I didn't see anything that mentioned this.

The input file looks like this:

Pochemu oni ne v shkole?
Kto tam?
Otkuda eto moloko?
Chei chai ona p’et?
    Kogda vy chitaete?
    Kogda ty chitaesh’?
I0_ol asked Feb 08 '17 08:02



1 Answer

Yes, a single sed invocation with multiple -e expressions can do all three steps. With FreeBSD sed (which the sed on macOS is based on):

sed -e $'s/\t//g' -e 's/[[:punct:]]//g' -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' file

The y command does the case conversion. The man page describes it as:

[2addr]y/string1/string2/
      Replace all occurrences of characters in string1 in the pattern 
      space with the corresponding characters from string2.
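
As a quick check of how y behaves, the sketch below lowercases one line taken from the question's sample input. The variable name lower is only for illustration.

```shell
# y/// maps each character of the first string to the character at the same
# position in the second string, so both strings must have the same length.
lower=$(printf '%s\n' 'Kto tam?' |
  sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/')

echo "$lower"   # prints: kto tam?
```

Unlike s///, y does not take a regular expression or a g flag; it always acts on every matching character in the pattern space.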

With GNU sed, the \L escape in the replacement text converts the match to lowercase, so the y command is not needed:

sed -e $'s/\t//g' -e "s/[[:punct:]]\+//g" -e "s/./\L&/g" file
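
The tab and punctuation removal can also be folded into one substitution, since a tab can sit inside the same bracket expression as [:punct:]. A minimal sketch, assuming bash (for the $'...' quoting) and keeping tr from the question for the case conversion; strip_and_lower is a hypothetical helper name, not part of either tool:

```shell
#!/usr/bin/env bash
# strip_and_lower: hypothetical helper that reads text on stdin, deletes
# punctuation and tabs in one sed expression, then lowercases with tr.
strip_and_lower() {
  # bash turns \t into a literal tab before sed sees the script, so the
  # bracket expression contains [:punct:] plus one real tab character.
  sed $'s/[[:punct:]\t]//g' | tr '[:upper:]' '[:lower:]'
}

printf '\tKogda vy chitaete?\n' | strip_and_lower   # prints: kogda vy chitaete
```

Because the tab is inserted by the shell, this form does not rely on sed understanding \t itself.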

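$'...' is bash's ANSI-C quoting: bash itself expands escape sequences such as \t inside it, so sed receives a literal tab character. It is a shell feature rather than a sed one, which is why the sed man page does not mention it. Without the $, sed receives a backslash followed by t, which, as you observed, your sed does not interpret as a tab.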
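
The difference between $'...' and plain single quotes is visible from the shell alone, without involving sed. A small sketch, assuming bash; the variable names are illustrative:

```shell
#!/usr/bin/env bash
ansi=$'a\tb'    # bash expands \t here: a, TAB, b
plain='a\tb'    # no expansion here: a, backslash, t, b

echo "${#ansi}"    # prints: 3
echo "${#plain}"   # prints: 4
```

The one-character difference in length is the backslash-t pair that single quotes leave untouched.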

Inian answered Oct 19 '22 15:10