
How to remove punctuation marks using awk?

Tags: shell, unix, awk

I need a shell command line that, given a text file "novel", prints each word on its own line together with the number of the line it appears on, and writes the result to a file called "words". The words must not include punctuation marks. This is what I have so far:

$ awk '{for(i=1; i<=NF; ++i) {printf $i "\t" NR "\n", $0 > "words"}}' novel

The file contains:

$ cat novel 
ver a don Quijote, y ellas le defendían la puerta:
-¿Qué quiere este mostrenco en esta casa?

Expected output:

ver 1
a 1
don 1
Quijote 1
...
puerta 1
Qué 2
...
casa 2

It's a very simple command for academic use.

Alex Martinez asked Feb 08 '18 05:02


1 Answer

Using awk

Try this command:

awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

As an example, consider this file:

$ cat novel
It was a "dark" and stormy
night; the rain fell in torrents.

$ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents

To save the output in the file words, redirect it:

awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

How it works:

  • gsub(/[[:punct:]]/, "")

    This tells awk to find any punctuation and replace it with an empty string.

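    [:punct:] is a character class that matches punctuation characters. Exactly which characters it covers depends on the current locale and on the awk implementation. In a UTF-8 locale, an awk with multibyte support (such as gawk) can also match non-ASCII punctuation, for example the many quote characters that Unicode defines.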

  • 1

    This is awk's shorthand for print-the-record.

  • RS='[[:space:]]'

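    This tells awk to treat every whitespace character as a record separator, so each word becomes a separate record and awk reads one word at a time. Consecutive whitespace characters produce empty records; RS='[[:space:]]+' treats a whole run of whitespace as one separator. A regular expression as RS is an extension supported by gawk and mawk, not something every awk provides.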
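Note that with RS set to whitespace, NR counts words rather than input lines, so the command above cannot report the line number that the question asks for. A minimal sketch of one way to get it, assuming the default record separator (one record per line) and a loop over the fields of each line. The function name words_with_lines is made up for this illustration:

```shell
# Hypothetical helper (not part of the original answer): print each word
# followed by a tab and the number of the line it came from.
words_with_lines() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/[[:punct:]]/, "", w)    # strip punctuation from this word only
            if (w != "")                  # skip fields that were pure punctuation
                printf "%s\t%d\n", w, NR  # with the default RS, NR is the line number
        }
    }' "$@"
}

# Same sample text as in the example above.
printf '%s\n' 'It was a "dark" and stormy' 'night; the rain fell in torrents.' |
    words_with_lines
```

To write the result to the file words, it could be called as words_with_lines novel >words.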

Counting the words

The usual approach for counting items in Unix is to use sort and uniq -c as follows:

$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
      1 one
      3 three
      2 two

Alternatively, awk can do it all:

$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
three 3
two 2
one 1
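The order in which for (w in a) visits the array is unspecified in awk, so the three lines above may come out in a different order on another system. A sketch of one way to make the listing deterministic, assuming an extra sort by count (descending) and then by word is acceptable. The wrapper name count_words is a placeholder:

```shell
# Hypothetical wrapper: count the words on stdin (or in the named files)
# and list them with the most frequent word first.
count_words() {
    awk '{
        gsub(/^[[:punct:]]|[[:punct:]]$/, "")
        if ($0 != "") a[$0]++              # ignore empty records
    }
    END { for (w in a) print w, a[w] }' RS='[[:space:]]' "$@" |
        sort -k2,2nr -k1,1                 # by count descending, ties by word
}

echo 'one two two three three three' | count_words
```

For this input the sorted listing is three 3, two 2, one 1, one entry per line.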

Alternate awk method

Andriy Makukha suggests that we might not want to remove punctuation from within a word like the single quote in I've. Similarly, we might not want to remove the periods from within a URL so that google.com stays google.com. To remove punctuation only if it is at the beginning or end of a word, we would replace the gsub command with:

gsub(/^[[:punct:]]|[[:punct:]]$/, "")

For example:

$ echo "I've got 'google.com'" | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]'
I've
got
google.com
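The pattern above removes at most one punctuation character from each end, so a word written as "dark", (quote, word, quote, comma) would keep its closing quote. A sketch of a variant that strips whole runs of leading and trailing punctuation by adding + to each bracket expression, assuming an awk that accepts a regular expression as RS. The function name trim_punct is a placeholder:

```shell
# Hypothetical helper: one word per line, with runs of punctuation removed
# from both ends of each word but kept inside it.
trim_punct() {
    awk '{
        gsub(/^[[:punct:]]+|[[:punct:]]+$/, "")
        if ($0 != "") print                # drop records that were only punctuation
    }' RS='[[:space:]]+' "$@"
}

printf '%s\n' '"Hello," she said... (twice!) about google.com' | trim_punct
```

Here Hello, said and twice lose the punctuation around them, while google.com keeps its inner period.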

Using sed

This sed command will remove all punctuation and put each word on a separate line:

sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel

Running it on the example novel file shown earlier gives:

$ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents

If you want the words saved in file words, then try:

sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words

How it works:

  • s/[[:punct:]]//g

    This tells sed to find any occurrence of punctuation and replace it with nothing. Again, we use [:punct:] because, in a UTF-8 locale, it can also cover non-ASCII punctuation characters.

  • s/[[:space:]]/\n/g

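    This tells sed to replace each whitespace character with a newline. sed reads one line at a time with the trailing newline removed, so in practice this matches the spaces and tabs within a line. The \n in the replacement is understood by GNU sed but not by every sed.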
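Since the substitution acts on one whitespace character at a time, two spaces in a row produce an empty line in the output. A sketch of a variant that collapses each run of whitespace into a single newline, assuming GNU sed (for -E and for \n in the replacement). The function name split_words is invented for this example:

```shell
# Hypothetical helper; assumes GNU sed.
split_words() {
    # First delete punctuation, then turn each run of whitespace into one newline.
    sed -E 's/[[:punct:]]//g; s/[[:space:]]+/\n/g' "$@"
}

# Input with repeated spaces and a free-standing dash.
printf '%s\n' 'It  was a - "dark"   night;' | split_words
```

Because punctuation is deleted before the whitespace is collapsed, the free-standing dash leaves no empty line behind.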

John1024 answered Nov 15 '22 06:11