I need a shell command line that, given a text file "novel", writes each word on its own line together with the number of the line it appears on, saving the result in a file called "words". The problem is that the words must not contain punctuation marks. This is what I have:
$ awk '{for(i=1; i<=NF; ++i) {printf $i "\t" NR "\n", $0 > "words"}}' novel
The file contains:
$ cat novel
ver a don Quijote, y ellas le defendían la puerta:
-¿Qué quiere este mostrenco en esta casa?
Expected output:
ver 1
a 1
don 1
Quijote 1
...
puerta 1
Qué 2
...
casa 2
It's a very simple command for academic use.
Try this command:
awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words
As an example, consider this file:
$ cat novel
It was a "dark" and stormy
night; the rain fell in torrents.
$ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents
Or, to save the output in file words, use:
awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words
How it works:
gsub(/[[:punct:]]/, "")
This tells awk to find any punctuation and replace it with an empty string.
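[:punct:] is a character class that matches punctuation characters. Exactly which characters it covers depends on the current locale and on the awk implementation. In a UTF-8 locale with a multibyte-aware awk such as gawk, it should also match non-ASCII punctuation, for example the many types of quote characters that Unicode defines.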
1
This is awk's shorthand for print-the-record.
RS='[[:space:]]'
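This tells awk to treat each whitespace character as a record separator. Each word therefore becomes a separate record, and awk reads in one word at a time for processing. Two things follow from this: consecutive whitespace characters produce empty records, which are printed as blank lines, and NR counts words rather than lines of the input. Using a regular expression as RS is an extension supported by gawk and mawk; POSIX leaves multi-character RS values unspecified.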
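The question also asks for the line number next to each word. Here is a minimal sketch of one way to get it, assuming awk's default record separator is kept so that NR is still the input line number. The function name words_with_lines is a hypothetical helper, not part of the answer above, and the [[:punct:]] class is assumed to be supported by the installed awk.

```shell
# words_with_lines: hypothetical helper name used only for this sketch.
# Reads the files given as arguments (or stdin) and prints "word<TAB>line".
words_with_lines() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/[[:punct:]]/, "", w)     # strip punctuation from this word only
            if (w != "") print w "\t" NR   # skip fields that were pure punctuation
        }
    }' "$@"
}

# Example input, piped in so that no existing file is touched.
printf '%s\n' 'It was a "dark" and stormy' 'night; the rain fell in torrents.' |
    words_with_lines
```

To write the result to a file as in the question, something like `words_with_lines novel >words` should work. Because the loop runs over fields instead of changing RS, runs of whitespace do not produce empty entries.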
The usual approach for counting items in Unix is to use sort and uniq -c as follows:
$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
1 one
3 three
2 two
Alternatively, awk can do it all:
$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
three 3
two 2
one 1
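The order in which awk's for (w in a) loop visits the array is not specified, so the listing above may come out in a different order on another system. If a predictable order matters, a sketch along these lines sorts by descending count and then by word; count_words is a hypothetical helper name, and it expects one word per line on standard input.

```shell
# count_words: hypothetical helper name used only for this sketch.
# Input: one word per line. Output: "count word", most frequent first.
count_words() {
    sort |                 # group identical words together
    uniq -c |              # prefix each distinct word with its count
    sort -k1,1nr -k2       # count descending, ties broken by the word
}

printf '%s\n' one two two three three three | count_words
```

The exact amount of leading whitespace in front of each count comes from uniq -c and can differ between implementations.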
Andriy Makukha suggests that we might not want to remove punctuation from within a word, like the single quote in I've. Similarly, we might not want to remove the periods from within a URL, so that google.com stays google.com. To remove punctuation only if it is at the beginning or end of a word, we would replace the gsub command with:
gsub(/^[[:punct:]]|[[:punct:]]$/, "")
For example:
$ echo "I've got 'google.com'" | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]'
I've
got
google.com
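One caveat: that pattern removes a single punctuation character from each end, so a word such as "dark", (a closing quote followed by a comma) would keep its quote. A minimal sketch of a variant that trims whole runs follows, assuming it is acceptable to add + to each bracket expression; trim_words is a hypothetical helper name, and it loops over fields instead of setting RS.

```shell
# trim_words: hypothetical helper name used only for this sketch.
# Prints one word per line, trimming punctuation runs at both ends
# while leaving punctuation inside a word untouched.
trim_words() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", w)   # "+" matches a whole run
            if (w != "") print w
        }
    }' "$@"
}

printf '%s\n' "It was a \"dark\", I've seen 'google.com'." | trim_words
```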
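This sed command will remove all punctuation and put each word on a separate line. The \n in the replacement is understood by GNU sed; other sed implementations may not interpret it as a newline: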
sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
If we run this command on the example file shown above, we obtain:
$ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents
If you want the words saved in file words, then try:
sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words
How it works:
s/[[:punct:]]//g
This tells sed to find any occurrence of punctuation and replace it with nothing. Again, we use [:punct:] because, in a suitable locale, it can also cover punctuation characters outside ASCII.
s/[[:space:]]/\n/g
This tells sed to replace each whitespace character with a newline. A run of several whitespace characters therefore produces blank lines in the output.
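If the \n replacement or the blank lines are a problem, the same two steps can be sketched with tr instead. This is an alternative technique, not the sed answer itself; split_words is a hypothetical helper name, and the sketch assumes a tr that pads the shorter second operand, as the GNU and BSD versions do.

```shell
# split_words: hypothetical helper name used only for this sketch.
# Reads text on stdin and prints one punctuation-free word per line.
split_words() {
    tr -d '[:punct:]' |          # delete every punctuation character
    tr -s '[:space:]' '\n'       # turn whitespace into newlines, squeezing runs
}

printf '%s\n' 'It was a  "dark" and stormy' 'night; the rain fell in torrents.' |
    split_words
```

With the file from the question, `split_words <novel >words` would be the equivalent call. Like the sed version, this drops the line numbers, so it only covers the word-splitting half of the task.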