Remove the first capitalized word after a period

Question

I would like to remove the first capitalized word after each period, even when a line contains two sentences. As the example below shows, my script already omits the first word of each line, but the first word of the second sentence on the line still appears in the output.

For the first sentence of each line, I solved the problem by starting the for loop at 2 instead of 1. Here's the code:

BEGIN { FS="[^[:alpha:]']+"; OFS=" "} 
{
   parola=" "
   max_nr=0

   prec=""

   for (i=2; i<=NF; i++) {
        if ($i ~ /[[:punct:][:digit:]]+[:space:]*[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){
            continue
        }
        else{
            if ($i ~ /[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){

                if(!(prec=="")){

                    prec=prec" "$i
                }
                else{
                    prec=$i              
                }
            }     
            else {

                if(!(prec=="")){

                    words[prec]
                    prec=""    
                  }
            }

            if (i==NF) {
                max_nr=max_nr+1  
                for (word1 in words) {
                    for (word2 in words) {
                        if (word1 != word2) {
                            print parola"" word1","word2
                        }
                    }

                    delete words[word1]
                }                
            }
            }
}  
}   
END{
    print FILENAME" "FNR
    print i
    print max_nr
}

This is the content of test.txt:

Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda. 
Paolo went to Lisbon with an Easyjet plane. During the trip he met two of his dear friends, Peter and John.

This is the result of the command:

awk -f script.awk test.txt > output.csv

Lisbon,During
Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
During,John
During,Peter
During,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin After
Jonathan,Lemon Soda
Jonathan,Martin
Martin After,Lemon Soda
Martin After,Martin
Lemon Soda,Martin

The expected output should be:

Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin
Martin,Lemon Soda
Jonathan,Lemon Soda

Any suggestions?

Ed Morton · Accepted Answer

Not trying to do the whole job for you (I provided a solution for that previously), just solving the specific problem you asked about in this question:

You're using FS="[^[:alpha:]']+", so given any field ("word") there's no way to tell whether the separator before it was a . or something else. Use FS='[.]' or similar as your starting point; then you know that each field was preceded by either the start of the line or a ., and you can use split($i,f,/[^[:alpha:]']+/) to isolate each sub-field ("word") within that field ("sentence"), e.g.:

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        sentence = $sentenceNr
        numWords = split(sentence,words,/[^[:alpha:]\047]+/)
        for (wordNr=2; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                print NR, sentenceNr, wordNr, word
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John

Note that given this input:

$ cat file
Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda.
Paolo went to Lisbon with an EasyJet plane. During the trip he met two of his dear friends, Peter and John.
May lost her home. 10 Downing St is where the PM lives.

the above would output:

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 2 Downing
3 2 3 St
3 2 7 PM

If "Downing" shouldn't be there, then change the code to:

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        numWords = split($sentenceNr,words,/[^[:alpha:]\047]+/)
        isSubsequent = 0
        for (wordNr=1; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                if ( isSubsequent++ ) {
                    print NR, sentenceNr, wordNr, word
                }
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 3 St
3 2 7 PM
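
As a rough sketch of how the second script above might be extended towards the comma-separated pairs the question asks for: collect the runs of adjacent capitalized words on each line (skipping the first capitalized word of every sentence) and print each pair once. The helper addName and the arrays names and seen are hypothetical names introduced here for illustration, not part of the original answer. Because split() discards punctuation, two capitalized words separated only by a comma would be joined into one name, which this sketch does not try to handle.

```shell
# Sample input: the two lines of test.txt from the question.
pairs=$(printf '%s\n' \
    'Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda.' \
    'Paolo went to Lisbon with an Easyjet plane. During the trip he met two of his dear friends, Peter and John.' |
awk '
# Hypothetical helper: store the pending run of capitalized words, once per line.
function addName() {
    if (name != "" && !(name in seen)) { seen[name]; names[++n] = name }
    name = ""
}
BEGIN { FS = "[[:space:]]*[.][[:space:]]*" }
{
    n = 0; split("", seen)                 # reset the per-line state
    for (sentenceNr = 1; sentenceNr <= NF; sentenceNr++) {
        numWords = split($sentenceNr, words, /[^[:alpha:]\047]+/)
        isSubsequent = 0
        for (wordNr = 1; wordNr <= numWords; wordNr++) {
            word = words[wordNr]
            if (word ~ /^[[:upper:]]/) {
                # Skip the first capitalized word of the sentence; join the rest into runs.
                if (isSubsequent++) name = (name == "" ? word : name " " word)
            } else {
                addName()
            }
        }
        addName()
    }
    # Print every unordered pair of distinct names found on this line.
    for (i = 1; i < n; i++)
        for (j = i + 1; j <= n; j++)
            print names[i] "," names[j]
}')
printf '%s\n' "$pairs"
```

For the sample input this should print the same nine unordered pairs as the expected output in the question, in order of first appearance rather than in awk's arbitrary "for (x in array)" order.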

kvantour · Answer

The following assumes that your text follows the most basic rule of punctuation: a punctuation character is followed by a space. Given that, you can use GNU awk to extract the words you are interested in very easily by defining record and field patterns. A record is assumed to be a sentence, which ends with any of the characters ., ? or !. Capitalized words are recognized by the pattern [[:upper:]][[:alnum:]]*. So now it is easy:

awk 'BEGIN{ RS="[.?!][[:space:]]*"; FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+"}
     { print "record",NR,":",$0 }
     { for(i=1;i<=NF;++i) print "field",i,":",$i }' file

Here, the record separator RS also consumes any whitespace from the [[:space:]] class that follows the punctuation. This ensures that the first word of each record has no space in front of it, so it does not match the field pattern. All other capitalized words are then picked up by the field pattern FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+", which represents a sequence of whitespace-separated capitalized words. Note that each field therefore starts with a blank or newline character; this is easily cleaned up with a simple substitution, as done in the adapted version further down.

The command above outputs:

record 1 : Today Jonathan played soccer with Martin
field 1 :  Jonathan
field 2 :  Martin
record 2 : After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda
field 1 :  Martin
field 2 :  Jonathan
field 3 :  Lemon Soda
record 3 : Paolo went to Lisbon with an Easyjet plane
field 1 :  Lisbon
field 2 :  Easyjet
record 4 : During the trip he met two of his dear friends, Peter and John
field 1 :  Peter
field 2 :  John

This can now be adapted to the OP's problem (with the leading-space correction for the fields):

awk 'BEGIN{ RS="[.?!][[:space:]]*"; FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+"}
     { for (i=1;i<=NF;++i) { 
           w=$i; gsub(/[[:space:]]+/," ",w);
           w=substr(w,2); words[w]
       }
     }
     { for (w1 in words) { 
           for (w2 in words) if(w1 != w2) print w1,w2
           delete words[w1]
        }
     }' file

returns:

Jonathan Martin
Jonathan Lemon Soda
Jonathan Martin
Lemon Soda Martin
Lisbon Easyjet
John Peter
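
FPAT is specific to GNU awk, and a multi-character regular-expression RS is not required by POSIX. As a hedged sketch for other awks, the field extraction could be approximated by calling match() in a loop on each sentence. The variable names rest and field are introduced here for illustration only, and the sample sentence is the second record from the output above.

```shell
fields=$(printf '%s\n' 'After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda' |
awk '{
    rest = $0
    # Repeatedly take the leftmost match of the FPAT-style pattern.
    while (match(rest, /([[:space:]]+[[:upper:]][[:alnum:]]*)+/)) {
        field = substr(rest, RSTART, RLENGTH)
        gsub(/[[:space:]]+/, " ", field)     # normalize internal whitespace
        print substr(field, 2)               # drop the leading blank
        rest = substr(rest, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$fields"
```

The first word of the sentence is skipped for the same reason as with FPAT: it has no whitespace in front of it, so the pattern cannot match there.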

Tags: linux, bash, awk

Emanuele Bosimini