How to combine all the words of a sentence extracted with a regex?

Question

I would like to use a Linux command, if possible, to pair up all the words in a sentence that start with a capital letter, excluding the word at the beginning of the line. The goal is to create edges between these words. For example:

My friend John met Beatrice and Lucio.

The result I would like is:

John, Beatrice
John, Lucio
Beatrice, Lucio

I managed to extract all the words that start with a capital letter, excluding the word at the beginning of the line, using grep with a regex. The command is:

cat gov.json | grep -oP "\b([A-Z][a-z']*)(\s[A-Z][a-z']*)*\b | ^(\s*.*?\s).*" > nodes.csv

This writes the nodes to the file one per line, in a single column, i.e.:

John
Beatrice
Lucio

The goal now is to create the possible combinations between names that start with a capital letter and put them into a file. Any suggestions?

Ed Morton · Accepted Answer

If the order of the pairs in the output doesn't matter:

$ cat tst.awk
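# Split on runs of non-letters, so every field is a run of letters.
BEGIN { FS="[^[:alpha:]]+"; OFS=", " }
{
    # Start at field 2 to skip the word at the beginning of the line.
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            words[$i]    # referencing the element is enough to create it
        }
    }
}
END {
    for (word1 in words) {
        for (word2 in words) {
            if (word1 != word2) {
                print word1, word2
            }
        }
        # Remove word1 so it is not paired again in the reverse direction.
        delete words[word1]
    }
}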

$ awk -f tst.awk file
Beatrice, Lucio
Beatrice, John
Lucio, John
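
The order in which `for (word in words)` visits array elements is unspecified in awk, so the pairs above may come out in a different order with another awk. If reproducible output is wanted, one possible variant (a sketch only, not part of the original answer) prints a pair only when word1 compares less than word2, which also makes the delete unnecessary, and then sorts the lines. The output name sorted_pairs.txt is a hypothetical choice, and the sample input is written to file here just to keep the example self-contained.

```shell
# Sample input taken from the question.
printf '%s\n' 'My friend John met Beatrice and Lucio.' > file

# Emit each unordered pair once (only when word1 sorts before word2),
# then sort the lines so the result does not depend on awk's array order.
# sorted_pairs.txt is a hypothetical output name.
awk '
BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
{
    for (i = 2; i <= NF; i++)
        if ($i ~ /^[[:upper:]]/)
            words[$i]
}
END {
    for (word1 in words)
        for (word2 in words)
            if (word1 < word2)
                print word1, word2
}' file | LC_ALL=C sort > sorted_pairs.txt

cat sorted_pairs.txt
```

With the sample sentence this should print Beatrice, John then Beatrice, Lucio then John, Lucio, i.e. alphabetical order rather than the order of appearance in the sentence.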

If the order does matter then:

$ cat tst.awk
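# Split on runs of non-letters, so every field is a run of letters.
BEGIN { FS="[^[:alpha:]]+"; OFS=", " }
{
    # Start at field 2 to skip the word at the beginning of the line.
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            # Store each distinct word once, in order of first appearance.
            if ( !seen[$i]++ ) {
                words[++numWords] = $i
            }
        }
    }
}
END {
    # Pair every word with each word that appeared after it.
    for (word1nr=1; word1nr<=numWords; word1nr++) {
        word1 = words[word1nr]
        for (word2nr=word1nr+1; word2nr<=numWords; word2nr++) {
            word2 = words[word2nr]
            print word1, word2
        }
    }
}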

$ awk -f tst.awk file
John, Beatrice
John, Lucio
Beatrice, Lucio

In the above, file contains the original input, e.g. My friend John met Beatrice and Lucio.
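
Both scripts above collect words from every line of the input before printing, so with a multi-line file the pairs would span sentences. If each line should be paired on its own and the edges written to a file, as the question asks, a minimal sketch along the same lines could look like the following. It is an assumption about what is wanted, not part of the original answer: edges.csv is a hypothetical output name and the second sample sentence is invented for illustration.

```shell
# Sample input: the sentence from the question plus one invented line.
printf '%s\n' 'My friend John met Beatrice and Lucio.' \
              'Yesterday Anna called John.' > file

# Pair the capitalized words of each line separately, keeping input order.
# edges.csv is a hypothetical output name.
awk '
BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
{
    n = 0
    split("", seen)                  # portable way to empty an array
    for (i = 2; i <= NF; i++) {      # field 1 (start of the line) is skipped
        if ($i ~ /^[[:upper:]]/ && !seen[$i]++) {
            words[++n] = $i          # distinct words of this line, in order
        }
    }
    for (a = 1; a <= n; a++) {
        for (b = a + 1; b <= n; b++) {
            print words[a], words[b]
        }
    }
}' file > edges.csv

cat edges.csv
```

For this input the file should contain the three pairs for John, Beatrice and Lucio followed by Anna, John. Note that a word containing an apostrophe, such as John's, is split at the apostrophe by this field separator.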

Tags: regex, linux, awk

Emanuele Bosimini
