How to combine all the words of a sentence extracted with a regex?

Tags: regex, linux, awk

I would like to use a Linux command, if possible, to combine all the words in a sentence that start with a capital letter, excluding the word at the beginning of the line. The goal is to create edges between these words. For example:

My friend John met Beatrice and Lucio.

The result I would like to have should be:

  • John, Beatrice
  • John, Lucio
  • Beatrice, Lucio

I managed to extract all the words that start with a capital letter, excluding the word at the beginning of the line, with a regex. The command is:

cat gov.json | grep -oP "\b([A-Z][a-z']*)(\s[A-Z][a-z']*)*\b | ^(\s*.*?\s).*" > nodes.csv

This writes the nodes to the file one per line, i.e.:

  • John
  • Beatrice
  • Lucio

The goal now is to create the possible combinations between names that start with a capital letter and put them into a file. Any suggestions?

Emanuele Bosimini asked Dec 10 '22 02:12


1 Answer

If the order of the pairs in the output doesn't matter:

$ cat tst.awk
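# Split fields on runs of non-letters; separate output fields with ", ".
BEGIN { FS="[^[:alpha:]]+"; OFS=", " }
{
    # Start at field 2 to skip the first word of the line.
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            words[$i]    # referencing the element creates it; keys are unique
        }
    }
}
END {
    for (word1 in words) {
        for (word2 in words) {
            if (word1 != word2) {
                print word1, word2
            }
        }
        # Drop word1 so that each pair is printed only once.
        delete words[word1]
    }
}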

$ awk -f tst.awk file
Beatrice, Lucio
Beatrice, John
Lucio, John
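
The order of a for (x in array) loop is not specified by awk, so the lines above may come out in a different order with another awk. A minimal sketch of a variant with reproducible output, assuming it is acceptable to order each pair alphabetically and to sort the result (unordered_pairs is a hypothetical helper name, not part of the answer above):

```shell
# unordered_pairs: hypothetical helper, reads text on stdin.
# Collects capitalized words (skipping the first word of each line) and
# prints each unordered pair once, in a reproducible order.
unordered_pairs() {
    awk '
    BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
    {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[[:upper:]]/) {
                words[$i]            # unique keys, order not preserved
            }
        }
    }
    END {
        for (a in words) {
            for (b in words) {
                # Array indices are strings, so this is a string comparison;
                # it keeps exactly one orientation of each pair.
                if (a < b) {
                    print a, b
                }
            }
        }
    }' | LC_ALL=C sort               # fix the line order regardless of awk
}

printf '%s\n' 'My friend John met Beatrice and Lucio.' | unordered_pairs
```

For the example sentence this prints Beatrice, John then Beatrice, Lucio then John, Lucio. Comparing the two keys replaces the delete inside the loop, which may be easier to reason about when the pairs are undirected edges.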

If the order does matter then:

$ cat tst.awk
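# Split fields on runs of non-letters; separate output fields with ", ".
BEGIN { FS="[^[:alpha:]]+"; OFS=", " }
{
    # Start at field 2 to skip the first word of the line.
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            # Store each word once, in order of first appearance.
            if ( !seen[$i]++ ) {
                words[++numWords] = $i
            }
        }
    }
}
END {
    # Pair each word with every word that appeared after it.
    for (word1nr=1; word1nr<=numWords; word1nr++) {
        word1 = words[word1nr]
        for (word2nr=word1nr+1; word2nr<=numWords; word2nr++) {
            word2 = words[word2nr]
            print word1, word2
        }
    }
}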

$ awk -f tst.awk file
John, Beatrice
John, Lucio
Beatrice, Lucio

In the above, file contains the original input, e.g. My friend John met Beatrice and Lucio.
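
Both scripts above collect words from the whole file before printing, so with several input lines they also pair words that come from different lines. If edges should only link words within the same sentence, a minimal sketch, assuming one sentence per input line (pairs_per_line is a hypothetical helper name, not part of the answer above):

```shell
# pairs_per_line: hypothetical helper, reads text on stdin.
# For every input line, prints each pair of capitalized words found on
# that line in order of appearance, skipping the first word of the line.
pairs_per_line() {
    awk '
    BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
    {
        n = 0
        split("", seen)              # portable way to empty an array
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[[:upper:]]/ && !seen[$i]++) {
                words[++n] = $i      # only slots 1..n are read below
            }
        }
        for (a = 1; a <= n; a++) {
            for (b = a + 1; b <= n; b++) {
                print words[a], words[b]
            }
        }
    }'
}

printf '%s\n' 'My friend John met Beatrice and Lucio.' \
              'Yesterday Anna called Marco.' | pairs_per_line
```

With these two lines the sketch prints the three John/Beatrice/Lucio pairs followed by Anna, Marco, and no pair that mixes the two sentences.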

Ed Morton answered Jan 29 '23 23:01