I would like to combine, with a Linux command if possible, all the words in a line that start with a capital letter, excluding the word at the beginning of the line. The goal is to create edges between these words. For example:
My friend John met Beatrice and Lucio.
The result I would like to have should be:
I managed to get all the words that start with a capital letter, excluding the word at the beginning of the line through a regex. The regex is:
cat gov.json | grep -oP "\b([A-Z][a-z']*)(\s[A-Z][a-z']*)*\b | ^(\s*.*?\s).*" > nodes.csv
I managed to write the nodes out individually, one per line in a single column, i.e.:
The goal now is to create all possible combinations of the names that start with a capital letter and put them into a file. Any suggestions?
If the order of the pairs in the output doesn't matter:
$ cat tst.awk
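BEGIN { FS="[^[:alpha:]]+"; OFS=", " }
{
    # Start at field 2 to skip the first word of the line
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            words[$i]    # referencing the key stores each word once
        }
    }
}
END {
    for (word1 in words) {
        for (word2 in words) {
            if (word1 != word2) {
                print word1, word2
            }
        }
        # Drop word1 so each unordered pair is printed only once
        delete words[word1]
    }
}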
$ awk -f tst.awk file
Beatrice, Lucio
Beatrice, John
Lucio, John
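Note that awk does not specify the order in which for (word in words) visits the keys, so the three pairs above may come out in a different order, or with the two names of a pair swapped, under another awk. If a reproducible, alphabetical listing is enough, the following is a minimal sketch of one way to get it, not part of the script above: it prints a pair only when the first name compares lower than the second and pipes the result through sort. The variable name pairs is just for illustration.

```shell
# Sketch: print each unordered pair once, in alphabetical order, so the
# output does not depend on the awk implementation's array traversal order.
pairs=$(printf '%s\n' 'My friend John met Beatrice and Lucio.' |
awk '
BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
{
    # Start at field 2 to skip the first word of the line
    for (i = 2; i <= NF; i++)
        if ($i ~ /^[[:upper:]]/)
            words[$i]
}
END {
    # Array keys are strings, so a < b is a string comparison
    for (a in words)
        for (b in words)
            if (a < b)
                print a, b
}' | sort)

printf '%s\n' "$pairs"
```

For the sample sentence this prints Beatrice, John then Beatrice, Lucio then John, Lucio.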
If the order does matter, i.e. the pairs should follow the order in which the names appear in the input, then:
$ cat tst.awk
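BEGIN { FS="[^[:alpha:]]"; OFS=", " }
{
    # Start at field 2 to skip the first word of the line
    for (i=2; i<=NF; i++) {
        if ($i ~ /^[[:upper:]]/) {
            # Keep only the first occurrence, in order of appearance
            if ( !seen[$i]++ ) {
                words[++numWords] = $i
            }
        }
    }
}
END {
    # Pair each word with every word that appeared after it
    for (word1nr=1; word1nr<=numWords; word1nr++) {
        word1 = words[word1nr]
        for (word2nr=word1nr+1; word2nr<=numWords; word2nr++) {
            word2 = words[word2nr]
            print word1, word2
        }
    }
}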
$ awk -f tst.awk file
John, Beatrice
John, Lucio
Beatrice, Lucio
In the above, file contains the original input, e.g. My friend John met Beatrice and Lucio.
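Both scripts collect the names from the whole input before pairing them, so with a multi-line file a name on one line is also paired with names on other lines. If edges should only link names that occur on the same line, a possible variation (a sketch under that assumption, not something the question states) is to build and print the pairs per line. The second input line and the variable name edges below are made up for illustration.

```shell
# Sketch: pair capitalized words per line, in order of appearance.
edges=$(printf '%s\n' \
    'My friend John met Beatrice and Lucio.' \
    'Yesterday Anna called John.' |
awk '
BEGIN { FS = "[^[:alpha:]]+"; OFS = ", " }
{
    n = 0
    split("", seen)                 # portable way to empty an array
    # Start at field 2 to skip the first word of the line
    for (i = 2; i <= NF; i++)
        if ($i ~ /^[[:upper:]]/ && !seen[$i]++)
            names[++n] = $i
    # Pair each name with every name after it on the same line
    for (a = 1; a <= n; a++)
        for (b = a + 1; b <= n; b++)
            print names[a], names[b]
}')

printf '%s\n' "$edges"
```

To get the result into a file, as the question asks, redirect the output of any of these awk commands, e.g. awk -f tst.awk file > edges.csv, where edges.csv is a hypothetical file name.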