WordNet synsets using Perl

I installed WordNet::Similarity and WordNet::QueryData as an easy way to calculate the information content scores and probabilities that come with these modules. But I'm stuck on this basic problem: given a word, print n words similar to it - which should be no more difficult than iterating through the synsets and doing a join.

Using the wn command and piping it through a whole lot of tr, sort, and uniq, I can get all the words:

 wn cat -synsn | grep -v Sense | tr '=' ' ' | tr '>' ' ' | tr '\t' ' ' | tr ',' '\n' | sort | uniq

OUTPUT

8 senses of cat                                                         
adult female
adult male
African tea
Arabian tea
big cat
bozo
cat
cat
CAT
Caterpillar
cat-o'-nine-tails
 computed axial tomography
computed tomography
computerized axial tomography
computerized tomography
CT
excitant
felid
      feline
      gossip
gossiper
gossipmonger
guy
hombre
kat
khat
      man
newsmonger
qat
quat
rumormonger
rumourmonger
      stimulant
stimulant drug
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun cat
      tracked vehicle
true cat
      whip
      woman
X-radiation
      X-raying

But it's kinda nasty, and needs further cleanup.

My script is below. What I want to get is all the words in cat#n#1 through cat#n#8.

SCRIPT

use WordNet::QueryData;

my $wn = WordNet::QueryData->new( noload => 1);

print "Senses: ", join(", ", $wn->querySense("cat#n")), "\n";
print "Synset: ", join(", ", $wn->querySense("cat", "syns")), "\n";
print "Hyponyms: ", join(", ", $wn->querySense("cat#n#1", "hypo")), "\n";

OUTPUT:

Senses: cat#n#1, cat#n#2, cat#n#3, cat#n#4, cat#n#5, cat#n#6, cat#n#7, cat#n#8
Synset: cat#n, cat#v
Hyponyms: domestic_cat#n#1, wildcat#n#3

SCRIPT

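use strict;
use warnings;
use WordNet::QueryData;

my $wn = WordNet::QueryData->new;

foreach my $word ("cat#n") {
    # querySense on word#pos returns every sense, here cat#n#1 .. cat#n#8
    my @senses = $wn->querySense($word);

    foreach my $wps (@senses) {
        # "syns" returns the members of the synset this sense belongs to
        my @synset = $wn->querySense($wps, "syns");
        print "$wps : @synset\n";
    }
}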

OUTPUT:

cat#n#1 : cat#n#1 true_cat#n#1
cat#n#2 : guy#n#1 cat#n#2 hombre#n#1 bozo#n#2
cat#n#3 : cat#n#3
cat#n#4 : kat#n#1 khat#n#1 qat#n#1 quat#n#1 cat#n#4 Arabian_tea#n#1 African_tea#n#1
cat#n#5 : cat-o'-nine-tails#n#1 cat#n#5
cat#n#6 : Caterpillar#n#2 cat#n#6
cat#n#7 : big_cat#n#1 cat#n#7
cat#n#8 : computerized_tomography#n#1 computed_tomography#n#1 CT#n#2 computerized_axial_tomography#n#1 computed_axial_tomography#n#1 CAT#n#8
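
The per-sense lists above can be flattened into a plain word list without leaving Perl. The following is only a minimal sketch: `plain_words` is a hypothetical helper (not part of WordNet::QueryData), and it assumes the entries always have the word#pos#sense shape seen in the output above. The sample strings are copied from that output so the block runs without WordNet installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WordNet::QueryData): turn word#pos#sense
# strings into plain words, dropping duplicates and keeping first-seen order.
sub plain_words {
    my @wps = @_;
    my (%seen, @words);
    for my $entry (@wps) {
        my ($word) = split /#/, $entry;   # "big_cat#n#1" -> "big_cat"
        $word =~ tr/_/ /;                 # "big_cat"     -> "big cat"
        push @words, $word unless $seen{$word}++;   # case-sensitive, so CAT survives
    }
    return @words;
}

# Sample input copied from the script output above.
my @sample = (
    'cat#n#1', 'true_cat#n#1',
    'guy#n#1', 'cat#n#2', 'hombre#n#1', 'bozo#n#2',
    'big_cat#n#1', 'cat#n#7',
);
print join("\n", plain_words(@sample)), "\n";
```

With WordNet installed, the input could presumably come straight from the calls already used in the script, for example `map { $wn->querySense($_, "syns") } $wn->querySense("cat#n")`.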

P.S. I have never written Perl before, but have been looking into Perl scripts since morning - and can now understand the basic stuff. I just need to know if there is a cleaner way to do this using the API - I couldn't figure it out from the API docs or the user group archives.

Update:

I think I'll settle with:

 wn cat -synsn | sed '1,6d' |sed 's/Sense [[:digit:]]//g' | sed 's/[[:space:]]*=> //' | sed '/^$/d'

sed rocks!

asked Aug 15 '11 22:08

Tathagata

1 Answer

I think you'll find the following helpful...

http://marimba.d.umn.edu/WordNet-Pairs/

What are the N most similar words to X, according to WordNet?

This data seeks to answer that question, where similarity is based on measures from WordNet::Similarity. http://wn-similarity.sourceforge.net

-------------- verb data

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are available, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 - 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.
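
To get "the N most similar" entries out of one of these per-sense files, something along the following lines might work. This is a rough sketch under stated assumptions: the exact file layout is not described here beyond "a header line, then one sense and its similarity per line", so the whitespace-separated `sense score` format is a guess, `top_n` is a hypothetical helper, and the sample values are made up purely so the block runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n highest-scoring [sense, score] pairs
# read from $fh. Assumes a header line followed by "sense score" lines
# separated by whitespace; the real files may be laid out differently.
sub top_n {
    my ($fh, $n) = @_;
    my $header = <$fh>;                      # first line is not a pair
    my @pairs;
    while (my $line = <$fh>) {
        my ($sense, $score) = split ' ', $line;
        next unless defined $score;          # skip blank or malformed lines
        push @pairs, [$sense, $score];
    }
    my @sorted = sort { $b->[1] <=> $a->[1] } @pairs;
    $n = @sorted if $n > @sorted;            # never ask for more than we have
    return @sorted[0 .. $n - 1];
}

# Made-up sample data standing in for one per-sense file.
my $sample = "run#v#1\nrun#v#1 1\njog#v#3 0.5\nsleep#v#1 0.1\nsprint#v#1 0.3333\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
printf "%s %s\n", @$_ for top_n($fh, 2);
close $fh;
```

For a real file, the in-memory handle would be replaced by an ordinary `open` on the extracted file for the sense of interest.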

-------------- noun data

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.
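
The verb and noun counts quoted above can be checked with a few lines of arithmetic. In this small sketch, `pair_counts` is a hypothetical helper; it counts one value per sense in each of the n files, and treats "unique" as unordered pairs of distinct senses, which is one reasonable reading of the symmetry remark.

```perl
use strict;
use warnings;

# Hypothetical helper: for n senses, return the number of stored values
# (n files with n values each) and the number of unordered pairs of
# distinct senses, since sim(A,B) = sim(B,A).
sub pair_counts {
    my ($n) = @_;
    my $ordered   = $n * $n;
    my $unordered = $n * ($n - 1) / 2;
    return ($ordered, $unordered);
}

for my $case (['verb', 25_047], ['noun', 146_312]) {
    my ($pos, $n) = @$case;
    my ($ordered, $unordered) = pair_counts($n);
    print "$pos: $n senses, $ordered values, $unordered unique pairs\n";
}
```

The results land close to the rounded figures in the text: roughly 627 million values (about 314 million unique) for verbs, and roughly 21.4 billion values (about 10.7 billion unique) for nouns.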

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.

answered Oct 11 '22 08:10

Ted Pedersen

Tags: perl, wordnet