In one of the example .prop files coming with the Stanford NER software there are two options I do not understand:
useDistSim = true
distSimLexicon = /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters
Does anyone have a hint what DistSim stands for and where I can find any more documentation on how to use these options?
UPDATE: I just found out that DistSim means distributional similarity. I still wonder what that means in this context.
"DistSim" refers to using features based on word classes/clusters, built using distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words which are similar, semantically and/or syntactically, and allow an NER system to generalize better, including handling words not in the training data of the NER system better. Many of our distributed models use a distributional similarity clustering features as well as word identity features, and gain significantly from doing so. In Stanford NER, there are a whole bunch of flags/properties that affect how distributional similarity is interpreted/used: useDistSim
, distSimLexicon
, distSimFileFormat
, distSimMaxBits
, casedDistSim
, numberEquivalenceDistSim
, unknownWordDistSimClass
, and you need to look at the code in NERFeatureFactory.java
to decode the details, but in the simple case, you just need the first two, and they need to be used while training the model, as well as at test time. The default format of the lexicon is just a text file with a series of lines with two tab separated columns of word clusterName
. The cluster names are arbitrary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With