Stanford NER prop file meaning of DistSim

Question

One of the example .prop files that ship with the Stanford NER software contains two options I do not understand:

useDistSim = true
distSimLexicon = /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters

Does anyone know what DistSim stands for, and where I can find more documentation on how to use these options?

UPDATE: I just found out that DistSim means distributional similarity. I still wonder what that means in this context.

Christopher Manning · Accepted Answer

"DistSim" refers to features based on word classes (clusters) built with distributional similarity clustering methods (e.g., Brown clustering, exchange clustering). Word classes group words that are similar semantically and/or syntactically, which lets an NER system generalize better, in particular to words that do not appear in its training data. Many of our distributed models use distributional similarity cluster features alongside word identity features, and gain significantly from doing so.

In Stanford NER, a whole bunch of flags/properties affect how distributional similarity is interpreted and used: useDistSim, distSimLexicon, distSimFileFormat, distSimMaxBits, casedDistSim, numberEquivalenceDistSim, unknownWordDistSimClass. To decode the details you need to look at the code in NERFeatureFactory.java. In the simple case you need only the first two, and they must be set when training the model as well as at test time.

The default lexicon format is a plain text file in which each line has two tab-separated columns: the word and its cluster name. The cluster names are arbitrary.
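To make the lexicon format and the generalization effect concrete, here is a minimal Python sketch. It is not Stanford NER code: the function names, the `WORD=`/`DISTSIM=` feature strings, the `UNK` fallback class, and the sample words and cluster names are all made up for illustration. It only assumes the two-column, tab-separated format described above.

```python
import io


def load_distsim_lexicon(lines):
    """Parse lines of 'word<TAB>clusterName' into a dict (hypothetical helper)."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word, cluster = line.split("\t", 1)
        lexicon[word] = cluster
    return lexicon


def token_features(word, lexicon, unknown_class="UNK"):
    """Return a word identity feature plus a cluster feature.

    The feature names and the unknown_class default are illustrative
    assumptions, not the names Stanford NER uses internally.
    """
    return [f"WORD={word}", f"DISTSIM={lexicon.get(word, unknown_class)}"]


# Made-up sample lexicon; cluster names are arbitrary labels.
SAMPLE = "London\t0110\nParis\t0110\nMonday\t1011\n"
lexicon = load_distsim_lexicon(io.StringIO(SAMPLE))

london = token_features("London", lexicon)
paris = token_features("Paris", lexicon)
unseen = token_features("Zzyzx", lexicon)
```

In this sketch "London" and "Paris" get different word identity features but the same cluster feature, so a model that only saw "London" in its NER training data can still carry over what it learned to "Paris" through the shared cluster. A word missing from the lexicon falls back to a single placeholder class.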

Tags:

nlp

stanford-nlp

named-entity-recognition

titusn
