Training n-gram NER with Stanford NLP

Recently I have been trying to train n-gram entities with Stanford Core NLP. I have followed this tutorial - http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I am able to specify only unigram tokens and the class they belong to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please correct me in case I have misinterpreted the Stanford tutorial and it can already be used for n-gram training.

What I am stuck with is the following property:

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here the first column is the word (a unigram) and the second column is the entity, for example:

CHAPTER O
I   O
Emma    PERS
Woodhouse   PERS
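
For reference, a fuller properties file along the lines of the FAQ's austen.prop example looks roughly like this (file names are placeholders; note that useNGrams controls character n-gram features within a word, not multi-word entities):

trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true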

Training known single-token entities (say, movie names) like Hulk or Titanic as movies would be easy with this approach. But if I need to train multi-word names like I know what you did last summer or Baby's day out, what is the best approach?

Asked Mar 25 '13 by Arun A K

People also ask

How do you use the Stanford NER Tagger in Python?

Install NLTK. In a new file, import NLTK and add the file paths for the Stanford NER jar file and the model from above. I also imported StanfordNERTagger, which is the Python wrapper class in NLTK for the Stanford NER tagger. Next, initialize the tagger with the jar file path and the model file path.

How does NER work in NLP?

NER is an application of natural language processing (NLP) and its main goal is to extract relevant information from text data. It automatically classifies named entities according to predefined categories such as individuals, places, dates, etc.

How does Stanford NER work?

Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task.
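
For illustration only, a minimal Java sketch of loading a serialized CRF model and tagging text might look like this (the model path is a placeholder):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TagText {
    public static void main(String[] args) throws Exception {
        // Load a serialized model trained on labeled data (path is a placeholder).
        CRFClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("my-ner-model.ser.gz");

        // Tag free text; recognized entities come back wrapped in inline XML tags.
        String text = "Emma Woodhouse watched Titanic.";
        System.out.println(classifier.classifyWithInlineXML(text));
        // e.g. "<PERS>Emma Woodhouse</PERS> watched Titanic."
    }
}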

What is an n-gram in NLP?

N-grams are contiguous sequences of words, symbols, or tokens in a document. In technical terms, they can be defined as neighbouring sequences of items in a document. They come into play whenever we deal with text data in NLP (Natural Language Processing) tasks.
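
For illustration, a tiny self-contained Java helper that lists the n-grams of a token sequence (the sentence is just an example):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramExample {

    // Collect all contiguous n-grams of length n from a token list.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "know", "what", "you", "did", "last", "summer");
        System.out.println(ngrams(tokens, 2)); // bigrams: [I know, know what, what you, ...]
        System.out.println(ngrams(tokens, 3)); // trigrams
    }
}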


3 Answers

It was a long wait here for an answer. I was not able to figure out a way to get this done using Stanford CoreNLP, but mission accomplished nonetheless: I used the LingPipe NLP library instead. I am posting the answer here because I think someone else could benefit from it.

Please check the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or anything else.

LingPipe provides various NER methods:

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

I have used both the dictionary-based and the statistical approaches.

The first is a direct look-up methodology, while the second is training-based.

An example of dictionary-based NER can be found here.
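
A minimal sketch of LingPipe's exact dictionary chunker looks roughly like this (the movie titles and the MOVIE label are purely illustrative):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class DictionaryNer {
    public static void main(String[] args) {
        // Build an in-memory dictionary of known phrases and their categories.
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("Baby's Day Out", "MOVIE", 1.0));
        dictionary.addEntry(new DictionaryEntry<String>("I Know What You Did Last Summer", "MOVIE", 1.0));

        // Exact-match chunker: return all matches, case-insensitive.
        ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                dictionary, IndoEuropeanTokenizerFactory.INSTANCE, true, false);

        String text = "we watched baby's day out yesterday";
        Chunking chunking = chunker.chunk(text);
        for (Chunk c : chunking.chunkSet()) {
            System.out.println(text.substring(c.start(), c.end()) + " >> " + c.type());
        }
    }
}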

The statistical approach requires a training file. I used a file with the following format -

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>

I then used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt");// my annotated file
        File modelFile = new File("outputmodelfile.model"); 

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();  
        parser.setHandler( chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

And to test the NER I used the following class

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        // Load the model compiled by TrainEntities.
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);

        // Chunk the test string and print each recognized entity with its type.
        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> chunks = chunking.chunkSet();
        for (Chunk c : chunks) {
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code Courtesy : Google :)

Answered Nov 17 '22 by Arun A K

The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding, and assume that adjacent tokens of the same class are part of the same entity. In many circumstances, this is almost always true, and keeps the models simpler. However, if you don't want to do that you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things:

Emma    B-PERSON
Woodhouse    I-PERSON

Then, adjacent tokens of the same category but not the same entity can be represented.
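
For example, two person names appearing back to back in the text would still be kept separate (the second name is purely illustrative):

Emma    B-PERSON
Woodhouse    I-PERSON
George    B-PERSON
Knightley    I-PERSON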

Answered Nov 17 '22 by Christopher Manning


I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using RegexNER in the NLP pipeline, providing a mapping file with the regular expressions (the n-gram component terms) and their corresponding labels; a rough sketch follows below. Note that no NER machine learning is involved in this case. Hope this information helps someone!
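
As a rough sketch (the file name and the MOVIE label are placeholders), the mapping file contains tab-separated lines such as "I know what you did last summer" TAB "MOVIE", and the pipeline can be configured along these lines:

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class RegexNerExample {
    public static void main(String[] args) {
        // regexner runs after the standard NER annotator and overlays labels
        // from the mapping file (movies.txt is a placeholder path).
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.setProperty("regexner.mapping", "movies.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("We watched I know what you did last summer.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + " -> "
                    + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }
}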

Answered Nov 18 '22 by Neethu Prem