
Is it possible to train the Stanford NER system to recognize more named entity types?

I'm currently using some NLP libraries (Stanford and NLTK). I've seen the Stanford NER demo, and I want to ask whether it is possible to use it to identify more entity types.

Currently, the Stanford NER system (as the demo shows) can recognize entities as person (name), organization, or location. But the organizations it recognizes are limited to universities or some big organizations. I'm wondering whether I can use its API to write a program that handles more entity types, so that if my input is "Apple" or "Square", it is recognized as a company.

Do I have to make my own training dataset?

Furthermore, if I ever want to extract entities and the relationships between them, I feel I should use the Stanford dependency parser. That is, first extract the named entities and other parts tagged as "noun", then find relations between them.

Am I correct?

Thanks.

JudyJiang asked Mar 03 '14 22:03


2 Answers

Yes, you need your own training set. The pre-trained Stanford models recognise a word such as "Stanford" as a named entity only because they were trained on data in which that word (or words that look very similar under the feature set they use; I don't know exactly what that is) was marked as a named entity.

Once you have more data, you need to put it in the right format described in this question and the Stanford tutorial.
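As a rough illustration of that format, here is a minimal Python sketch that turns hand-labelled tokens into the column layout the Stanford NER FAQ describes: one token per line, followed by a tab and its label. The `COMPANY` label, the example sentences, and the function name `format_training_data` are all made up for illustration. Putting a blank line between sentences is an assumption borrowed from CoNLL-style files, so check the FAQ before relying on it.

```python
def format_training_data(sentences):
    """Return tab-separated "token<TAB>label" lines for labelled sentences.

    `sentences` is a list of sentences, each a list of (token, label) pairs.
    A blank line is emitted after each sentence (assumed convention).
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # blank separator line between sentences
    return "\n".join(lines) + "\n"


# Hypothetical hand-labelled data: COMPANY is an illustrative label,
# and O marks tokens that are not part of any entity.
SENTENCES = [
    [("Apple", "COMPANY"), ("hired", "O"), ("new", "O"), ("engineers", "O"), (".", "O")],
    [("Square", "COMPANY"), ("processes", "O"), ("payments", "O"), (".", "O")],
]

if __name__ == "__main__":
    # Write the training file that a trainer could later be pointed at.
    with open("train.tsv", "w", encoding="utf-8") as handle:
        handle.write(format_training_data(SENTENCES))
```

A model trained on only two sentences would be useless in practice; the point is just the file layout.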

mbatchkarov answered Sep 21 '22 21:09


You could easily train a model on your own corpus of data.

The first question in the Stanford NER FAQ explains how to train your own NER model.

The link is http://nlp.stanford.edu/software/crf-faq.shtml

For example, you could give training data like this, with one token and its label per line:

Product OBJ
of O
Microsoft ORG

Likewise, you could build your own training data, train a model on it, and then use that model to get the desired output.
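To make the "build a model" step a little more concrete, here is a hedged Python sketch that generates a properties file and the training command along the lines of the example in the Stanford NER FAQ. The property names and the `edu.stanford.nlp.ie.crf.CRFClassifier -prop` invocation follow that FAQ as I remember it. The file names, the jar path, and the helper names `build_properties` and `training_command` are assumptions, so compare against the FAQ before running anything.

```python
def build_properties(train_file, model_file):
    """Return the text of a CRFClassifier properties file.

    The feature flags mirror the FAQ's sample configuration; treat them
    as a starting point, not as tuned settings.
    """
    props = {
        "trainFile": train_file,      # tab-separated token/label file
        "serializeTo": model_file,    # where the trained model is saved
        "map": "word=0,answer=1",     # column 0 is the token, column 1 the label
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "maxLeft": "1",
        "useTypeSeqs": "true",
        "useTypeSeqs2": "true",
        "useTypeySequences": "true",
        "wordShape": "chris2useLC",
    }
    return "".join(f"{key} = {value}\n" for key, value in props.items())


def training_command(prop_path, jar_path="stanford-ner.jar"):
    """Return the java command (as an argument list) that trains the model."""
    return [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own layout.
    with open("ner.prop", "w", encoding="utf-8") as handle:
        handle.write(build_properties("train.tsv", "my-ner-model.ser.gz"))
    print(" ".join(training_command("ner.prop")))
```

The sketch only prints the command instead of running it, since it cannot know where Java and the Stanford NER jar are installed on your machine.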

Rohan Amrute answered Sep 19 '22 21:09