
How to use spaCy to create a new entity and learn only from keyword list

I am trying to use spaCy to create a new entity type, 'SPECIES', from a list of species names; an example can be found here.

I found a spaCy tutorial on training a new entity type (GitHub code here). However, the problem is that I don't want to manually write a sentence for each species name, as that would be very time-consuming.

I created the training data below, which looks like this:

TRAIN_DATA = [('Bombina',{'entities':[(0,6,'SPECIES')]}),
 ('Dermaptera',{'entities':[(0,9,'SPECIES')]}),
  .... 
]

This is how I created the training set: instead of providing a full sentence and the location of the matched entity, I provide only the name of each species, and the start and end indices are generated programmatically:

[( 0, 6, 'SPECIES' )]

[( 0, 9, 'SPECIES' )]

Below is the training code I used (copied from the link above):

import random

import spacy

# LABEL, model and n_iter are not defined in this excerpt;
# they come from the tutorial script this code was copied from.

nlp = spacy.blank('en')  # create blank Language class

# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
    ner = nlp.get_pipe('ner')

ner.add_label(LABEL)   # add new entity label to entity recognizer

if model is None:
    optimizer = nlp.begin_training()
else:
    # Note that 'begin_training' initializes the models, so it'll zero out
    # existing entity types.
    optimizer = nlp.entity.create_optimizer()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(losses)

I'm new to NLP and spaCy, so please let me know whether I did this correctly, and why my attempt at training failed (when I ran it, it threw an error).


[UPDATE]

The reason I want to feed only keywords to the training model is that, ideally, I would hope the model learns those keywords first, and once it identifies a context which contains a keyword, it will learn the associated context and thereby enhance the current model.

At first glance, this looks more like a regular expression. But as more and more data is fed in, the model will keep learning, and will eventually be able to identify new species names that did not exist in the original training set.


Thanks, Katie

katie lu asked May 29 '18 08:05





1 Answer

The advantage of training the named entity recognizer to detect SPECIES in your text is that the model won't only be able to recognise your examples, but also generalise and recognise other species in context. If you only want to find a fixed set of terms and not more, a simpler, rule-based approach might work better for you. You can find examples and details of this here.

If you do want the model to generalise and recognise your entity type in context, you also have to show it examples of the entities in context. That's currently the problem with your training examples: you're only showing the model single words, not sentences containing the words. To get good results, the data you're training the model with needs to be as close as possible to the data you later want to analyse.
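To make the difference concrete, here is a minimal sketch that builds one keyword-only example and one in-context example in the (text, {'entities': [...]}) format used in the question. The helper name `annotate` and the example sentence are hypothetical, made up for illustration. Note that the end offset is exclusive, so 'Bombina' on its own spans (0, 7), not (0, 6).

```python
def annotate(text, term, label="SPECIES"):
    """Return one training example marking the first occurrence of `term`.

    `annotate` is a hypothetical helper, not part of spaCy.
    """
    start = text.find(term)
    if start == -1:
        raise ValueError(f"{term!r} not found in {text!r}")
    end = start + len(term)  # end offset is exclusive: text[start:end] == term
    return (text, {"entities": [(start, end, label)]})


# Keyword only: the model sees no surrounding words to learn from.
keyword_only = annotate("Bombina", "Bombina")

# In context (made-up sentence): the surrounding words are what let
# the model generalise to species it has not seen before.
in_context = annotate("We observed Bombina near the pond.", "Bombina")

print(keyword_only)
print(in_context)
```

A quick sanity check for any annotation is that slicing the text with the stored offsets gives back exactly the entity string.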

While there are other approaches for training models without or with fewer labelled examples, the most straightforward strategy for collecting training data to train your spaCy model is to... label training data. However, there are some tricks you can use to make this less painful:

  • Start with a list of species and use the Matcher or PhraseMatcher to find them in your documents. For each match, you'll get a Span object, so you can extract the start and end position of the span in the text. This easily lets you create a bunch of examples automatically. You can find some more details on this here.

  • Use word vectors to find more similar terms to the entities you're looking for, so you get more examples you can search for in your text using the above approach. I'm not sure how spaCy's vector models will do for your species, since the terms are quite specific. So if you have a large corpus of raw text containing species, you might have to train your own vectors.

  • Use a labelling or data annotation tool. There are open-source solutions like Brat, or, once you're getting more serious about annotation and training, you might also want to check out our annotation tool Prodigy, which is a modern commercial solution that integrates seamlessly with spaCy (Disclaimer: I'm one of the spaCy maintainers).
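The first trick in the list can be sketched roughly as follows. To keep the example runnable without any model or a particular spaCy version, it uses a plain regular expression as a stand-in for the Matcher or PhraseMatcher; with spaCy you would read the same offsets from each matched Span. The function name `build_train_data` and the sentences are assumptions made up for illustration.

```python
import re


def build_train_data(sentences, species, label="SPECIES"):
    """Find known species names in raw sentences and emit training examples.

    A plain-regex stand-in for spaCy's PhraseMatcher (hypothetical helper).
    """
    # Try longer names first so a multi-word name wins over a shorter prefix.
    names = sorted(species, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    train_data = []
    for text in sentences:
        # m.end() is exclusive, matching the offset convention of the examples.
        entities = [(m.start(), m.end(), label) for m in pattern.finditer(text)]
        if entities:  # design choice: skip sentences with no known species
            train_data.append((text, {"entities": entities}))
    return train_data


# Made-up sentences standing in for your own documents.
sentences = [
    "Bombina and Dermaptera were both recorded.",
    "No species are mentioned in this sentence.",
]
TRAIN_DATA = build_train_data(sentences, ["Bombina", "Dermaptera"])
print(TRAIN_DATA)
```

Whether to keep sentences without any match is a judgment call; this sketch drops them, but examples with an empty entity list can also be useful as negative examples.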

Ines Montani answered Sep 17 '22 21:09
