I want to train a TextCategorizer model with the following (text, label)
pairs.
Label COLOR:
Label ANIMAL:
I am copying the example code in the documentation for TextCategorizer.
textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
The doc variables will presumably be just nlp("The door is brown.") and so on. What should be in gold1 and gold2? I'm guessing they should be GoldParse objects, but I don't see how you represent text categorization information in those.
According to the train_textcat.py example, it should be something like {'cats': {'ANIMAL': 0, 'COLOR': 1}} if you want to train a multi-label model. Also, if you have only two classes, you can simply use {'cats': {'ANIMAL': 1}} for label ANIMAL and {'cats': {'ANIMAL': 0}} for label COLOR.
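To make the annotation format concrete, here is a minimal sketch that builds those dicts from plain (text, label) pairs. The helper name make_cats and the second example sentence are my own inventions for illustration; only the {'cats': {...}} shape and "The door is brown." come from the posts above.

```python
def make_cats(label, all_labels):
    """Build a {'cats': {...}} annotation with 1 for `label` and 0 for the rest.

    Hypothetical helper, not part of spaCy.
    """
    if label not in all_labels:
        raise ValueError("unknown label: {!r}".format(label))
    return {"cats": {name: int(name == label) for name in all_labels}}


LABELS = ["ANIMAL", "COLOR"]

# (text, label) pairs; the second sentence is a made-up placeholder.
pairs = [
    ("The door is brown.", "COLOR"),
    ("The dog is asleep.", "ANIMAL"),
]

# Convert to (text, annotation) pairs in the format described above.
train_data = [(text, make_cats(label, LABELS)) for text, label in pairs]

for text, annotation in train_data:
    print(text, annotation)
```

Each annotation then has one entry per label, so the same list can be used for the multi-label case by setting more than one value to 1.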
You can use the following minimal working example for single-category text classification:
import spacy

nlp = spacy.load('en')

train_data = [
    (u"That was very bad", {"cats": {"POSITIVE": 0}}),
    (u"it is so bad", {"cats": {"POSITIVE": 0}}),
    (u"so terrible", {"cats": {"POSITIVE": 0}}),
    (u"I like it", {"cats": {"POSITIVE": 1}}),
    (u"It is very good.", {"cats": {"POSITIVE": 1}}),
    (u"That was great!", {"cats": {"POSITIVE": 1}}),
]

# Add a text categorizer at the end of the pipeline and register the label
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('POSITIVE')

optimizer = nlp.begin_training()
for itn in range(100):
    # Each item is a raw text plus its {"cats": ...} annotation dict
    for doc, gold in train_data:
        nlp.update([doc], [gold], sgd=optimizer)

# Run the trained pipeline on new text and inspect the category scores
doc = nlp(u'It is good.')
print(doc.cats)
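doc.cats is a dict mapping each label to a score, so you will usually want to turn it into a decision. A minimal sketch, assuming a 0.5 cutoff (my choice, not something spaCy prescribes); the helper name top_label and the scores below are made up for illustration:

```python
def top_label(cats, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold.

    Hypothetical helper; `cats` is a dict like the one in doc.cats.
    """
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Made-up scores standing in for doc.cats
print(top_label({"POSITIVE": 0.93}))
print(top_label({"POSITIVE": 0.12}))
print(top_label({"ANIMAL": 0.2, "COLOR": 0.7}))
```

With a single POSITIVE label, a None result can be read as "not positive".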