Train spaCy for text classification

After reading the docs and doing the tutorial, I figured I'd make a small demo. It turns out my model does not want to train. Here's the code:

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

When I run this the output suggests that very little is learned.

{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}

This feels wrong. There should be an error or a meaningful tag. The predictions confirm this.

for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)

# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}

It feels like my code is missing something but I can't figure out what.

cantdutchthis asked May 23 '19 19:05


2 Answers

If you upgrade to spaCy 3, the code above will no longer work. The fix is to migrate it with a few changes. I've modified the example from cantdutchthis accordingly.

Summary of changes:

  • use the config to choose the architecture. The old default was "bag of words"; the new default is "text ensemble", which uses attention. Keep this in mind when tuning the models
  • labels now need to be one-hot encoded
  • the add_pipe interface has changed slightly
  • nlp.update now requires Example objects rather than a tuple of text and annotation

import spacy
# Add imports for example, as well as textcat config...
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config
import random

# labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}]
]

# bow
# config = Config().from_str(single_label_bow_config)

# textensemble with attention
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")
# now uses `add_pipe` instead
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")

# Start the training
# (in spaCy 3 this is a deprecated alias of nlp.initialize())
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]

        # uses an example object rather than text/annotation tuple
        examples = [Example.from_dict(doc, annotation) for doc, annotation in zip(
            texts, annotations
        )]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
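
The one-hot labelling mentioned in the summary can be sketched without spaCy. The helper below is hypothetical (`one_hot_cats` and `ENCODED_DATA` are names made up for this illustration, not spaCy functions). It assumes the two category names `KAT0` and `KAT1` used in this answer and spells out both scores for every example, so that exactly one category is "on" per text.

```python
def one_hot_cats(is_kat, positive="KAT0", negative="KAT1"):
    """Hypothetical helper: expand one boolean flag into two exclusive scores."""
    return {
        positive: 1.0 if is_kat else 0.0,
        negative: 0.0 if is_kat else 1.0,
    }


# A couple of texts from the question, labelled with a plain boolean
RAW_DATA = [
    ("My little kitty is so special", True),
    ("The iPhone 8 reviews are here", False),
]

# Wrap the scores under the "cats" key, the annotation key used in this answer
ENCODED_DATA = [(text, {"cats": one_hot_cats(flag)}) for text, flag in RAW_DATA]

for text, annotation in ENCODED_DATA:
    print(text, annotation)
```

Each annotation dictionary could then be paired with `nlp.make_doc(text)` in `Example.from_dict`, as in the training loop of this answer.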
chappers answered Sep 28 '22 04:09


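Based on the comment from Ines, this is the answer. The annotations need to go under the "cats" key as a plain dictionary of labels, rather than under "textcat" wrapped in a list.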

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
category.add_label("KAT")
nlp.add_pipe(category)

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=1):
        texts = [nlp(text) for text, entities in batch]
        # the labels go under "cats" as a plain dict
        annotations = [{"cats": entities} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)
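
Since the question's code ran without an error and only showed a loss of 0.0, a small sanity check on the annotation shape before training may help catch this kind of mistake. This is only a sketch: `check_textcat_annotation` is a hypothetical helper, not part of spaCy, and it only assumes the `{"cats": {label: score}}` shape used in this answer.

```python
def check_textcat_annotation(annotation):
    """Hypothetical check: return a list of problems (empty if the shape looks fine)."""
    problems = []
    cats = annotation.get("cats")
    if cats is None:
        problems.append(f"missing 'cats' key; found keys: {sorted(annotation)}")
    elif not isinstance(cats, dict):
        problems.append(f"'cats' should be a dict, got {type(cats).__name__}")
    else:
        for label, score in cats.items():
            # Scores are expected to be booleans or numbers between 0 and 1
            is_number = isinstance(score, (bool, int, float))
            if not is_number or not 0 <= float(score) <= 1:
                problems.append(f"score for {label!r} should be a bool or in [0, 1]")
    return problems


# Shape used in the question (reported as a problem) vs. the shape used here
print(check_textcat_annotation({"textcat": [{"KAT": True}]}))
print(check_textcat_annotation({"cats": {"KAT": True}}))
```

Running the check over every annotation before calling `nlp.update` would flag the `"textcat"` key from the question while letting the `"cats"` version through.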
cantdutchthis answered Sep 28 '22 04:09
