Train spaCy for text classification

Tags:

python

spacy

After reading the docs and doing the tutorial, I figured I'd make a small demo. It turns out my model does not want to train. Here's the code:

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

When I run this the output suggests that very little is learned.

{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}

This feels wrong. There should be either an error or a meaningful tag. The predictions confirm that nothing was learned; every text gets the same score.

for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)

# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}

It feels like my code is missing something but I can't figure out what.

274

asked May 23 '19 19:05

cantdutchthis

2 Answers

If you upgrade to spaCy 3, the code above will no longer work. It needs to be migrated with a few changes. I've modified the example from cantdutchthis accordingly.

Summary of changes:

- Use the config to change the architecture. The old default was "bag of words"; the new default is "text ensemble", which uses attention. Keep this in mind when tuning the models.
- Labels now need to be one-hot encoded.
- The add_pipe interface has changed slightly.
- nlp.update now requires Example objects rather than a (text, annotation) tuple.
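As a rough illustration of the one-hot point, here is a minimal sketch in plain Python. It assumes the old data used a single boolean label "KAT" and that the two classes are named KAT0 and KAT1. The helper name to_one_hot is made up for this example and is not part of spaCy.

```python
# Hypothetical helper (not part of spaCy): turn a single boolean label into
# a dict with one entry per class, exactly one of which is True.
def to_one_hot(is_positive, positive_label="KAT0", negative_label="KAT1"):
    return {positive_label: bool(is_positive), negative_label: not is_positive}


# Old-style rows with one boolean label per text.
OLD_STYLE = [
    ["My little kitty is so special", {"KAT": True}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
]

# New-style rows with one entry per class.
NEW_STYLE = [[text, to_one_hot(cats["KAT"])] for text, cats in OLD_STYLE]
print(NEW_STYLE)
```

Spelling out the False entry as well is a choice made here for readability; the training data in this answer lists only the label that is True.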

import spacy
# New imports: the Example class and the textcat model configs
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config
import random

# labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}]
]


# bow
# config = Config().from_str(single_label_bow_config)

# textensemble with attention
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")
# add_pipe now takes the component name and returns the component
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")


# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]

        # uses an example object rather than text/annotation tuple
        examples = [Example.from_dict(doc, annotation) for doc, annotation in zip(
            texts, annotations
        )]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
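To check whether training worked, the prediction loop from the question can be reused and each cats dict reduced to its best-scoring label. This is a minimal sketch in plain Python: best_label is a made-up helper, and the scores are invented placeholders standing in for nlp(text).cats, not real model output.

```python
# Hypothetical helper: pick the highest-scoring label from a cats dict,
# i.e. the kind of {label: score} mapping that nlp(text).cats holds.
def best_label(cats):
    return max(cats, key=cats.get)


# Placeholder scores for illustration only, not output from a trained model.
example_cats = {"KAT0": 0.9, "KAT1": 0.1}
print(best_label(example_cats))

# With the trained pipeline this would be used roughly as:
# for text, _ in TRAINING_DATA:
#     print(text, best_label(nlp(text).cats))
```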

109

answered Sep 28 '22 04:09

chappers

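Based on the comment from Ines, this is the answer for spaCy 2. The annotations passed to nlp.update must use the "cats" key, mapping each label to a boolean, instead of {"textcat": [...]}. The label is also added before the component is added to the pipeline.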

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
category.add_label("KAT")
nlp.add_pipe(category)

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=1):
        texts = [nlp(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)
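The part that differs most from the question is the shape of the annotations. A small sketch in plain Python that isolates it; make_annotations is a name made up here, not a spaCy function.

```python
# Hypothetical helper: build the texts and annotation dicts for one batch.
# Each annotation maps the "cats" key to a {label: bool} dict.
def make_annotations(batch):
    texts = [text for text, cats in batch]
    annotations = [{"cats": cats} for text, cats in batch]
    return texts, annotations


batch = [
    ["Noa is a great cat name.", {"KAT": True}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
]
texts, annotations = make_annotations(batch)
print(annotations)
# The question's version wrapped the labels as {"textcat": [cats]} instead.
```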

answered Sep 28 '22 04:09

cantdutchthis
