spaCy: create new language model with data from corpus

I am trying to create a new language model (Luxembourgish) in spaCy, but I am confused about how to do this.

I followed the instructions on their website and did something similar to this post. But what I do not understand is how to add data like a vocab or word vectors (i.e. how to "fill" the language template).

I get that there are some dev tools for some of these operations, but they are poorly documented, so I do not understand how to install and use them properly, especially as they seem to be written in Python 2.7, which clashes with my spaCy installation that uses Python 3.

For now I have a corpus.txt (from a Wikipedia dump) that I want to train on, and a language template with the defaults like stop_words.py, tokenizer_exceptions.py etc. that I created and filled by hand.

Has anyone ever done this properly and could help me here?

asked May 03 '18 by pjominet


1 Answer

There are three main components of a "language model" in spaCy: the "static" language-specific data shipped in Python (tokenizer exceptions, stop words, rules for mapping fine-grained to coarse-grained part-of-speech tags); the statistical model trained to predict part-of-speech tags, dependencies and named entities (trained on a large labelled corpus and included as binary weights); and optional word vectors, which can be converted and added before or after training. You can also train your own vectors on your raw text using a library like Gensim and then add them to spaCy.
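For the vectors part, a minimal sketch of the Gensim route could look like the following. It assumes spaCy v2.x and Gensim 3.x, that corpus.txt is your raw-text file, and that a Luxembourgish Language class is registered as "lb" (the blank pipeline here stands in for the template you built):

import spacy
from gensim.models import Word2Vec

# Train vectors on the raw corpus (one sentence per line; naive whitespace
# tokenisation for brevity; a real pipeline should tokenise properly first)
with open("corpus.txt", encoding="utf8") as f:
    sentences = [line.split() for line in f]
w2v = Word2Vec(sentences, size=300, min_count=5)  # size= in Gensim 3.x (vector_size= in 4.x)

# Add the vectors to a blank pipeline and save it to disk
nlp = spacy.blank("lb")
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv[word])
nlp.to_disk("lb_model_with_vectors")

set_vector should grow the vectors table as needed; tokens without an entry simply keep a zero vector.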

spaCy v2.x allows you to train all pipeline components independently or all in one go, so you can train the tagger, parser and entity recognizer on your data. All of this requires labelled data. If you're training a model for a new language from scratch, you normally use an existing treebank. Here's an example using the Universal Dependencies corpus for Spanish (which is also the one that was used to train spaCy's Spanish model). You can then convert the data to spaCy's JSON format and use the spacy train command to train a model. For example:

git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
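For reference, the .json files that spacy convert writes (and that spacy train reads) use spaCy v2's JSON training format. Here is a trimmed, hand-written illustration of its shape as a Python literal; the toy sentence and tags are made up, not real converter output:

# Trimmed illustration of spaCy v2's JSON training format (toy values).
# "head" is a relative offset to the syntactic head: 0 means the token
# is its own head (the root).
doc = {
    "id": 0,
    "paragraphs": [{
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "Moien", "tag": "INTJ", "head": 0, "dep": "ROOT"},
                {"id": 1, "orth": "!", "tag": "PUNCT", "head": -1, "dep": "punct"},
            ],
        }],
    }],
}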

I don't know what's in your corpus.txt and whether it's fully labelled or only raw text. (I also don't know of any existing resources for Luxembourgish – sounds like that's potentially quite hard to find!) If your data is labelled, you can convert it to spaCy's format using one of the built-in converters or your own little script. If your corpus consists of only raw text, you need to label it first and see if it's suitable to train a general language model. Ultimately, this comes down to experimenting – but here are some strategies:

  • Label your entire corpus manually for each component – e.g. part-of-speech tags if you want to train the tagger, dependency labels if you want to train the parser, and entity spans if you want to train the entity recognizer. You'll need a lot of data though – ideally, a corpus of a similar size to the Universal Dependencies ones.
  • Experiment with teaching an existing pre-trained model Luxembourgish – for example the German model. This might sound strange, but it's not an uncommon strategy. Instead of training from scratch, you post-train the existing model with examples of Luxembourgish (ideally until its predictions on your Luxembourgish text are good enough). You can also create more training data by running the German model over your Luxembourgish text and extracting and correcting its mistakes (see here for details). A minimal sketch of this post-training loop follows this list.
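
Here is a rough sketch of what that post-training idea can look like with the spaCy v2.x training API. The German model name, the single toy example and the iteration count are all illustrative; real post-training needs a lot of corrected examples:

import random
import spacy

nlp = spacy.load("de_core_news_sm")  # an existing German model

# Toy example: "I live in Luxembourg." with "Lëtzebuerg" (characters 14-24)
# labelled as a LOC entity. Real data would contain many such examples.
TRAIN_DATA = [
    ("Ech wunnen zu Lëtzebuerg.", {"entities": [(14, 24, "LOC")]}),
]

# Only update the entity recognizer; leave the other components untouched
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # keep the pre-trained weights
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(epoch, losses)

nlp.to_disk("de_post_trained_lb")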

Remember that you always need evaluation data, too (also referred to as "development data" in the docs). This is usually a random portion of your labelled data that you hold back during training and use to determine whether your model is improving.
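For illustration, holding back a development set can be as simple as shuffling and slicing your labelled examples (examples below is a placeholder for your own data):

import random

examples = [...]  # placeholder: your full list of labelled examples
random.seed(0)    # make the split reproducible
random.shuffle(examples)
cutoff = int(len(examples) * 0.9)  # e.g. a 90/10 train/dev split
train_data, dev_data = examples[:cutoff], examples[cutoff:]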

answered Oct 05 '22 by Ines Montani