Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Any hints on how to train a nn dependency parser on a new corpus?

Tags:

stanford-nlp

We'd like to train the Stanford NN dependency parser on a Russian corpus, are there any hints on how to do it? The hyper-parameters are described in the paper, however it would be nice to understand how to prepare the training data (Annotations, and specifically how to create word2vec annotations). Any help or a reference to some document is greatly appreciated!

Thanks!

like image 617
randomsurfer_123 Avatar asked Sep 27 '22 20:09

randomsurfer_123


1 Answers

Here are some answers:

  • the site for word2vec if you want to build vector representations for Russian:

    https://code.google.com/p/word2vec/

  • the dependencies need to be in the CoNLL-X format:

    http://ilk.uvt.nl/conll/#dataformat

    The word embeddings should be in this format (each word vector on its own line):

    WORD\tn0 n1 n2 n3 n4 ...

    for instance:

    apple .45242 .392323 .111423 .999334

    put your embeddings in a file called russian_embeddings.txt

  • the training command (assumes your word vectors have dimension=50)

    java edu.stanford.nlp.parser.nndep.DependencyParser -tlp edu.stanford.nlp.trees.international.RussianTreebankLanguagePack -trainFile russian/train.conll -devFile russian/dev.conll -embedFile russian_embeddings.txt -embeddingSize 50 -model nndep.russian.model.txt.gz
    
  • A big complication is that as of the moment, edu.stanford.nlp.trees.international.RussianTreebankLanguagePack does not exist, so you will have to create this class and model it after the TreebankLanguagePacks for other languages ; if you look around in the package edu.stanford.nlp.trees.international , you can see what these TreebankLanguagePack files look like for other languages (note: the French one is only 143 lines long, so making a similar class for Russian is not out of the question at all) ; I will consult with other group members and see if I can get some clarity on what you'd have to do to complete this task

  • There are a lot of challenges to building this Russian NN dependency parse model. If you would like more help please let me know. I will talk to the developers of the NN parser and see if I can give you more advice, these answers are meant as a starting point!

like image 166
StanfordNLPHelp Avatar answered Sep 30 '22 08:09

StanfordNLPHelp