
Training custom dataset with translate model

Running the model out of the box generates these files in the data dir:

ls
dev-v2.tgz                            newstest2013.en
giga-fren.release2.fixed.en           newstest2013.en.ids40000
giga-fren.release2.fixed.en.gz        newstest2013.fr
giga-fren.release2.fixed.en.ids40000  newstest2013.fr.ids40000
giga-fren.release2.fixed.fr           training-giga-fren.tar
giga-fren.release2.fixed.fr.gz        vocab40000.from
giga-fren.release2.fixed.fr.ids40000  vocab40000.to

Reading the source of translate.py:

https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/translate.py

tf.app.flags.DEFINE_string("from_train_data", None, "Training data.")
tf.app.flags.DEFINE_string("to_train_data", None, "Training data.")

To use my own training data I created the dirs my-from-train-data and my-to-train-data and added my own training data to each of them. The training data is contained in the files mydata.from and mydata.to:

my-from-train-data contains mydata.from
my-to-train-data contains mydata.to

I could not find documentation on using your own training data or what format it should take, so I inferred this from the translate.py source and the contents of the data dir created when running the translate model out of the box.

Contents of mydata.from :

 Is this a question

Contents of mydata.to :

 Yes!

I then attempt to train the model using:

python translate.py --from_train_data my-from-train-data --to_train_data my-to-train-data

This returns an error:

tensorflow.python.framework.errors_impl.NotFoundError: my-from-train-data.ids40000

It appears I need to create the file my-from-train-data.ids40000. What should its contents be? Is there an example of how to train this model using custom data?

asked Feb 02 '17 18:02 by blue-sky

2 Answers

Great question! Training a model on your own data is way more fun than using the standard data. An example of what you could put in the terminal is:

python translate.py --from_train_data mydatadir/to_translate.in --to_train_data mydatadir/to_translate.out --from_dev_data mydatadir/test_to_translate.in --to_dev_data mydatadir/test_to_translate.out --train_dir train_dir_model --data_dir mydatadir

What goes wrong in your example is that you are not pointing to a file, but to a folder. from_train_data should always point to a plaintext file, whose rows should be aligned with those in the to_train_data file.
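If you want to check the alignment before training, a minimal sketch could look like the one below. The helper names (write_parallel_corpus, is_aligned) and the second sentence pair are made up for illustration; they are not part of translate.py.

```python
import os
import tempfile


def write_parallel_corpus(pairs, from_path, to_path):
    """Write (source, target) pairs to two files, one sentence per line."""
    with open(from_path, "w", encoding="utf-8") as f_from, \
            open(to_path, "w", encoding="utf-8") as f_to:
        for source, target in pairs:
            # Collapse whitespace so a stray newline cannot break alignment.
            f_from.write(" ".join(source.split()) + "\n")
            f_to.write(" ".join(target.split()) + "\n")


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def is_aligned(from_path, to_path):
    """True when both files have the same number of lines."""
    return count_lines(from_path) == count_lines(to_path)


# Illustrative data only; the second pair is invented for this example.
pairs = [
    ("Is this a question", "Yes!"),
    ("Is this another question", "Yes, it is."),
]
data_dir = tempfile.mkdtemp()
from_path = os.path.join(data_dir, "mydata.from")
to_path = os.path.join(data_dir, "mydata.to")
write_parallel_corpus(pairs, from_path, to_path)
print(is_aligned(from_path, to_path))
```

Equal line counts are only a necessary condition: the check cannot tell whether line N in one file really is the translation of line N in the other.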

Also: as soon as you run this script with sensible data (more than one line ;) ), translate.py will generate your ids (with a vocabulary size of 40,000 if from_vocab_size and to_vocab_size are not set). It is important to know that this file is created in the folder specified by data_dir. If you do not specify one, it is generated in /tmp (I prefer to have it in the same place as my data).

Hope this helps!

answered Oct 07 '22 19:10 by rmeertens


Quick answer to:

It appears I need to create the file my-from-train-data.ids40000. What should its contents be? Is there an example of how to train this model using custom data?

Yes, what is missing is the vocab/word-id file, which is generated when the data is prepared for training.
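To give a feel for what such a word-id file contains, here is a simplified sketch. It is not the code from data_utils.py: the tokenizer, the tie-breaking and the function names are assumptions for illustration, and the second sentence is invented. The idea is that each line of text becomes a line of integer ids looked up in a vocabulary, with a few ids reserved for special tokens.

```python
import re
from collections import Counter

# Reserved special tokens occupy the first ids (assumed layout for this sketch).
START_VOCAB = ["_PAD", "_GO", "_EOS", "_UNK"]
UNK_ID = START_VOCAB.index("_UNK")


def tokenize(sentence):
    """Rough tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def build_vocab(sentences, max_size):
    """Map the most frequent tokens to ids, after the special tokens."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    # Most frequent first; ties broken alphabetically to stay deterministic.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    vocab_list = (START_VOCAB + words)[:max_size]
    return {word: idx for idx, word in enumerate(vocab_list)}


def sentence_to_ids(sentence, vocab):
    """Replace each token by its id; unknown tokens map to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokenize(sentence)]


# Illustrative corpus only.
sentences = ["Is this a question", "Is this it"]
vocab = build_vocab(sentences, max_size=40000)
ids_lines = [" ".join(map(str, sentence_to_ids(s, vocab))) for s in sentences]
print(ids_lines)
```

So you normally do not write the .ids40000 file by hand; it is derived from your plaintext file and the vocabulary.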

Here is a tutorial from the TensorFlow documentation.

A quick overview of the files, and why you might be confused by the files that are output versus the ones to use:

  • python/ops/seq2seq.py: >> Library for building sequence-to-sequence models.
  • models/rnn/translate/seq2seq_model.py: >> Neural translation sequence-to-sequence model.
  • models/rnn/translate/data_utils.py: >> Helper functions for preparing translation data.
  • models/rnn/translate/translate.py: >> Binary that trains and runs the translation model.

The TensorFlow translate.py file requires several files to be generated when you use your own corpus for translation.

  1. The corpus needs to be aligned, meaning line 1 of the file in one language corresponds to line 1 of the file in the other language. This allows the model to do encoding and decoding.

  2. You want to make sure the vocabulary has been generated from the dataset. Check this command:

python translate.py --data_dir [your_data_directory] --train_dir [checkpoints_directory] --en_vocab_size=40000 --fr_vocab_size=40000

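Note! If your vocabulary size is lower than 40000, change these values accordingly. Depending on your version of translate.py, these flags may be named --from_vocab_size and --to_vocab_size instead.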
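To get a rough idea of which value to pass, you could count the distinct tokens in your corpus first. This is only a sketch: it splits on whitespace, which is cruder than what translate.py does, and the function name, the reserved-token count and the sample corpus are assumptions for illustration.

```python
from collections import Counter

# Assumption: a few ids are reserved for padding/start/end/unknown symbols.
RESERVED_TOKENS = 4


def suggest_vocab_size(lines, cap=40000):
    """Distinct whitespace-separated tokens plus reserved ids, capped."""
    counts = Counter(tok for line in lines for tok in line.split())
    return min(cap, len(counts) + RESERVED_TOKENS)


# Illustrative corpus only.
corpus = ["Is this a question", "Is this it"]
print(suggest_vocab_size(corpus))
```

For a tiny corpus the result will be far below 40000, which is the case the note above is about.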

There is a longer discussion in tensorflow/issues/600.

If all else fails, check out this ByteNet implementation in TensorFlow, which does the translation task as well.

answered Oct 07 '22 18:10 by 0bserver07