
How to save Python NLTK alignment models for later use?

In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be time-consuming, especially on large corpora. It would be nice to run the alignments in batch one day and use them later on.

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once I create a model, how can I (1) save it to disk and (2) reuse it later?

Merchako asked May 12 '15 15:05



2 Answers

The immediate answer is to pickle it, see https://wiki.python.org/moin/UsingPickle

But because the IBMModel1 object holds a lambda function, it cannot be pickled with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).

So we'll use dill. First, install dill (see the related question "Can Python pickle lambda functions?"):

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

If you try to pickle the IBMModel1 object, which holds a lambda function, with the default cPickle, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the above code snippet comes from NLTK version 3.0.0)
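The same failure can be reproduced without NLTK. This is a minimal sketch that assumes the model keeps its probabilities in a `defaultdict` whose default factory is a lambda (an assumption about NLTK 3.0.0 internals, based on the source lines linked above); the `table` name and its contents are made up for illustration.

```python
import pickle
from collections import defaultdict

# Stand-in for a probability table with a lambda default (hypothetical name).
table = defaultdict(lambda: 1e-12)
table["Haus"] = 0.9

try:
    pickle.dumps(table)
    outcome = "pickled"
except (pickle.PicklingError, AttributeError, TypeError) as err:
    # The standard pickler stores functions by qualified name, and a lambda
    # has no importable name to look up.
    outcome = "failed: " + type(err).__name__

# A plain dict holding the same entries pickles without trouble.
restored = pickle.loads(pickle.dumps(dict(table)))
print(outcome)
print(restored)
```

The data itself is not the problem; only the lambda attached to the container is, which is why a pickler that can serialize functions by value (such as dill) gets around it.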

In Python 3 with NLTK 3.0.0, you will face the same problem because the IBMModel1 object holds a lambda function:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

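(Note: In Python 3, the pickle module automatically uses the C implementation that Python 2 exposed as cPickle, so there is no separate cPickle to import; see http://docs.pythonsprints.com/python3_porting/py-porting.html)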
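If adding dill as a dependency is not an option, one alternative is to copy the lambda-backed tables into plain dicts before pickling and rebuild the defaults after loading. This is not part of the answer above and not an NLTK API; it is a sketch on a toy two-level table, and `MIN_PROB`, `to_plain`, `from_plain` and the sample words are all hypothetical names chosen for illustration.

```python
import pickle
from collections import defaultdict

MIN_PROB = 1e-12  # assumed default probability for unseen word pairs


def to_plain(table):
    """Copy a two-level defaultdict into ordinary nested dicts."""
    return {src: dict(targets) for src, targets in table.items()}


def _unseen():
    # Named module-level factory used in place of a lambda.
    return MIN_PROB


def _row():
    return defaultdict(_unseen)


def from_plain(plain):
    """Rebuild the two-level defaultdict using the named factories."""
    table = defaultdict(_row)
    for src, targets in plain.items():
        table[src].update(targets)
    return table


# Toy table standing in for a trained model's translation probabilities.
trained = defaultdict(lambda: defaultdict(lambda: MIN_PROB))
trained["Haus"]["house"] = 0.9

payload = pickle.dumps(to_plain(trained))  # standard pickle handles plain dicts
reloaded = from_plain(pickle.loads(payload))

print(reloaded["Haus"]["house"])
print(reloaded["Haus"]["home"])
```

Whether this is enough for a real IBMModel1 depends on which attributes the installed NLTK version stores on the object, so treat it as a starting point rather than a drop-in replacement for dill.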

alvas answered Nov 10 '22 11:11


You discuss saving the aligner model, but your question seems to be more about saving the bitexts that you have aligned: "It would be nice to do alignments in batch one day and use those alignments later on." I'm going to answer this question.

In the NLTK environment, the best way to use a corpus-like resource is to access it with a corpus reader. The NLTK doesn't come with corpus writers, but the format supported by the NLTK's AlignedCorpusReader is very easy to generate (NLTK 3 version):

model = ibm(biverses, 20)  # As in your question

# Write each aligned pair as three lines: words, mots, alignment
with open("folder/newalignedtext.txt", "w") as out:
    for pair in biverses:
        asent = model.align(pair)
        out.write(" ".join(asent.words) + "\n")
        out.write(" ".join(asent.mots) + "\n")
        out.write(str(asent.alignment) + "\n")

That's it. You can later reload and use your aligned sentences exactly as you'd use the comtrans corpus:

from nltk.corpus.reader import AlignedCorpusReader

mycorpus = AlignedCorpusReader(r"folder", r".*\.txt")
biverses_reloaded = mycorpus.aligned_sents()

As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is pretty useless unless you want to study the embedded probabilities.
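The three-line layout written by the loop above can also be tried out without NLTK. The sketch below is a standard-library round trip over made-up sentence pairs; it assumes the alignment line is a space-separated list of `i-j` index pairs, and `write_pairs` / `read_pairs` are hypothetical helpers, not the corpus reader's implementation.

```python
import os
import tempfile

# Hypothetical aligned pairs: (source words, target words, index pairs).
pairs = [
    (["das", "Haus"], ["the", "house"], [(0, 0), (1, 1)]),
    (["ein", "Buch"], ["a", "book"], [(0, 0), (1, 1)]),
]


def write_pairs(path, records):
    """Write each record as three lines: words, mots, 'i-j' alignment pairs."""
    with open(path, "w", encoding="utf-8") as out:
        for words, mots, alignment in records:
            out.write(" ".join(words) + "\n")
            out.write(" ".join(mots) + "\n")
            out.write(" ".join(f"{i}-{j}" for i, j in alignment) + "\n")


def read_pairs(path):
    """Read the three-line records back into the same tuple structure."""
    with open(path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src]
    records = []
    for k in range(0, len(lines) - 2, 3):
        words = lines[k].split()
        mots = lines[k + 1].split()
        alignment = [
            tuple(int(n) for n in item.split("-"))
            for item in lines[k + 2].split()
        ]
        records.append((words, mots, alignment))
    return records


path = os.path.join(tempfile.mkdtemp(), "newalignedtext.txt")
write_pairs(path, pairs)
reloaded = read_pairs(path)
print(reloaded == pairs)
```

Because the file is plain text, it stays readable across Python and NLTK versions, which a pickled aligner object does not guarantee.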

Comment: I'm not sure I would call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text; it doesn't even have an align() method. In NLTK 3, align() can align new text, but only under Python 2; under Python 3 it is broken, apparently because of the tightened rules for comparing objects of different types. If you nevertheless want to pickle and reload the aligner, I'll be happy to add that to my answer; from what I've seen, it can be done with vanilla cPickle.

alexis answered Nov 10 '22 12:11