Mallet CRF SimpleTagger Performance Tuning

A question for anyone who has used the SimpleTagger class from the Java library Mallet for Conditional Random Fields (CRF). Assume that I'm already using the multi-thread option with the maximum number of CPUs I have available (this is the case): where would I start, and what kinds of things should I try, if I need it to run faster?

A related question: is there a way to use something like Stochastic Gradient Descent to speed up the training process?

The type of training I want to do is simple:

Input:
Feature1 ... FeatureN SequenceLabel
...

Test Data:
Feature1 ... FeatureN
...

Output:

Feature1 ... FeatureN SequenceLabel
...

(Where features are the output of processing I have done on the data in my own code.)

I've had problems getting any CRF implementation other than Mallet to work even approximately, but I may have to backtrack and revisit one of the other implementations, or try a new one.

rplevy asked Mar 28 '11 13:03


1 Answer

Yes, stochastic gradient descent is usually much faster than the L-BFGS optimizer used in Mallet. I would suggest you try CRFSuite, which you can train with either SGD or L-BFGS. You could also give Léon Bottou's SGD-based implementation a try, but that is more difficult to set up.

Otherwise, I believe that CRF++ is the most widely used CRF software around. It is based on L-BFGS, though, so it might not be fast enough for you.

Both CRFSuite and CRF++ should be easy to get started with.
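If you go the CRFSuite route, the main practical step is converting your data. As a rough sketch in Python: SimpleTagger puts the label last on each whitespace-separated line, while CRFSuite, as far as I recall from its documentation, expects the label first with tab-separated attributes, and treats ":" and "\" in attribute names as special characters that need escaping. The function names and sample features below are made up for illustration, and only labeled training lines are handled.

```python
def escape(attr: str) -> str:
    """Escape characters assumed to be special in CRFSuite attribute names."""
    # Backslash must be escaped first so the colon escapes are not doubled.
    return attr.replace("\\", "\\\\").replace(":", "\\:")


def simpletagger_to_crfsuite(lines):
    """Yield label-first, tab-separated lines from label-last SimpleTagger lines."""
    for line in lines:
        tokens = line.split()
        if not tokens:
            # A blank line separates sequences in both formats.
            yield ""
            continue
        *features, label = tokens
        yield "\t".join([label] + [escape(f) for f in features])


# Hypothetical sample: two sequences, label in the last column.
sample = [
    "cap suffix=ing B-NP",
    "lower word:the O",
    "",
    "cap O",
]
converted = list(simpletagger_to_crfsuite(sample))
print("\n".join(converted))
```

Unlabeled test data would need separate handling, since the last token on those lines is a feature rather than a label.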

Note that all of these will be slow if you have a large number of labels. CRFSuite, at least, can be configured to take into account only the label n-grams observed in the training data (in an (n-1)th-order model), which will typically make training and prediction much faster.
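To make that concrete, here is a small Python sketch that assembles a CRFSuite training command using SGD and restricting features to observed label pairs. The helper `build_learn_command` and the file names are hypothetical. The `l2sgd` algorithm name and the `feature.possible_transitions` / `feature.possible_states` parameters are written from my memory of the CRFSuite manual (where 0 should mean "observed only", and I believe is already the default), so please check them against the version you install.

```python
def build_learn_command(train_path, model_path, algorithm="l2sgd", params=None):
    """Build an argument list for `crfsuite learn` (hypothetical helper)."""
    cmd = ["crfsuite", "learn", "-a", algorithm, "-m", model_path]
    # Each training parameter is passed as a separate -p NAME=VALUE option.
    for name, value in (params or {}).items():
        cmd += ["-p", f"{name}={value}"]
    cmd.append(train_path)
    return cmd


cmd = build_learn_command(
    "train.crfsuite.txt",   # hypothetical training file
    "model.crfsuite",       # hypothetical output model file
    params={
        # Assumed parameter names; 0 = only label pairs / attribute-label
        # pairs seen in the training data generate features.
        "feature.possible_transitions": 0,
        "feature.possible_states": 0,
    },
)
print(" ".join(cmd))
# To actually train (requires the crfsuite binary on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to swap `l2sgd` for `lbfgs` and compare training times on the same data.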

Oscar Täckström answered Sep 20 '22 23:09