
Using scikit-learn to train an NLP log-linear model for NER

I wonder how to use sklearn.linear_model.LogisticRegression to train an NLP log-linear model for named-entity recognition (NER).

A typical log-linear model defines a conditional probability as follows:

$$p(y \mid x; v) = \frac{\exp(v \cdot f(x, y))}{\sum_{y'} \exp(v \cdot f(x, y'))}$$

with:

  • x: the current word
  • y: the class of a word being considered
  • f: the feature vector function, which maps a word x and a class y to a vector of scalars.
  • v: the feature weight vector

Can sklearn.linear_model.LogisticRegression train such a model?

The issue is that the features depend on the class: f takes both the word x and the candidate class y, whereas scikit-learn estimators expect a feature vector that is a function of x alone.
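
For concreteness, here is a minimal sketch in plain numpy of how such a model scores classes with a class-dependent f (the tag set, feature templates, and weights below are made up for illustration, not from any library):

import numpy as np

CLASSES = ['PER', 'LOC', 'O']  # toy tag set

def f(x, y):
    # Toy class-dependent feature vector f(x, y): word-level features
    # conjoined with the candidate class y (one block per class; only
    # the block belonging to y is non-zero).
    word_feats = np.array([x[0].isupper(), x.endswith('ton'), len(x) > 6],
                          dtype=float)
    vec = np.zeros(len(word_feats) * len(CLASSES))
    i = CLASSES.index(y)
    vec[i * len(word_feats):(i + 1) * len(word_feats)] = word_feats
    return vec

v = np.random.randn(9)  # feature weight vector (untrained, random)

def p(y, x):
    # p(y | x; v) = exp(v . f(x, y)) / sum_y' exp(v . f(x, y'))
    scores = np.array([v @ f(x, y_prime) for y_prime in CLASSES])
    scores -= scores.max()  # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[CLASSES.index(y)]

print(p('LOC', 'Washington'))

Note that with this block structure, v · f(x, y) reduces to a class-specific slice of v dotted with the word features, which is exactly the parameterization that multinomial logistic regression learns.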

asked Oct 20 '15 by Franck Dernoncourt



1 Answer

In scikit-learn 0.16 and higher, you can use the multinomial option for sklearn.linear_model.LogisticRegression to train a log-linear model (a.k.a. MaxEnt classifier, multiclass logistic regression). Currently the multinomial option is supported only by the ‘lbfgs’ and ‘newton-cg’ solvers.

Example with the Iris data set (4 features, 3 classes, 150 samples):

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import print_function
from __future__ import division

from sklearn import linear_model, datasets
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Import data 
iris = datasets.load_iris()
X = iris.data # features
y_true = iris.target # labels

# Look at the size of the feature matrix and the label vector:
print('iris.data.shape: {0}'.format(iris.data.shape))
print('iris.target.shape: {0}\n'.format(iris.target.shape))

#  Instantiate a MaxEnt model
logreg = linear_model.LogisticRegression(C=1e5, multi_class='multinomial', solver='lbfgs')

# Train the model
logreg.fit(X, y_true)
print('logreg.coef_: \n{0}\n'.format(logreg.coef_))
print('logreg.intercept_: \n{0}'.format(logreg.intercept_))

# Use the model to make predictions
y_pred = logreg.predict(X)
print('\ny_pred: \n{0}'.format(y_pred))

# Assess the quality of the predictions
print('\nconfusion_matrix(y_true, y_pred):\n{0}\n'.format(confusion_matrix(y_true, y_pred)))
print('classification_report(y_true, y_pred): \n{0}'.format(classification_report(y_true, y_pred)))
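
The Iris example above uses numeric features directly. For NER you would first map per-token features to vectors; here is a minimal sketch using DictVectorizer, with a made-up feature template and a toy sentence (the feature names and labels are illustrative, not from any standard):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentence with per-token NER labels
tokens = ['Franck', 'lives', 'in', 'Paris']
labels = ['PER', 'O', 'O', 'LOC']

def token_features(sent, i):
    # Made-up feature template: current word plus its neighbors
    return {
        'word.lower': sent[i].lower(),
        'word.istitle': sent[i].istitle(),
        'prev.lower': sent[i - 1].lower() if i > 0 else '<BOS>',
        'next.lower': sent[i + 1].lower() if i < len(sent) - 1 else '<EOS>',
    }

feats = [token_features(tokens, i) for i in range(len(tokens))]

vectorizer = DictVectorizer()
X_ner = vectorizer.fit_transform(feats)  # sparse one-hot feature matrix

ner_clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
ner_clf.fit(X_ner, labels)
print(ner_clf.predict(vectorizer.transform([token_features(tokens, 3)])))

A real NER system would add richer features (prefixes/suffixes, word shapes, gazetteers), use BIO-encoded labels, and evaluate on held-out sentences rather than the training tokens.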

The multinomial option for sklearn.linear_model.LogisticRegression was introduced in version 0.16:

  • Add multi_class="multinomial" option in linear_model.LogisticRegression to implement a logistic regression solver that minimizes the cross-entropy or multinomial loss instead of the default one-vs-rest setting. Supports lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.
answered Oct 12 '22 by Franck Dernoncourt