How to use sklearn Pipeline with custom Features?

I am doing text classification using Python and sklearn. I have some custom features that I use in addition to vectorizers. I would like to know whether it is possible to use them with an sklearn Pipeline, and how the features will be stacked in it.

Below is a short example of my current classification code without a Pipeline. Please tell me if you see anything wrong in it; I would be very grateful for your help. Is it possible to use it with the sklearn Pipeline in some way? I have written my own function, get_features(), which extracts the custom features, transforms the text with the vectorizer, scales the features and finally stacks all of them.

import sklearn.svm
import re
from sklearn import metrics
import numpy
import scipy.sparse
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.preprocessing import StandardScaler

# custom feature example
def words_capitalized(sentence):
    tokens = []
    # tokenize the sentence
    tokens = word_tokenize(sentence)

    counter = 0
    for word in tokens:

        if word[0].isupper():
            counter += 1

    return counter

# custom feature example
def words_length(sentence):
    tokens = []
    # tokenize the sentence
    tokens = word_tokenize(sentence)

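    # average word length: one number per sentence, so that the feature
    # can be stacked as a single column (note: len(), not length())
    total_length = 0
    for word in tokens:
        total_length += len(word)

    return float(total_length) / max(len(tokens), 1)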

def get_features(untagged_text, value, scaler):

    # this function extracts the custom features
    # transforms the vectorizer
    # scales the features
    # and finally stacks all of them

    list_of_length = list()
    list_of_capitals = list()

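    # fit the vectorizer on the training set, only transform the test set
    if value == 1:
        X_bow = countVecWord.fit_transform(untagged_text)
    else:
        X_bow = countVecWord.transform(untagged_text)

    # I also see some people use X_bow = countVecWord.transform(untagged_text).todense(); what does .todense() do here?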

    for sentence in untagged_text:
        list_of_length.append([words_length(sentence)])
        list_of_capitals.append([words_capitalized(sentence)])

    # turn the feature output into numpy column vectors and join them,
    # so that one scaler learns a mean and variance per feature column
    X_length = numpy.array(list_of_length, dtype=float)
    X_capitals = numpy.array(list_of_capitals, dtype=float)
    X_custom = numpy.hstack((X_length, X_capitals))

    if value == 1:
        # fit and transform on the training set
        X_custom = scaler.fit_transform(X_custom)
    else:
        # transform only on the test set, reusing the training statistics
        X_custom = scaler.transform(X_custom)

    # stack all features as a sparse matrix
    X_all = scipy.sparse.hstack((X_bow, X_custom))

    return X_all

def fit_and_predict(train_labels, train_features, test_features, classifier):

    # fit the training set
    classifier.fit(train_features, train_labels)

    # return the classification result
    return classifier.predict(test_features)

if __name__ == '__main__':

    # read_data() is not shown in this example
    input_sets = read_data()

    X = input_sets[0] 
    Y = input_sets[1] 
    X_dev = input_sets[2] 
    Y_dev = input_sets[3] 

    # initialize the count vectorizer
    countVecWord = CountVectorizer(ngram_range=(1, 3))

    scaler = StandardScaler()

    # extract features

    # for training
    X_total = get_features(X, 1, scaler)

    # for dev set
    X_total_dev = get_features(X_dev,  2, scaler)

    # store labels as numpy array
    y_train = numpy.asarray(Y)
    y_dev = numpy.asarray(Y_dev)

    # train the classifier
    SVC1 = LinearSVC(C=1.0)

    y_predicted = fit_and_predict(y_train, X_total, X_total_dev, SVC1)

    print("Result for dev set")
    precision, recall, f1, _ = metrics.precision_recall_fscore_support(y_dev, y_predicted)
    print("Precision: ", precision, " Recall: ", recall, " F1-Score: ", f1)

I know there is FeatureUnion, but I do not know whether it can be used for my purpose and whether it will scale and hstack the features.

EDIT: This seems to be a good start: https://michelleful.github.io/code-blog/2015/06/20/pipelines/

I haven't tried it yet; I will post when I do. The question now is how I can do feature selection with Pipelines.

asked Mar 19 '16 23:03

Ivan Bilan

1 Answer

For anyone interested: the custom feature class needs to have fit and transform methods; it can then be used in a FeatureUnion. For a detailed example, see my other question, "How to fit different inputs into an sklearn Pipeline?"
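As a rough illustration of that idea, here is a minimal sketch of a custom feature class inside a FeatureUnion. It is not the code from the linked question. The class name SentenceStats, the step names and the toy sentences are made up for this example, and str.split() stands in for nltk's word_tokenize so the snippet has no extra dependency. The scaler is nested in a small Pipeline so that only the custom features are scaled; FeatureUnion then stacks the outputs of both branches column-wise.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


class SentenceStats(BaseEstimator, TransformerMixin):
    """Hypothetical custom feature class: two numbers per sentence."""

    def fit(self, X, y=None):
        # Nothing to learn here; fit only has to exist and return self.
        return self

    def transform(self, X):
        rows = []
        for sentence in X:
            tokens = sentence.split()  # stand-in for word_tokenize
            n_capitalized = sum(1 for t in tokens if t[0].isupper())
            mean_length = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
            rows.append([n_capitalized, mean_length])
        # One row per sentence, one column per custom feature.
        return np.array(rows, dtype=float)


features = FeatureUnion([
    # Branch 1: bag of n-grams (sparse output).
    ("bow", CountVectorizer(ngram_range=(1, 3))),
    # Branch 2: custom features, scaled on their own.
    ("stats", Pipeline([
        ("extract", SentenceStats()),
        ("scale", StandardScaler()),
    ])),
])

model = Pipeline([
    ("features", features),
    ("clf", LinearSVC(C=1.0)),
])

# Toy data, invented for this sketch.
train_texts = [
    "The Quick Brown Fox Jumps",
    "a slow dog sleeps all day",
    "Big Red Bus Stops Here",
    "tiny quiet mice hide away",
]
train_labels = [1, 0, 1, 0]

model.fit(train_texts, train_labels)
predictions = model.predict(["Large Green Tree Falls", "small grey cat naps"])
```

Calling model.fit() fits the vectorizer and the scaler on the training texts only, and model.predict() reuses them, so the manual "fit_transform for training, transform for test" switch is no longer needed.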

answered Nov 01 '22 15:11

Ivan Bilan

Tags: python, machine-learning, classification, scikit-learn, pipeline
