Too many _coef values for LogisticRegression in Pipeline

I'm using the sklearn-pandas DataFrameMapper in a scikit-learn Pipeline. To evaluate feature contributions in a feature-union pipeline, I'd like to inspect the coefficients of the estimator (LogisticRegression). In the following example, three text columns a, b, and c are vectorised and used for X_train:

import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)

data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns

X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

mapper = DataFrameMapper([
        ('a', CountVectorizer()),
        ('b', CountVectorizer()),
        ('c', CountVectorizer())
])

pipeline = Pipeline([
        ('featurize', mapper),
        ('clf', LogisticRegression(random_state=1))
        ])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(abs(pipeline.named_steps['clf'].coef_))
#array([[0.3567311 , 0.3567311 , 0.46215153, 0.10542043, 0.3567311 ,
#        0.46215153, 0.46215153, 0.3567311 , 0.3567311 , 0.3567311 ,
#        0.3567311 , 0.46215153, 0.46215153, 0.3567311 , 0.46215153,
#        0.3567311 , 0.3567311 , 0.3567311 , 0.3567311 , 0.46215153,
#        0.46215153, 0.46215153, 0.3567311 , 0.3567311 ]])

print(len(pipeline.named_steps['clf'].coef_[0]))
#24

Unlike a typical analysis of multiple features, which returns one coefficient per input feature, the DataFrameMapper produces a larger coefficient matrix.

a) How is the total of 24 coefficients in the case above explained? b) What's the best way to access the coef_ value of each feature ("a", "b", "c")?

Desired output:

a: coef_score (float)
b: coef_score (float)
c: coef_score (float)

Thank you!

Christopher asked Dec 18 '22
1 Answer

Although your initial dataframe did indeed only contain columns for your three features a, b, and c, the sklearn-pandas DataFrameMapper() class applied scikit-learn's CountVectorizer() to the text corpus of each of the columns a, b, and c. This resulted in a grand total of 24 features, which were then passed to your LogisticRegression() classifier. That is why you got an unlabeled list of 24 values when you accessed the classifier's .coef_ attribute.

However, it's pretty straightforward to match each of those 24 coef_ scores with the original column (a, b, or c) it came from, and then calculate the average coefficient score for each column. Here's how we'd do it:

The original dataframe looks like this:

             a                   b                c   result
2   here we go   hello here we are   this is a test        0
73  here we go   hello here we are   this is a test        0
...

And if we run the following line, we can see a list of all 24 features that were created by the DataFrameMapper/CountVectorizer() used in your mapper object:

pipeline.named_steps['featurize'].transformed_names_

['a_another',
 'a_example',
 'a_go',
 'a_here',
 'a_is',
 'a_we',
 'b_are',
 'b_column',
 'b_content',
 'b_every',
 'b_has',
 'b_hello',
 'b_here',
 'b_text',
 'b_we',
 'c_can',
 'c_deal',
 'c_feature',
 'c_how',
 'c_is',
 'c_test',
 'c_this',
 'c_union',
 'c_with']

len(pipeline.named_steps['featurize'].transformed_names_)

24

Now, here's how we'd calculate the average coef scores for the three sets of features that came from a/b/c columns:

col_names = list(data.drop(['result'], axis=1).columns.values)
vect_feats = pipeline.named_steps['featurize'].transformed_names_
clf_coef_scores = abs(pipeline.named_steps['clf'].coef_)

def get_avg_coef_scores(col_names, vect_feats, clf_coef_scores):
    scores = {}
    start_pos = 0
    for n in col_names:
        # Count the vectorised features derived from column n
        # (prefix match also handles multi-character column names).
        num_vect_feats = len([i for i in vect_feats if i.startswith(n + '_')])
        end_pos = start_pos + num_vect_feats
        scores[n + '_avg_coef_score'] = np.mean(clf_coef_scores[0][start_pos:end_pos])
        start_pos = end_pos
    return scores

If we call the function we just wrote, we get the following output:

get_avg_coef_scores(col_names, vect_feats, clf_coef_scores)

{'a_avg_coef_score': 0.3499861323284858,
 'b_avg_coef_score': 0.40358462487685853,
 'c_avg_coef_score': 0.3918712435073411}

If we want to verify which of the 24 coef_ scores belongs to each created feature, we can use the following dictionary comprehension:

{key:clf_coef_scores[0][i] for i, key in enumerate(vect_feats)}

{'a_another': 0.3567310993987888,
 'a_example': 0.3567310993987888,
 'a_go': 0.4621515317244458,
 'a_here': 0.10542043232565701,
 'a_is': 0.3567310993987888,
 'a_we': 0.4621515317244458,
 'b_are': 0.4621515317244458,
 'b_column': 0.3567310993987888,
 'b_content': 0.3567310993987888,
 'b_every': 0.3567310993987888,
 'b_has': 0.3567310993987888,
 'b_hello': 0.4621515317244458,
 'b_here': 0.4621515317244458,
 'b_text': 0.3567310993987888,
 'b_we': 0.4621515317244458,
 'c_can': 0.3567310993987888,
 'c_deal': 0.3567310993987888,
 'c_feature': 0.3567310993987888,
 'c_how': 0.3567310993987888,
 'c_is': 0.4621515317244458,
 'c_test': 0.4621515317244458,
 'c_this': 0.4621515317244458,
 'c_union': 0.3567310993987888,
 'c_with': 0.3567310993987888}
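And if you want the strongest individual tokens rather than per-column averages, you can sort that same per-feature dictionary by score (an illustrative subset of the values is used here):

```python
# Subset of the per-feature scores built above, for illustration.
feature_scores = {'a_here': 0.105, 'a_go': 0.462,
                  'b_hello': 0.462, 'c_union': 0.357}

# Sort features by coefficient magnitude, largest first.
ranked = sorted(feature_scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f'{name}: {score:.3f}')
```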
James Dellinger answered Jan 14 '23