I would like to know how to add extra features when I use a classifier, for example:
random_forest_bow = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
])
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])
I am also considering other features in the model. I defined X and y as follows:
X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)
My dataset looks as follows:

Text                               is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                 0               0             0              0
ANOTHER example of text            1               1             0              1
What's happening? Let's talk at 5  1               0             1              1
I would like to also use is_it_capital?, is_it_upper?, and contains_num? as features. Since they already have binary values (1 or 0, after encoding), I believe I should apply BoW only to Text to extract the extra features.
Maybe my question is obvious, but since I am a new ML learner and I am not familiar with classifiers and encoding, I would be thankful for any support and comments you can provide. Thanks
You can certainly use your "extra" features like is_it_capital?, is_it_upper?, and contains_num?. It seems you're struggling with how exactly to combine the two seemingly disparate feature sets. You could use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There's no reason you couldn't use your extra features in combination with whatever a text-feature extraction method (e.g. your BoW approach) produces.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

# vectorize the 'text' column and pass the binary columns through unchanged
transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)
print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
[0 0 0 1 1 0 1 1 1 1 0]
[1 0 2 0 0 0 1 1 0 0 1]
[0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())  # use get_feature_names_out() on newer scikit-learn versions
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
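For completeness, the sklearn.pipeline.FeatureUnion route mentioned above works too; it just needs explicit column selection, since FeatureUnion passes the whole input to every branch. A rough sketch (the selector lambdas are illustrative, not required API):
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# each branch selects its own columns; the union horizontally stacks the blocks
union = FeatureUnion([
    ('bow', Pipeline([
        ('select', FunctionTransformer(lambda d: d['text'], validate=False)),
        ('vectorize', CountVectorizer()),
    ])),
    ('binary', FunctionTransformer(lambda d: d[['is_it_upper', 'contains_num']].values, validate=False)),
])
X_union = union.fit_transform(df)  # same feature columns as the ColumnTransformer version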
More on your specific example:
X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need DenseTransformer to convert the sparse CountVectorizer output so it
# concatenates cleanly with the results of the other transformer steps
from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])
transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# the transformer returns NumPy arrays, so wrap them in DataFrames
# (and realign the label index) before concatenating
df_train = pd.concat([pd.DataFrame(X_train), y_train.reset_index(drop=True)], axis=1)
df_test = pd.concat([pd.DataFrame(X_test), y_test.reset_index(drop=True)], axis=1)
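If you don't need the intermediate DataFrames at all, you can also chain the transformer and the classifier from your question into a single estimator. A minimal sketch, reusing the transformer defined above and the RandomForestClassifier settings from your question:
from sklearn.ensemble import RandomForestClassifier

# fit on the raw (untransformed) split; the pipeline applies the encoding itself
clf = Pipeline([
    ('features', transformer),
    ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=40)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # mean accuracy on the held-out split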
What I found useful is to set up my transformations in a way that gives me total control: for each set of columns I perform a specific transformation, and at the end I union the transformations. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# boolean
boolean_features = ['is_it_capital?', 'is_it_upper?', 'contains_num?']
boolean_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ]
)
text_features = 'Text'
text_transformer = Pipeline(
steps=[('vectorizer', CountVectorizer())]
)
# merge all pipelines
preprocessor = ColumnTransformer(
transformers=[
('bool', boolean_transformer, boolean_features),
('text', text_transformer, text_features),
]
)
pipelines = Pipeline(
steps=[
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=300,n_jobs=3))
]
)
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42, stratify=y)
# now we can train our model
pipelines.fit(X_train, y_train)
pipelines.score(X_test, y_test)
# what is awesome is that using other tools like GridSearchCV becomes easy
params = {'model__n_estimators': [100, 200, 300], 'model__criterion': ['gini', 'entropy']}
clf = GridSearchCV(
pipelines, cv=5, n_jobs=-1, param_grid=params, scoring='roc_auc'
)
clf.fit(X_train, y_train)
# predict for totally unseen data
clf.predict(X_test)
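Once the search finishes, GridSearchCV exposes the winning configuration (the values shown in the comments are illustrative):
print(clf.best_params_)  # e.g. {'model__criterion': 'gini', 'model__n_estimators': 300}
print(clf.best_score_)   # mean cross-validated roc_auc of the best candidate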
If we have columns that need no transformation but should still be included, add remainder='passthrough':
# assumption: the code above no longer defines the boolean transformer branch
# ...
preprocessor = ColumnTransformer(
transformers=[
('text', text_transformer, text_features),
], remainder='passthrough'
)
#...
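An equivalent, more explicit option is to pass the boolean columns through by name, since ColumnTransformer also accepts the string 'passthrough' in place of a transformer:
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('bool', 'passthrough', boolean_features),
    ]
)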
See the scikit-learn documentation and usage examples.