I would like to know how to add extra features when I use a classifier, for example:
random_forest_bow = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
])
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])
I am also considering other features in the model. I defined X and y as follows:
X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)
My dataset looks as follows:

Text                               is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                 0               0             0              0
ANOTHER example of text            1               1             0              1
What's happening? Let's talk at 5  1               0             1              1
I would like to also use is_it_capital?, is_it_upper?, and contains_num? as features. Since they already have binary values (1 or 0, after encoding), I believe I should apply BoW only to Text to extract the extra features.
Maybe my question is obvious, but since I am a new ML learner and I am not familiar with classifiers and encoding, I would be thankful for any support and comments you can provide. Thanks
You can certainly use your "extra" features like is_it_capital?, is_it_upper?, and contains_num?. It seems you're struggling with how exactly to combine the two seemingly disparate feature sets. You could use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There's no reason you couldn't use your extra features in combination with whatever a text-feature extraction method (e.g. your BoW approach) produces.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

# vectorize the 'text' column and pass the binary columns through unchanged
transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)
print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
[0 0 0 1 1 0 1 1 1 1 0]
[1 0 2 0 0 0 1 1 0 0 1]
[0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())  # use get_feature_names_out() on newer scikit-learn versions
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
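For completeness, the sklearn.pipeline.FeatureUnion route mentioned above works too; it just needs explicit column selection, since FeatureUnion passes the whole input to every branch. A rough sketch (the selector lambdas are illustrative, not required API):
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# each branch selects its own columns; the union horizontally stacks the blocks
union = FeatureUnion([
    ('bow', Pipeline([
        ('select', FunctionTransformer(lambda d: d['text'], validate=False)),
        ('vectorize', CountVectorizer()),
    ])),
    ('binary', FunctionTransformer(lambda d: d[['is_it_upper', 'contains_num']].values, validate=False)),
])
X_union = union.fit_transform(df)  # same feature columns as the ColumnTransformer version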
More on your specific example:
X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need DenseTransformer to convert the sparse CountVectorizer output so it
# concatenates cleanly with the results of the other transformer steps
from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])
transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# the transformer returns NumPy arrays, so wrap them in DataFrames
# (and realign the label index) before concatenating
df_train = pd.concat([pd.DataFrame(X_train), y_train.reset_index(drop=True)], axis=1)
df_test = pd.concat([pd.DataFrame(X_test), y_test.reset_index(drop=True)], axis=1)
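If you don't need the intermediate DataFrames at all, you can also chain the transformer and the classifier from your question into a single estimator. A minimal sketch, reusing the transformer defined above and the RandomForestClassifier settings from your question:
from sklearn.ensemble import RandomForestClassifier

# fit on the raw (untransformed) split; the pipeline applies the encoding itself
clf = Pipeline([
    ('features', transformer),
    ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=40)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # mean accuracy on the held-out split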
What I found useful is to set up my transformations in a way that gives me total control: for each set of columns I perform a specific transformation, and at the end I union the transformations. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# boolean
boolean_features = ['is_it_capital?', 'is_it_upper?', 'contains_num?']
boolean_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ]
)
text_features = 'Text'
text_transformer = Pipeline(
steps=[('vectorizer', CountVectorizer())]
)
# merge all pipelines
preprocessor = ColumnTransformer(
transformers=[
('bool', boolean_transformer, boolean_features),
('text', text_transformer, text_features),
]
)
pipelines = Pipeline(
steps=[
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=300,n_jobs=3))
]
)
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42, stratify=y)
# now we can train our model
pipelines.fit(X_train, y_train)
pipelines.score(X_test, y_test)
# what is awesome is that using other tools like GridSearchCV becomes easy
params = {'model__n_estimators': [100, 200, 300], 'model__criterion': ['gini', 'entropy']}
clf = GridSearchCV(
pipelines, cv=5, n_jobs=-1, param_grid=params, scoring='roc_auc'
)
clf.fit(X_train, y_train)
# predict for totally unseen data
clf.predict(X_test)
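Once the search finishes, GridSearchCV exposes the winning configuration (the values shown in the comments are illustrative):
print(clf.best_params_)  # e.g. {'model__criterion': 'gini', 'model__n_estimators': 300}
print(clf.best_score_)   # mean cross-validated roc_auc of the best candidate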
If we have columns that need no transformation but should still be included, add remainder='passthrough':
# assumption: the code above no longer defines the boolean transformer branch
# ...
preprocessor = ColumnTransformer(
transformers=[
('text', text_transformer, text_features),
], remainder='passthrough'
)
#...
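An equivalent, more explicit option is to pass the boolean columns through by name, since ColumnTransformer also accepts the string 'passthrough' in place of a transformer:
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('bool', 'passthrough', boolean_features),
    ]
)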
See the scikit-learn documentation and usage examples.