
Do you need to scale Vectorizers in sklearn?

I have a set of custom features and a set of features created with Vectorizers, in this case TfidfVectorizer.

All of my custom features are simple np.arrays (e.g. [0, 5, 4, 22, 1]). I am using StandardScaler to scale all of my features, as you can see in my Pipeline by calling StandardScaler after my "custom pipeline". The question is whether there is a way, or a need, to scale the vectorizers I use in my "vectorized_pipeline". Applying StandardScaler to the vectorizer output doesn't seem to work (I get the following error: "ValueError: Cannot center sparse matrices").

Another question: is it sensible to scale all of my features after they have been joined in the FeatureUnion, or should I scale each of them separately (in my example, by calling the scaler in "pos_cluster" and "stylistic_features" separately instead of after the two have been joined)? Which is the better practice?

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

inner_scaler = StandardScaler()
# classifier
classifier = LinearSVC(tol=1e-4, C=0.1)

# ItemSelector, pos_cluster and stylistic_features are custom transformers
# defined elsewhere in my project (omitted here for brevity)

# vectorizers
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)


pipeline = Pipeline([
    ('union', FeatureUnion(
            transformer_list=[

            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([

                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),

                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),

                ])),
            ])),


            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([

                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),

                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ]))

                ])),
                ('inner_scale', inner_scaler)
            ])),

            ],

            # weight components in FeatureUnion
            # n_jobs=6,

            transformer_weights={
                'vectorized_pipeline': 0.8,  # 0.8,
                'custom_pipeline': 1.0  # 1.0
            },
    )),

    ('clf', classifier),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
asked Jan 06 '23 by Ivan Bilan


1 Answer

First things first:

Error "Cannot center sparse matrices"

The reason is quite simple - StandardScaler applies a feature-wise transformation:

f_i = (f_i - mean(f_i)) / std(f_i)

which for sparse matrices results in dense ones, since mean(f_i) is usually non-zero. In practice, only entries equal to their feature's mean end up being zero. scikit-learn does not want to do this silently, as it is a huge modification of your data that might cause failures in other parts of your code, huge memory usage, etc. How to deal with it? If you really want to do this, there are two options (both sketched below):

  • densify your matrix through .toarray(), which will require lots of memory, but will give you exactly what you expect
  • create a StandardScaler without mean, i.e. StandardScaler(with_mean=False), which will instead apply f_i = f_i / std(f_i), but will leave your data in sparse format.
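
A minimal sketch of both options on a toy corpus (an illustration only, not your exact data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = ['I am a sentence', 'an example', 'another sentence']
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse CSR matrix

# Option 1: densify first - exact standardisation, but memory-hungry for large vocabularies
dense_scaled = StandardScaler().fit_transform(tfidf.toarray())

# Option 2: skip centering - keeps the sparse format, only divides by the standard deviation
sparse_scaled = StandardScaler(with_mean=False).fit_transform(tfidf)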

Is scaling needed?

This is a whole other problem - usually, scaling (of any form) is just a heuristic. It is not something that you have to apply, and there is no guarantee that it will help; it is just a reasonable thing to do when you have no idea what your data looks like. "Smart" vectorizers such as tfidf are actually already doing that: the idf transformation is supposed to create a kind of reasonable data scaling. There is no guarantee which one will be better, but in general, tfidf should be enough - especially given that it still supports sparse computations, while StandardScaler (with centering) does not.
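
As a quick sanity check (a toy illustration of the point above): TfidfVectorizer L2-normalises each document vector by default (norm='l2'), so its output already lives on a comparable scale without any extra scaler:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer(sublinear_tf=True, use_idf=True).fit_transform(
    ['I am a sentence', 'an example', 'another sentence'])
print(np.sqrt(X.multiply(X).sum(axis=1)))  # every row has unit L2 norm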

answered Jan 10 '23 by lejlot