
ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error

I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns (["a"] in this example):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","gone with wind"],"c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])],remainder="passthrough")
clmn.fit_transform(dataset)

Which gives me:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Obviously, TfidfVectorizer can do fit_transform() on its own:

tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

What causes this error, and how can I fix it?

Sergey Bushmanov asked Feb 14 '19 16:02



2 Answers

That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:

A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

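Now, TfidfVectorizer requires a single iterable of strings as input (that is, a 1-d array of strings). But since you are passing a list of column names to ColumnTransformer (even though that list contains only one column), a 2-d array is passed to TfidfVectorizer. Iterating over that 2-d DataFrame yields its column labels rather than its rows, so the vectorizer sees the single document "a", which the default tokenizer discards because it is only one character long. Hence the "empty vocabulary" error.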

Change that to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")
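
Put together with the question's data, the corrected call can be checked end to end. This is a minimal sketch, assuming a recent scikit-learn and pandas; min_df is left at its default here, which is a choice made for this sketch and not part of the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# A scalar column name ("a", not ["a"]) makes ColumnTransformer hand the
# vectorizer a 1-D column of strings.
clmn = ColumnTransformer(
    [("tfidf", TfidfVectorizer(), "a")],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# One TF-IDF feature per distinct word, plus the passed-through column "c".
vocab = clmn.named_transformers_["tfidf"].vocabulary_
print(sorted(vocab))
print(result.shape)
```

The fitted vectorizer is reachable through named_transformers_, which is handy for inspecting the learned vocabulary.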

To understand the difference, try both forms of selection directly on a pandas DataFrame and check the type and shape of what comes back:

dataset['a']

vs 

dataset[['a']]
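
The comparison can be sketched as follows. The snippet only uses pandas and the standard re module; the regular expression is TfidfVectorizer's documented default token_pattern, copied here to show what happens to a one-character "document".

```python
import re

import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

one_d = dataset["a"]    # pandas Series, 1-D
two_d = dataset[["a"]]  # pandas DataFrame, 2-D with a single column

print(type(one_d).__name__, one_d.shape)
print(type(two_d).__name__, two_d.shape)

# Iterating a Series yields its values; iterating a DataFrame yields its
# column labels. A vectorizer that loops over its input therefore sees
# different "documents" in the two cases.
docs_from_series = list(one_d)
docs_from_frame = list(two_d)
print(docs_from_series)
print(docs_from_frame)

# Default token_pattern of TfidfVectorizer: tokens need 2+ word characters.
default_token_pattern = r"(?u)\b\w\w+\b"
tokens = re.findall(default_token_pattern, docs_from_frame[0])
print(tokens)
```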

Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to apply TF-IDF to two columns, you need to pass two transformers, one per column. Something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"), 
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")
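
A runnable version of the two-transformer setup might look like the sketch below. The second text column "b" and its contents are illustrative data, not part of the question, and min_df is left at its default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data: column "b" is added so there are two text columns.
dataset = pd.DataFrame({
    "a": ["word gone wild", "word gone with wind"],
    "b": ["gone fhgf wild", "gone with wind"],
    "c": [1, 2],
})

# One vectorizer per text column, each addressed by a scalar column name.
clmn = ColumnTransformer(
    [
        ("tfidf_a", TfidfVectorizer(), "a"),
        ("tfidf_b", TfidfVectorizer(), "b"),
    ],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Each vectorizer learns its own vocabulary from its own column.
vocab_a = clmn.named_transformers_["tfidf_a"].vocabulary_
vocab_b = clmn.named_transformers_["tfidf_b"].vocabulary_
print(sorted(vocab_a))
print(sorted(vocab_b))
print(result.shape)
```

Because the vocabularies are separate, a word that appears in both columns gets one feature per column.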
Vivek Kumar answered Sep 22 '22 02:09


We can create a custom TF-IDF transformer that takes a list of columns and joins them row by row before applying .fit() or .transform().

Try this!

from sklearn.base import BaseEstimator, TransformerMixin

class custom_tfidf(BaseEstimator, TransformerMixin):
    """Join the selected text columns row by row, then apply one shared vectorizer."""

    def __init__(self, tfidf):
        self.tfidf = tfidf

    def _join(self, X):
        # X is the DataFrame slice passed in by ColumnTransformer;
        # concatenate its string columns into one document per row.
        return X.apply(lambda row: ' '.join(row), axis=1)

    def fit(self, X, y=None):
        self.tfidf.fit(self._join(X))
        return self

    def transform(self, X):
        return self.tfidf.transform(self._join(X))

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","word gone with wind"],
                        "b":[" gone fhgf wild","gone with wind"],
                        "c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)

clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a','b'])],remainder="passthrough")
clmn.fit_transform(dataset)

# Output:
array([[0.36439074, 0.51853403, 0.72878149, 0.        , 0.        ,
        0.25926702, 1.        ],
       [0.        , 0.438501  , 0.        , 0.61629785, 0.61629785,
        0.2192505 , 2.        ]])

P.S.: You might instead want to create a separate TF-IDF vectorizer for each column and keep them in a dictionary, with the column name as the key and the fitted vectorizer as the value. That dictionary can then be used when transforming the corresponding columns.
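
The dictionary idea from the P.S. could be sketched roughly like this. PerColumnTfidf is a hypothetical name, and the design (cloning one template vectorizer per column and stacking the results horizontally) is one possible reading of the suggestion, not a definitive implementation.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class PerColumnTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fit one TF-IDF vectorizer per DataFrame column."""

    def __init__(self, tfidf=None):
        self.tfidf = tfidf  # optional template vectorizer, cloned per column

    def fit(self, X, y=None):
        template = self.tfidf if self.tfidf is not None else TfidfVectorizer()
        # Dictionary: column name -> vectorizer fitted on that column only.
        self.vectorizers_ = {
            col: clone(template).fit(X[col]) for col in X.columns
        }
        return self

    def transform(self, X):
        # Transform each column with its own vectorizer, then stack side by side.
        blocks = [vec.transform(X[col]) for col, vec in self.vectorizers_.items()]
        return sparse.hstack(blocks).tocsr()


dataset = pd.DataFrame({
    "a": ["word gone wild", "word gone with wind"],
    "b": ["gone fhgf wild", "gone with wind"],
})
per_column = PerColumnTfidf()
features = per_column.fit_transform(dataset)
print(sorted(per_column.vectorizers_))
print(features.shape)
```

Unlike the joined-columns transformer, this keeps a separate vocabulary per column, so the same word in "a" and "b" maps to two different features.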

Venkatachalam answered Sep 21 '22 02:09