I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns (["a"] in this example):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","gone with wind"],"c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])],remainder="passthrough")
clmn.fit_transform(dataset)
Which gives me:
ValueError: empty vocabulary; perhaps the documents only contain stop words
Obviously, TfidfVectorizer can do fit_transform() on its own:
tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
What could cause this error, and how can I fix it?
That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:
A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.
Now, TfidfVectorizer requires a single iterable of strings as input (that is, a 1-d array of strings). But since you are passing a list of column names to ColumnTransformer (even though that list contains only one column), a 2-d array is passed to TfidfVectorizer, hence the error.
Change that to:
clmn = ColumnTransformer([("tfidf", tfidf, "a")],
remainder="passthrough")
To see the difference, try both forms of selection on a pandas DataFrame and check the format (type, shape) of the returned data when you do:
dataset['a']
vs
dataset[['a']]
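As a minimal sketch of that comparison, using the same two-row dataset as in the question (the variable names one_d and two_d are just for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

one_d = dataset["a"]    # scalar key: a 1-d pandas Series
two_d = dataset[["a"]]  # list key: a 2-d pandas DataFrame with one column

print(type(one_d).__name__, one_d.shape)
print(type(two_d).__name__, two_d.shape)

# Iterating over a Series yields the cell values (the documents)...
print(list(one_d))
# ...while iterating over a DataFrame yields the column labels.
print(list(two_d))
```

The last line is the likely source of the "empty vocabulary" message: when handed the DataFrame, the vectorizer iterates over it and sees the single "document" "a", and the default token pattern only keeps tokens of two or more characters.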
Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to apply tf-idf to two columns, you need to pass two transformers, one per column. Something like this:
tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"),
                          ("tfidf_2", tfidf_2, "b")],
                         remainder="passthrough")
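For a runnable version of the two-transformer setup, here is a small sketch. The dataset is made up for illustration (the question's data has no "b" column), and the vectorizers use default settings:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: two text columns and one numeric column.
dataset = pd.DataFrame({
    "a": ["word gone wild", "word gone with wind"],
    "b": ["gone fhgf wild", "gone with wind"],
    "c": [1, 2],
})

clmn = ColumnTransformer(
    [
        ("tfidf_1", TfidfVectorizer(), "a"),  # scalar column name: 1-d input
        ("tfidf_2", TfidfVectorizer(), "b"),
    ],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Each vectorizer learns its own vocabulary from its own column.
vocab_a = clmn.named_transformers_["tfidf_1"].vocabulary_
vocab_b = clmn.named_transformers_["tfidf_2"].vocabulary_
print(sorted(vocab_a))
print(sorted(vocab_b))
print(result.shape)
```

The output has one block of tf-idf columns per text column, followed by the passed-through "c" column.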
We can create a custom tf-idf transformer that takes an array of columns and joins them into a single string per row before applying .fit() or .transform().
Try this!
from sklearn.base import BaseEstimator, TransformerMixin

class custom_tfidf(BaseEstimator, TransformerMixin):
    def __init__(self, tfidf):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        # Join the text columns of each row into one document
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)
        return self

    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        return self.tfidf.transform(joined_X)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","word gone with wind"],
"b":[" gone fhgf wild","gone with wind"],
"c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a','b'])],remainder="passthrough")
clmn.fit_transform(dataset)
#
array([[0.36439074, 0.51853403, 0.72878149, 0. , 0. ,
0.25926702, 1. ],
[0. , 0.438501 , 0. , 0.61629785, 0.61629785,
0.2192505 , 2. ]])
P.S.: Alternatively, you might want to create a tf-idf vectorizer for each column, then build a dictionary with the column name as key and the fitted vectorizer as value. This dictionary can then be used when transforming the corresponding columns.
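A rough sketch of that dictionary idea, outside of ColumnTransformer. The names text_columns, vectorizers and features are illustrative, and stacking the per-column results with scipy.sparse.hstack is one possible way to combine them, not something prescribed above:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({
    "a": ["word gone wild", "word gone with wind"],
    "b": [" gone fhgf wild", "gone with wind"],
    "c": [1, 2],
})
text_columns = ["a", "b"]

# Fit one vectorizer per text column, keyed by column name.
vectorizers = {col: TfidfVectorizer().fit(dataset[col]) for col in text_columns}

# At transform time, look up each column's fitted vectorizer
# and stack the resulting sparse matrices side by side.
features = hstack(
    [vectorizers[col].transform(dataset[col]) for col in text_columns]
)
print(features.shape)
```

Unlike the joined-columns transformer, this keeps a separate vocabulary per column, so the same word in "a" and in "b" ends up as two different features.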