Transformers for mixed data types

I'm having trouble applying different transformers to columns of different types (text vs. numerical) in a single step, and combining those transformers into one object for later use.

I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn't seem to work with text data.

TL;DR

How do you create a storable transformer that follows different pipelines for text and numerical data?

Data download and preparation

# imports
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Transforming numerical features: ok

Following the steps in the link above, one can create a transformer for the numerical features as follows:

# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)

# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape  # (1047, 2)
test_feature_set.shape  # (262, 2)

Transforming text features: ok

To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):

# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
                                   max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)

# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)

# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape  # (1047, 90)
test_feature_set.shape  # (262, 90)

How do you do both at once?

I've tried various strategies to combine the two procedures above into a single transformer, but each one fails with a different error.

Attempt 1: Follow documented strategy

Following the documentation (Column Transformer with Mixed Types) doesn't work once text data replaces categorical data:

# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)

This raises the following error:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3

Attempt 2: FeatureUnion on the lists of transformers

# create a list of numerical transformer, like those for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]

# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])

This raises the following error:

TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't

Attempt 3: ColumnTransformer on the lists of transformers

# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list

# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)

# fails
sum_preprocessor.fit(X_train)

This raises the following error:

ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

My question

How do I create a single object that can fit and transform data mixing text and numerical types?

Asked by kilgoretrout on Nov 07 '22 at 03:11


1 Answer

Short answer:

all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]

all_preprocessor = ColumnTransformer(transformers=all_transformers)

all_preprocessor.fit(X_train)
train_all = all_preprocessor.transform(X_train)
test_all = all_preprocessor.transform(X_test)

print(train_all.shape, test_all.shape)
# prints (1047, 92) (262, 92)
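
Since the question asks for a storable transformer: a fitted ColumnTransformer is an ordinary scikit-learn estimator, so it can be saved and reloaded like any other. Below is a minimal sketch using joblib on toy data. The DataFrame values and the file name preprocessor.joblib are made up for illustration, and sparse_threshold=0 is set only to force dense output so the comparison is simple.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the Titanic dataset)
toy = pd.DataFrame({
    "name": ["alice smith", "bob jones", "carol smith", "dan brown"],
    "age": [30.0, 41.0, 25.0, 58.0],
})

fitted = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(), "name"),   # single name: 1d input
        ("num", StandardScaler(), ["age"]),    # list of names: 2d input
    ],
    sparse_threshold=0,  # always return a dense array
).fit(toy)

# Save the fitted transformer to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessor.joblib")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object transforms data exactly like the original
same = np.allclose(fitted.transform(toy), restored.transform(toy))
print(same)  # True
```

A saved transformer should generally be reloaded with the same scikit-learn version that created it.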

The difficulty here is that text transformers such as TfidfVectorizer expect 1-dimensional input (an iterable of documents), whereas numerical transformers such as SimpleImputer and StandardScaler expect 2-dimensional input. ColumnTransformer accommodates both through the column specifier: a single column name passes a 1d array to the transformer, while a list of column names passes a 2d array.
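
To make the 1d vs. 2d distinction concrete, here is a small sketch on toy data (the DataFrame values are made up and stand in for the Titanic columns). The text column is selected with a string and the numerical column with a list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the Titanic dataset)
toy = pd.DataFrame({
    "name": ["alice smith", "bob jones", "carol smith", "dan brown"],
    "age": [30.0, 41.0, 25.0, 58.0],
})

# A string selector yields a 1d Series, a list selector a 2d DataFrame
print(toy["name"].ndim)    # 1
print(toy[["age"]].ndim)   # 2

toy_preprocessor = ColumnTransformer(transformers=[
    ("text", TfidfVectorizer(), "name"),   # string: 1d input, one document per row
    ("num", StandardScaler(), ["age"]),    # list: 2d input with one column
])
toy_features = toy_preprocessor.fit_transform(toy)

# 7 distinct tokens from "name" plus 1 scaled numerical column
print(toy_features.shape)  # (4, 8)
```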

So, to explain the errors in the three attempts:

Attempt 1: The TF-IDF vectorizer receives a 2d array (all three text columns at once) and treats the three columns, rather than the individual rows, as the documents, so it produces only three output rows. Concatenating that with the 1047-row numerical output then fails.
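
The Attempt 1 behaviour can be reproduced outside ColumnTransformer. In this sketch (toy values, made up for illustration), the vectorizer is fitted once on a two-column DataFrame and once on a single column:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical values, not the Titanic dataset)
toy_text = pd.DataFrame({
    "name": ["alice smith", "bob jones", "carol smith", "dan brown"],
    "cabin": ["c85", "", "e46", "b28"],
})

# Iterating over a DataFrame yields its column labels, not its rows
print(list(toy_text))  # ['name', 'cabin']

# 2d input: the vectorizer only sees the column labels as "documents"
vectorizer_2d = TfidfVectorizer()
out_2d = vectorizer_2d.fit_transform(toy_text)
print(out_2d.shape[0])                    # 2
print(sorted(vectorizer_2d.vocabulary_))  # ['cabin', 'name']

# 1d input: one document per row, as intended
vectorizer_1d = TfidfVectorizer()
out_1d = vectorizer_1d.fit_transform(toy_text["name"])
print(out_1d.shape[0])                    # 4
```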

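Attempt 2: FeatureUnion doesn't take the same input format as ColumnTransformer: it expects (name, transformer) pairs, not (name, transformer, columns) triples, and here it was also given two lists rather than one flat list of pairs. In any case, FeatureUnion applies every transformer to the same input, so it isn't meant for the per-column processing you're doing here.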
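
For comparison, this is roughly what FeatureUnion is designed for: a flat list of (name, transformer) pairs, each applied to the whole input, with the outputs stacked side by side. The array values and the choice of scalers are only illustrative:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numerical data (hypothetical values): 4 rows, 2 columns
toy_numeric = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.93],
    [35.0, 53.10],
])

# Pairs, not triples: there is no per-column selection here
union = FeatureUnion(transformer_list=[
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
])
union_features = union.fit_transform(toy_numeric)

# Each transformer sees both columns, so the result has 2 + 2 columns
print(union_features.shape)  # (4, 4)
```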

Attempt 3: This time the numerical columns are specified as single names, so 1d data is sent to the numerical transformers, which expect 2d data.
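
If you did want one numerical transformer per column, as in Attempt 3, wrapping each column name in a list should be enough. A minimal sketch on toy data (values made up for illustration), showing the failing and the working form:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the Titanic dataset)
toy_num = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Column given as a string: StandardScaler receives 1d data and rejects it
failing = ColumnTransformer(transformers=[("age_scaler", StandardScaler(), "age")])
try:
    failing.fit(toy_num)
    attempt3_error = None
except ValueError as err:
    attempt3_error = err
print(type(attempt3_error).__name__)  # ValueError

# Column given as a one-element list: StandardScaler receives 2d data
working = ColumnTransformer(transformers=[
    (col + "_scaler", StandardScaler(), [col]) for col in ["age", "fare"]
])
scaled = working.fit_transform(toy_num)
print(scaled.shape)  # (4, 2)
```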

Answered by Ben Reiniger on Nov 12 '22 at 17:11