What is the difference between FeatureUnion() and ColumnTransformer() in sklearn?
Which should I use if I want to build a supervised model with features of mixed data types (categorical, numeric, unstructured text), where I need to combine separate pipelines?
source: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
source: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
According to the sklearn documentation:
FeatureUnion: Concatenates results of multiple transformer objects. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
ColumnTransformer: Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
So, FeatureUnion applies different transformers to the whole of the input data and then combines the results by concatenating them.
ColumnTransformer, on the other hand, applies different transformers to different subsets of the whole input data, and again concatenates the results.
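To make the distinction concrete, here is a minimal sketch on a made-up two-column array (the data and the choice of scalers are assumptions for illustration only). With FeatureUnion each scaler sees both columns; with ColumnTransformer each scaler sees only the column assigned to it.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy input: 3 rows, 2 numeric columns (values are arbitrary).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# FeatureUnion: every transformer receives both columns,
# so the output has 2 + 2 = 4 columns.
union = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
X_union = union.fit_transform(X)

# ColumnTransformer: each transformer receives only its own column,
# so the output has 1 + 1 = 2 columns.
columns = ColumnTransformer(
    [("std", StandardScaler(), [0]), ("minmax", MinMaxScaler(), [1])]
)
X_columns = columns.fit_transform(X)

print(X_union.shape, X_columns.shape)
```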
For the case you describe, ColumnTransformer should be the first step. Then, once all the columns have been converted to numeric features, you could use FeatureUnion to transform them further, e.g., by combining PCA and SelectKBest.
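A rough sketch of that layout follows. The DataFrame, its column names (age, city, review), the labels, and the final LogisticRegression are all made up for illustration; only the ColumnTransformer-then-FeatureUnion structure comes from the answer. sparse_threshold=0 is used here so that the ColumnTransformer returns a dense array, which keeps the PCA branch simple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric, one categorical, one text column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "city": ["Oslo", "Rome", "Oslo", "Lima", "Rome", "Lima"],
    "review": ["great product", "bad service", "great service",
               "bad product", "really great", "really bad"],
})
y = [1, 0, 1, 0, 1, 0]

# Step 1: one transformer per column type.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        # A bare string (not a list) passes a 1-D column of text to the vectorizer.
        ("txt", TfidfVectorizer(), "review"),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Step 2: both branches see the full numeric matrix from step 1.
combine = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=2)),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("combine", combine),
    ("clf", LogisticRegression()),
])
model.fit(df, y)

features = model[:-1].transform(df)  # 2 PCA components + 2 selected features
predictions = model.predict(df)
print(features.shape)
```

Whether PCA plus SelectKBest is a sensible combination depends on the data; the point is only that FeatureUnion operates on the already-numeric output of the ColumnTransformer.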
Finally, you could certainly use FeatureUnion in place of ColumnTransformer, but each branch would then need a column/type selector that feeds only the columns of interest to the next transformer down the pipeline, as explained here: https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/
However, ColumnTransformer does exactly that and in a simpler way.
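As an illustration of the two approaches side by side, the sketch below builds the same features twice on a made-up DataFrame. The select helper is hypothetical (a FunctionTransformer wrapping a column lookup, not the selector class from the linked post), and to_dense is only there so the two outputs can be compared regardless of whether they come back sparse or dense.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy data.
df = pd.DataFrame({"age": [22, 35, 58], "city": ["Oslo", "Rome", "Oslo"]})


def select(columns):
    """Hypothetical helper: a transformer that keeps only the named columns."""
    return FunctionTransformer(lambda frame: frame[columns])


def to_dense(matrix):
    """Convert a sparse result to a dense array so results can be compared."""
    return matrix.toarray() if sparse.issparse(matrix) else np.asarray(matrix)


# FeatureUnion: each branch must start with its own column selector.
union = FeatureUnion([
    ("num", make_pipeline(select(["age"]), StandardScaler())),
    ("cat", make_pipeline(select(["city"]), OneHotEncoder())),
])

# ColumnTransformer: the column selection is part of the specification.
columns = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])

X_union = to_dense(union.fit_transform(df))
X_columns = to_dense(columns.fit_transform(df))
same = np.allclose(X_union, X_columns)
print(same)
```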
Both of these classes combine independent transformers into a single transformer. By independent I mean transformers that don't need to be executed in a defined order: unlike in a regular pipeline, one transformer is not applied to the output of another.
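A small sketch of that contrast, using made-up data and an arbitrary pair of transformers: the same two steps are first chained in a Pipeline, then placed side by side in a FeatureUnion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Toy input: 4 rows, 2 columns (values are arbitrary).
X = np.array([[1.0, 10.0], [2.0, 25.0], [3.0, 27.0], [4.0, 48.0]])


def make_steps():
    """Return fresh, unfitted copies of the same two named transformers."""
    return [("scale", StandardScaler()), ("pca", PCA(n_components=1))]


# Pipeline: PCA is applied to the scaler's output, so only 1 column remains.
chained = Pipeline(make_steps()).fit_transform(X)

# FeatureUnion: scaler and PCA each get the raw X; outputs are concatenated
# into 2 + 1 = 3 columns.
branched = FeatureUnion(make_steps()).fit_transform(X)

print(chained.shape, branched.shape)
```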
The main difference is that each transformer in a FeatureUnion gets the whole data as input, while each transformer in a ColumnTransformer gets only part of the data. Both concatenate the results of their transformers at the end, and both can run the transformers in parallel (via the n_jobs parameter).