
PySpark: combining output of two VectorAssemblers

Using PySpark, I have created two VectorAssemblers: the first takes several numeric columns ('colA', 'colB', 'colC'), and the second takes several categorical columns ('colD', 'colE'), each of which I indexed and then encoded with OneHotEncoder.

I can build these two VectorAssemblers separately. How can I combine their outputs into a single vector column, so that I can feed it into an XGBoost model?

I tried the following, but got "TypeError: can only concatenate str (not "list") to str"

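from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# my dataframe with all columns is df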

# VectorAssembler 1: with 3 numeric columns 
numeric_cols = ['colA', 'colB', 'colC']
assembler = VectorAssembler(
    inputCols= numeric_cols,
    outputCol="numericFeatures"
)


# VectorAssembler 2: with 2 categorical columns
categ_cols = ['colD', 'colE']
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in categ_cols
]
encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol())) 
    for indexer in indexers
]
assemblerCateg = VectorAssembler(
    inputCols = [encoder.getOutputCol() for encoder in encoders],
    outputCol = "categFeatures"
)


pipeline = Pipeline(stages = [assembler] + indexers + encoders + [assemblerCateg])
df2 = pipeline.fit(df).transform(df)
YaleBD asked Oct 31 '25 17:10


1 Answer

Solved it! Add a third VectorAssembler that takes the two assembled columns as its inputs, and append it as the last stage of the pipeline:

assemblerAll = VectorAssembler(inputCols= ["numericFeatures", "categFeatures"], outputCol="allFeatures")
pipeline = Pipeline(stages = [assembler] + indexers + encoders + [assemblerCateg] + [assemblerAll])
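
About the TypeError from the question: the line that raised it is not shown, so this is only a guess, but that exact message is what Python produces when a str is on the left of + and a list is on the right, for example when a single output column name is added to a list of column names. A minimal plain-Python sketch of that situation (the variable names are made up for illustration, and no Spark is needed to run it):

```python
# Hypothetical names for illustration; they mirror the output columns above.
numeric_out = "numericFeatures"   # a single column name (str)
categ_out = ["categFeatures"]     # a list of column names

try:
    combined = numeric_out + categ_out   # str + list is not allowed
except TypeError as exc:
    message = str(exc)

# Wrapping the string in a list makes it list + list, which works.
combined = [numeric_out] + categ_out
```

The resulting list has the same shape as the value passed to inputCols in assemblerAll, which is why listing the two column names directly avoids the problem.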
YaleBD answered Nov 02 '25 08:11
