My Python version is 3.6.3 and my Spark version is 2.2.1. Here is my code:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Data Preprocessor") \
    .config("spark.some.config.option", "1") \
    .getOrCreate()
dataset = spark.createDataFrame(
    [(0, 59.0, 0.0, Vectors.dense([2.0, 0.0, 0.0, 0.0, 0.0,
                                   0.0, 0.0, 9.0, 9.0, 9.0]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(inputCols=["hour", "mobile", "userFeatures"],
                            outputCol="features")
output = assembler.transform(dataset)
output.select("features").show(truncate=False)
Instead of getting a single vector, I am getting the following output:
(12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0])
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
The vector returned by VectorAssembler is a single vector; it is just stored in SparseVector form. In that representation, 12 is the size of the vector, [0,2,9,10,11] are the indices of the non-zero entries, and [59.0,2.0,9.0,9.0,9.0] are the corresponding values. All other positions are implicitly 0.0.
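To see what the sparse output encodes, you can expand the `(size, indices, values)` triple by hand. This is a plain-Python sketch (no Spark required); pyspark's `SparseVector.toArray()` performs the same expansion:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The row from the question: hour=59.0, mobile=0.0, then the 10 userFeatures
print(sparse_to_dense(12, [0, 2, 9, 10, 11], [59.0, 2.0, 9.0, 9.0, 9.0]))
# [59.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]
```

The result is exactly the dense vector you expected: hour at index 0, mobile at index 1 (zero, so omitted from the sparse form), followed by the ten userFeatures values.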