My Python version is 3.6.3 and my Spark version is 2.2.1. Here is my code:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Data Preprocessor") \
    .config("spark.some.config.option", "1") \
    .getOrCreate()
dataset = spark.createDataFrame(
    [(0, 59.0, 0.0, Vectors.dense([2.0, 0.0, 0.0, 0.0, 0.0,
                                   0.0, 0.0, 9.0, 9.0, 9.0]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(inputCols=["hour", "mobile", "userFeatures"],
                            outputCol="features")
output = assembler.transform(dataset)
output.select("features").show(truncate=False)
Instead of getting a single vector, I am getting the following output:
(12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0])
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
The vector returned by VectorAssembler is a single vector; it is just stored in SparseVector form. In that representation, 12 is the size of the vector, [0,2,9,10,11] are the indices of the non-zero entries, and [59.0,2.0,9.0,9.0,9.0] are the corresponding values. All other positions are implicitly 0.0.
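To see what the sparse output encodes, you can expand the `(size, indices, values)` triple by hand. This is a plain-Python sketch (no Spark required); pyspark's `SparseVector.toArray()` performs the same expansion:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The row from the question: hour=59.0, mobile=0.0, then the 10 userFeatures
print(sparse_to_dense(12, [0, 2, 9, 10, 11], [59.0, 2.0, 9.0, 9.0, 9.0]))
# [59.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]
```

The result is exactly the dense vector you expected: hour at index 0, mobile at index 1 (zero, so omitted from the sparse form), followed by the ten userFeatures values.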