Understanding Representation of Vector Column in Spark SQL

Tags: apache-spark, apache-spark-sql, apache-spark-ml, apache-spark-mllib

Before I used VectorAssembler() to consolidate some one-hot-encoded categorical features, my data frame looked like this:

|  Numerical|     HotEncoded1|    HotEncoded2|
|    14460.0|  (44,[5],[1.0])|  (3,[0],[1.0])|
|    14460.0|  (44,[9],[1.0])|  (3,[0],[1.0])|
|    15181.0|  (44,[1],[1.0])|  (3,[0],[1.0])|

The first column is numerical, and the other two columns hold the one-hot-encoded categorical features. After applying VectorAssembler(), my output becomes:

[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]

I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!

asked Jul 07 '16 01:07

user2253546

1 Answer

This output is not specific to VectorAssembler. It is just the string representation of org.apache.spark.ml.linalg.SparseVector (org.apache.spark.mllib.linalg.SparseVector in Spark < 2.0), where:

the leading number is the length of the vector
the first set of numbers in brackets is the list of indices of the non-zero entries
the second set of numbers in brackets is the list of values at those indices

So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries:

14460.0 at the 0th position
1.0 at the 1st position
1.0 at the 9th position
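To make the reading above concrete, here is a minimal sketch in plain Python that expands such a (length, indices, values) triple into an ordinary dense list. The helper name `sparse_to_dense` is hypothetical and is not part of Spark; in PySpark itself, `SparseVector.toArray()` serves the same purpose.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats.

    Hypothetical helper for illustration only; it mirrors how the printed
    form of a sparse vector is read, not Spark's internal implementation.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size            # every position starts as zero
    for i, v in zip(indices, values):
        dense[i] = v                # indices are zero-based
    return dense


# The first row of the assembled output from the question.
row = sparse_to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])

print(len(row))   # total number of slots in the vector
print(row[:10])   # the first ten slots, zeros included
```

Every position not listed in the index array is zero, which is why the sparse form is so much shorter than the dense one.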

Much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar; together they account for the assembled length, 1 + 44 + 3 = 48. Without seeing the metadata and constructors it is not possible to say much more, but the encoded variables should have either 44 and 3 or 45 and 4 levels (depending on the dropLast parameter).
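As a rough illustration of where the indices in an assembled vector come from, the sketch below concatenates a scalar and two sparse triples in column order, shifting each block's indices by the combined size of the blocks before it. The function `assemble` is a hypothetical, simplified model in plain Python, not VectorAssembler's actual code, and the input row is taken from the question's first table row only as an example.

```python
def assemble(columns):
    """Concatenate scalars and (size, indices, values) triples in order.

    Hypothetical, simplified model of column concatenation; it returns
    the result as a (size, indices, values) triple.
    """
    indices, values, offset = [], [], 0
    for col in columns:
        if isinstance(col, (int, float)):
            # A scalar occupies exactly one slot; a zero is not stored.
            if col != 0:
                indices.append(offset)
                values.append(float(col))
            offset += 1
        else:
            size, idx, vals = col
            # Shift this block's indices past everything assembled so far.
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
    return offset, indices, values


# Numerical, HotEncoded1, HotEncoded2 from the first table row.
assembled = assemble([14460.0, (44, [5], [1.0]), (3, [0], [1.0])])
print(assembled)  # (48, [0, 6, 45], [14460.0, 1.0, 1.0])
```

Under this model, slot 0 holds Numerical, slots 1 to 44 hold HotEncoded1, and slots 45 to 47 hold HotEncoded2, so an index can be mapped back to its source column by checking which range it falls in.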

answered Sep 19 '22 16:09

zero323
