Understanding Representation of Vector Column in Spark SQL

Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:

|  Numerical|     HotEncoded1|    HotEncoded2|
|    14460.0|  (44,[5],[1.0])|  (3,[0],[1.0])|
|    14460.0|  (44,[9],[1.0])|  (3,[0],[1.0])|
|    15181.0|  (44,[1],[1.0])|  (3,[0],[1.0])|

The first column is numerical and the other two columns hold the OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:

[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]

I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!

user2253546 asked Jul 07 '16 01:07


1 Answer

This output is not specific to VectorAssembler. It is just the string representation of org.apache.spark.ml.linalg.SparseVector (org.apache.spark.mllib.linalg.SparseVector in Spark < 2.0), where:

  • the leading number is the length of the vector
  • the first set of numbers in brackets is the list of non-zero indices
  • the second set of numbers in brackets is the list of values at those indices

So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries:

  • 14460.0 at the 0th position
  • 1.0 at the 1st position
  • 1.0 at the 9th position
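
To make the notation concrete, here is a minimal plain-Python sketch that expands such a triple into an ordinary dense list. The helper name sparse_to_dense is made up for illustration and is not part of Spark; in PySpark itself, calling toArray() on the vector gives the dense form.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the stored entries
    return dense


dense = sparse_to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])
print(len(dense))    # 48
print(dense[:10])    # [14460.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

All other 45 positions are zero, which is why Spark prints the compact sparse form instead.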

Pretty much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to tell much more, but the encoded variables should have either 44 and 3 levels or 45 and 4 levels (depending on the dropLast parameter of the encoder).
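
The length of 48 follows from the inputs: one slot for the scalar, then 44 and 3 slots for the two encoded vectors, concatenated in column order. The following is a rough plain-Python sketch of that index arithmetic, not Spark's actual implementation; the assemble function and its tuple representation are assumptions made for illustration.

```python
def assemble(*parts):
    """Concatenate scalars and (size, indices, values) triples into one triple.

    Hypothetical illustration of the offset arithmetic; not a Spark API.
    """
    offset = 0
    indices, values = [], []
    for part in parts:
        if isinstance(part, (int, float)):
            # A scalar occupies exactly one slot of the assembled vector.
            if part != 0:
                indices.append(offset)
                values.append(float(part))
            offset += 1
        else:
            size, idx, vals = part
            # Shift each index by the number of slots already used.
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
    return (offset, indices, values)


row = assemble(14460.0, (44, [5], [1.0]), (3, [0], [1.0]))
print(row)   # (48, [0, 6, 45], [14460.0, 1.0, 1.0])
```

Under this reading, index 0 is Numerical, indices 1 to 44 belong to HotEncoded1 and indices 45 to 47 belong to HotEncoded2.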

zero323 answered Sep 19 '22 16:09