Before I used VectorAssembler() to consolidate some OneHotEncoded categorical features, my data frame looked like this:
| Numerical|    HotEncoded1|   HotEncoded2|
|   14460.0| (44,[5],[1.0])| (3,[0],[1.0])|
|   14460.0| (44,[9],[1.0])| (3,[0],[1.0])|
|   15181.0| (44,[1],[1.0])| (3,[0],[1.0])|
The first column is a numerical column and the other two columns represent the transformed data set for OneHotEncoded categorical features. After applying VectorAssembler(), my output becomes:
[(48,[0,1,9],[14460.0,1.0,1.0])]
[(48,[0,3,25],[12827.0,1.0,1.0])]
[(48,[0,1,18],[12828.0,1.0,1.0])]
I am unsure of what these numbers mean and cannot make sense of this transformed data set. Some clarification on what this output means would be great!
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
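To make the "combines a list of columns into a single vector" step concrete, here is a minimal pure-Python sketch of the concatenation. It is not Spark's implementation: the `assemble` helper and the `(size, indices, values)` tuple layout are hypothetical stand-ins for illustration, and the columns are assumed to be assembled in the order Numerical, HotEncoded1, HotEncoded2.

```python
def assemble(columns):
    """Concatenate scalars and sparse vectors into one sparse vector.

    Each column is either a number (a scalar occupying one slot) or a
    (size, indices, values) tuple describing a sparse vector.
    Hypothetical helper for illustration; not part of Spark.
    """
    offset = 0
    indices, values = [], []
    for col in columns:
        if isinstance(col, tuple):
            size, idx, vals = col
            # Shift this vector's indices by the slots already used.
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
        else:
            # A scalar takes exactly one slot; zeros are not stored.
            if col != 0.0:
                indices.append(offset)
                values.append(float(col))
            offset += 1
    return (offset, indices, values)


# First input row from the table: one scalar plus two one-hot vectors.
row = [14460.0, (44, [5], [1.0]), (3, [0], [1.0])]
assembled = assemble(row)
print(assembled)  # (48, [0, 6, 45], [14460.0, 1.0, 1.0])
```

The total length is 1 + 44 + 3 = 48, which matches the length of the assembled vectors shown in the question. Under the assumed column order, slot 0 holds the numerical value, slots 1 to 44 hold HotEncoded1, and slots 45 to 47 hold HotEncoded2.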
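This output is not specific to VectorAssembler. It is just a string representation of o.a.s.ml.linalg.SparseVector (o.a.s.mllib.linalg.SparseVector in Spark < 2.0), where the leading number is the length of the vector, the first array lists the indices of the non-zero entries, and the second array lists the corresponding values.
So (48,[0,1,9],[14460.0,1.0,1.0]) represents a vector of length 48, with three non-zero entries: 14460.0 at index 0, 1.0 at index 1, and 1.0 at index 9.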
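To see those entries concretely, the following sketch expands the (size, indices, values) triple into an ordinary dense list. The `to_dense` helper is a hypothetical name used only for illustration; it is plain Python and does not depend on Spark.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = to_dense(48, [0, 1, 9], [14460.0, 1.0, 1.0])
nonzero = {i: v for i, v in enumerate(dense) if v != 0.0}

print(len(dense))  # 48
print(nonzero)     # {0: 14460.0, 1: 1.0, 9: 1.0}
```

In PySpark, `Vectors.sparse(48, [0, 1, 9], [14460.0, 1.0, 1.0]).toArray()` from `pyspark.ml.linalg` should give the same dense values as a NumPy array.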
Pretty much the same description applies to HotEncoded1 and HotEncoded2, while Numerical is just a scalar. Without seeing the metadata and constructors it is not possible to tell much more, but the encoded variables should have either 44 and 3 levels (with dropLast disabled) or 45 and 4 levels (with dropLast enabled, which is the default).
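The effect of dropLast on the vector size can be sketched as follows. This is a simplified pure-Python model rather than Spark's encoder, and `one_hot` is a hypothetical helper: with drop_last enabled the vector has one slot fewer than the number of levels, and the last level is encoded as an all-zero vector.

```python
def one_hot(index, num_levels, drop_last=True):
    """Return a (size, indices, values) triple for a category index.

    Simplified, hypothetical model of one-hot encoding with an
    optional dropped last level; not Spark's implementation.
    """
    if not 0 <= index < num_levels:
        raise ValueError("index out of range")
    size = num_levels - 1 if drop_last else num_levels
    if index < size:
        return (size, [index], [1.0])
    # The dropped last level is represented by all zeros.
    return (size, [], [])


print(one_hot(5, 45))                   # (44, [5], [1.0])
print(one_hot(5, 44, drop_last=False))  # (44, [5], [1.0])
print(one_hot(44, 45))                  # (44, [], [])
```

Both the 45-level case with drop_last enabled and the 44-level case with it disabled produce a vector of size 44, which is why the number of levels cannot be determined from the printed vector alone.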