I'm trying to understand DataFrame column types. Of course, a DataFrame is not a materialized object; it's just a set of instructions for Spark, to be converted into code in the future. But I imagined that this list of types represents the types of objects that might materialize inside the JVM when an action is executed.
import pyspark
import pyspark.ml.linalg
import pyspark.mllib.linalg
import pyspark.sql.types as T
import pyspark.sql.functions as F
data = [0, 3, 0, 4]
d = {}
d['DenseVector'] = pyspark.ml.linalg.DenseVector(data)
d['old_DenseVector'] = pyspark.mllib.linalg.DenseVector(data)
d['SparseVector'] = pyspark.ml.linalg.SparseVector(4, dict(enumerate(data)))
d['old_SparseVector'] = pyspark.mllib.linalg.SparseVector(4, dict(enumerate(data)))
df = spark.createDataFrame([d])
df.printSchema()
The columns for the four vector values look the same in printSchema() (or schema):
root
|-- DenseVector: vector (nullable = true)
|-- SparseVector: vector (nullable = true)
|-- old_DenseVector: vector (nullable = true)
|-- old_SparseVector: vector (nullable = true)
But when I retrieve them row by row, they turn out to be different:
for x in df.first().asDict().items():
    print(x[0], type(x[1]))
old_SparseVector <class 'pyspark.mllib.linalg.SparseVector'>
SparseVector <class 'pyspark.ml.linalg.SparseVector'>
old_DenseVector <class 'pyspark.mllib.linalg.DenseVector'>
DenseVector <class 'pyspark.ml.linalg.DenseVector'>
I'm confused about the meaning of the vector type (equivalent to VectorUDT for purposes of writing a UDF). How does the DataFrame know which of the four vector types it has in each vector column? Is the data in those vector columns stored in the JVM or in the Python VM? And how come VectorUDT can be stored in the DataFrame if it's not one of the official types listed here?
(I know that two of the four vector types, from mllib.linalg, will eventually be deprecated.)
how come VectorUDT can be stored in the DataFrame
UDT a.k.a. User Defined Type should be a hint here. Spark provides a (now private) mechanism to store custom objects in a DataFrame. You can check my answer to How to define schema for custom type in Spark SQL? or the Spark source for details, but long story short it is all about deconstructing objects and encoding them as Catalyst types.
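To make this concrete, here is a minimal sketch (assuming PySpark 2.x, where pyspark.ml.linalg.VectorUDT is available; the exact tuples may differ slightly by version) that calls the UDT's serialize method directly to see how a vector gets deconstructed into a Catalyst-friendly tuple:
import pyspark.ml.linalg as ml

udt = ml.VectorUDT()

# A dense vector is deconstructed as (type=1, size=None, indices=None, values=[...])
print(udt.serialize(ml.DenseVector([0.0, 3.0, 0.0, 4.0])))
## (1, None, None, [0.0, 3.0, 0.0, 4.0])

# A sparse vector is deconstructed as (type=0, size, indices=[...], values=[...])
print(udt.serialize(ml.SparseVector(4, {1: 3.0, 3: 4.0})))
## (0, 4, [1, 3], [3.0, 4.0])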
I'm confused about the meaning of vector type
Most likely because you're looking at the wrong thing. The short description is useful, but it doesn't determine types. Instead you should check the schema. Let's create another data frame:
import pyspark.mllib.linalg as mllib
import pyspark.ml.linalg as ml
df = sc.parallelize([
(mllib.DenseVector([1, ]), ml.DenseVector([1, ])),
(mllib.SparseVector(1, [0, ], [1, ]), ml.SparseVector(1, [0, ], [1, ]))
]).toDF(["mllib_v", "ml_v"])
df.show()
## +-------------+-------------+
## | mllib_v| ml_v|
## +-------------+-------------+
## | [1.0]| [1.0]|
## |(1,[0],[1.0])|(1,[0],[1.0])|
## +-------------+-------------+
and check data types:
{s.name: type(s.dataType) for s in df.schema}
## {'ml_v': pyspark.ml.linalg.VectorUDT,
## 'mllib_v': pyspark.mllib.linalg.VectorUDT}
As you can see, UDT types are fully qualified, so there is no confusion here.
How does the DataFrame know which of the four vector types it has in each vector column?
As shown above, the DataFrame knows only its schema and can distinguish between ml / mllib types, but doesn't care about the vector variant (sparse or dense).
The vector type is determined by its type field (a byte field, 0 -> sparse, 1 -> dense) but the overall schema is the same. Moreover, there is no difference in internal representation between ml and mllib.
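If you want to see that type byte for yourself, you can inspect the Catalyst struct behind the UDT. A quick sketch (field names as they appear in current PySpark releases; dense and sparse vectors share the same struct):
import pyspark.ml.linalg as ml
import pyspark.mllib.linalg as mllib

# The backing struct carries a "type" byte plus size/indices/values fields,
# used for both dense and sparse vectors.
print(ml.VectorUDT().sqlType().simpleString())
## struct<type:tinyint,size:int,indices:array<int>,values:array<double>>

# The mllib UDT is backed by the same struct, hence the identical internal layout.
print(mllib.VectorUDT().sqlType() == ml.VectorUDT().sqlType())
## True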
Is the data in those vector columns stored in the JVM or in Python?
DataFrame is a pure JVM entity. Python interoperability is achieved by coupled UDT classes:
- the Scala UDT may define a pyUDT attribute.
- the Python UDT may define a scalaUDT attribute.
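As a small illustration of that coupling (a sketch based on the attribute names used by the bundled vector UDTs), the Python side of each UDT knows the fully qualified name of its Scala counterpart:
import pyspark.ml.linalg as ml
import pyspark.mllib.linalg as mllib

# Each Python UDT points at the JVM class that handles the actual storage.
print(ml.VectorUDT().scalaUDT())
## org.apache.spark.ml.linalg.VectorUDT
print(mllib.VectorUDT().scalaUDT())
## org.apache.spark.mllib.linalg.VectorUDT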