I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say the first element?
I've tried doing the following:

    from pyspark.sql.functions import udf

    first_elem_udf = udf(lambda row: row.values[0])
    df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode(), but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert the output to float:

    from pyspark.sql.types import DoubleType
    from pyspark.sql.functions import lit, udf

    def ith_(v, i):
        try:
            return float(v[i])
        except ValueError:
            return None

    ith = udf(ith_, DoubleType())
Example usage:
    from pyspark.ml.linalg import Vectors

    df = sc.parallelize([
        (1, Vectors.dense([1, 2, 3])),
        (2, Vectors.sparse(3, [1], [9]))
    ]).toDF(["id", "features"])

    df.select(ith("features", lit(1))).show()

    ## +-----------------+
    ## |ith_(features, 1)|
    ## +-----------------+
    ## |              2.0|
    ## |              9.0|
    ## +-----------------+
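If you are on Spark 3.0 or later, you can skip the Python UDF entirely with the built-in pyspark.ml.functions.vector_to_array. A minimal sketch, assuming Spark >= 3.0 and the df defined above (the "second" alias is just illustrative):

    from pyspark.ml.functions import vector_to_array

    # Convert the VectorUDT column to a plain array column, then index it.
    df.select(vector_to_array("features")[1].alias("second")).show()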
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:

    v.values.item(0)

which returns standard Python scalars. Similarly, if you want to access all values as a dense structure:

    v.toArray().tolist()
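For instance, here is a small local sketch (no SparkSession needed) of why the SparseVector warning matters: values holds only the stored non-zero entries, so indexing it is not the same as indexing the dense view:

    from pyspark.ml.linalg import Vectors

    sv = Vectors.sparse(3, [1], [9])  # dense view: [0.0, 9.0, 0.0]

    sv.values.item(0)      # 9.0 -- the first *stored* value, not element 0
    sv.toArray().item(0)   # 0.0 -- element 0 of the dense view
    sv.toArray().tolist()  # [0.0, 9.0, 0.0] -- plain Python floats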