I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions_prod) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1]

type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame

[2]

cv_predictions_prod.select('probability').show(10, False)
+----------------------------------------+
|probability                             |
+----------------------------------------+
|[0.31559134817066054,0.6844086518293395]|
|[0.8937864350711228,0.10621356492887715]|
|[0.8615878905395029,0.1384121094604972] |
|[0.9594427633777901,0.04055723662220989]|
|[0.5391547673698157,0.46084523263018434]|
|[0.2820729747752462,0.7179270252247538] |
|[0.7730465873083118,0.22695341269168817]|
|[0.6346585276598942,0.3653414723401058] |
|[0.6346585276598942,0.3653414723401058] |
|[0.637279255218404,0.362720744781596]   |
+----------------------------------------+
only showing top 10 rows

[3]

cv_predictions_prod.printSchema()
root
 ...
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
How do I parse the vector of the PySpark DataFrame, such that I create a new column that just pulls the first element of each probability vector?
This question is similar to the ones below, but the solutions in those links didn't work for me or weren't clear to me:
How to access the values of denseVector in PySpark
How to access element of a VectorUDT column in a Spark DataFrame?
Update:
It seems like there is a bug in Spark that prevents you from accessing individual elements of a dense vector in a select statement. Normally you would be able to access them just like you would a numpy array, but when trying to run the code previously posted, you may get the error pyspark.sql.utils.AnalysisException: "Can't extract value from probability#12;"
So, one way to avoid this silly bug is to use a udf. Similar to the other question, you can define a udf in the following way:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

firstelement = udf(lambda v: float(v[0]), FloatType())
cv_predictions_prod.select(firstelement('probability')).show()
Behind the scenes this still accesses the elements of the DenseVector like a numpy array, but it doesn't throw the same bug as before.
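If you want the extracted value attached as a new column on the DataFrame (which is what the question asks for), the same udf works with withColumn. A minimal sketch; the column name first_probability is just an illustrative choice, not from the original post:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# udf that pulls the first element of the vector out as a plain float
firstelement = udf(lambda v: float(v[0]), FloatType())

# attach it as a new column; 'first_probability' is an arbitrary name
cv_predictions_prod = cv_predictions_prod.withColumn(
    'first_probability', firstelement('probability'))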
Since this is getting a lot of upvotes, I figured I should strike through the incorrect portion of this answer.
Original answer: A dense vector is just a wrapper for a numpy array. So you can access the elements in the same way that you would access the elements of a numpy array.
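You can check this locally, outside of any DataFrame (a minimal illustration with made-up values):

from pyspark.ml.linalg import DenseVector

v = DenseVector([0.3155913481706605, 0.6844086518293395])

# indexing works just like a numpy array
float(v[0])  # 0.3155913481706605

# the wrapped numpy array itself is exposed as .values
v.values     # array([0.31559135, 0.68440865])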
There are several ways to access individual elements of an array in a DataFrame. One is to explicitly reference the column cv_predictions_prod['probability'] in your select statement. By referencing the column explicitly, you can perform operations on it, like selecting the first element of the array. For example:
cv_predictions_prod.select(cv_predictions_prod['probability'][0]).show()
should solve the problem.
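For what it's worth, if you're on Spark 3.0 or later, there is also pyspark.ml.functions.vector_to_array, which sidesteps the udf entirely by converting the vector column into a plain array column that does support ordinary indexing. A sketch under that version assumption (first_probability is again an illustrative alias):

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# convert the VectorUDT column to ArrayType, then index into it
cv_predictions_prod.select(
    vector_to_array(col('probability'))[0].alias('first_probability')
).show()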