I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions_prod) (see [1]). The probability column (see [2]) is a vector type (see [3]).
[1]

type(cv_predictions_prod)
pyspark.sql.dataframe.DataFrame

[2]

cv_predictions_prod.select('probability').show(10, False)
+----------------------------------------+
|probability                             |
+----------------------------------------+
|[0.31559134817066054,0.6844086518293395]|
|[0.8937864350711228,0.10621356492887715]|
|[0.8615878905395029,0.1384121094604972] |
|[0.9594427633777901,0.04055723662220989]|
|[0.5391547673698157,0.46084523263018434]|
|[0.2820729747752462,0.7179270252247538] |
|[0.7730465873083118,0.22695341269168817]|
|[0.6346585276598942,0.3653414723401058] |
|[0.6346585276598942,0.3653414723401058] |
|[0.637279255218404,0.362720744781596]   |
+----------------------------------------+
only showing top 10 rows

[3]

cv_predictions_prod.printSchema()
root
 ...
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
How do I parse the vector of the PySpark DataFrame, such that I create a new column that just pulls the first element of each probability vector?
This question is similar to the ones below, but the solutions in those links didn't work for me or weren't clear to me:
How to access the values of denseVector in PySpark
How to access element of a VectorUDT column in a Spark DataFrame?
Update:
It seems like there is a bug in Spark that prevents you from accessing individual elements of a dense vector in a select statement. Normally you would be able to access them just like you would a numpy array, but when trying to run the code previously posted, you may get the error pyspark.sql.utils.AnalysisException: "Can't extract value from probability#12;"
So, one way to avoid this silly bug is to use a udf. Similar to the other question, you can define a udf in the following way:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

firstelement = udf(lambda v: float(v[0]), FloatType())
cv_predictions_prod.select(firstelement('probability')).show()
Behind the scenes this still accesses the elements of the DenseVector like a numpy array, but it doesn't throw the same bug as before.
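If you want the extracted value attached as a new column on the DataFrame (which is what the question asks for), the same udf works with withColumn. A minimal sketch; the column name first_probability is just an illustrative choice, not from the original post:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# udf that pulls the first element of the vector out as a plain float
firstelement = udf(lambda v: float(v[0]), FloatType())

# attach it as a new column; 'first_probability' is an arbitrary name
cv_predictions_prod = cv_predictions_prod.withColumn(
    'first_probability', firstelement('probability'))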
Since this is getting a lot of upvotes, I figured I should strike through the incorrect portion of this answer.
Original answer: A dense vector is just a wrapper for a numpy array. So you can access the elements in the same way that you would access the elements of a numpy array.
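You can check this locally, outside of any DataFrame (a minimal illustration with made-up values):

from pyspark.ml.linalg import DenseVector

v = DenseVector([0.3155913481706605, 0.6844086518293395])

# indexing works just like a numpy array
float(v[0])  # 0.3155913481706605

# the wrapped numpy array itself is exposed as .values
v.values     # array([0.31559135, 0.68440865])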
There are several ways to access individual elements of an array in a DataFrame. One is to explicitly reference the column cv_predictions_prod['probability'] in your select statement. By referencing the column explicitly, you can perform operations on it, like selecting the first element of the array. For example:
cv_predictions_prod.select(cv_predictions_prod['probability'][0]).show()
should solve the problem.
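For what it's worth, if you're on Spark 3.0 or later, there is also pyspark.ml.functions.vector_to_array, which sidesteps the udf entirely by converting the vector column into a plain array column that does support ordinary indexing. A sketch under that version assumption (first_probability is again an illustrative alias):

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# convert the VectorUDT column to ArrayType, then index into it
cv_predictions_prod.select(
    vector_to_array(col('probability'))[0].alias('first_probability')
).show()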