From https://spark.apache.org/docs/2.2.0/ml-clustering.html#k-means
I know that after kmModel.transform(df) there is a prediction column in the output dataframe stating which cluster a record/point belongs to.
However, I'd also like to know how far each record/point deviates from its centroid, so I can tell which points in a cluster are typical and which may sit between clusters.
How can I do it? It does not seem to be implemented in the package by default.
Thanks!
In the K-Means algorithm, we calculate the distance from each point of the dataset to every initialized centroid. Based on the values found, each point is assigned to the centroid with the minimum distance; this distance calculation therefore plays a vital role in the clustering algorithm.
K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster. Implementing k-means with different distance metrics shows that the choice of metric plays a very important role in the clustering.
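For intuition, here is a minimal sketch of that assignment step in plain NumPy (outside Spark; the points and centroids are made-up values): each point is assigned to the centroid with the smallest squared Euclidean distance.
import numpy as np
points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0]])
centroids = np.array([[9.5, 4.625], [0.5, 0.5]])
# Squared Euclidean distance from every point to every centroid:
# result has shape (n_points, n_centroids)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
# Each point goes to its nearest centroid
assignments = sq_dists.argmin(axis=1)
print(assignments)  # [1 1 0]: the first two points share a centroid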
To evaluate a K-Means clustering with the Elbow Criterion, we need to calculate the SSE. The idea of the Elbow Criterion is to choose the k (number of clusters) at which the SSE stops decreasing abruptly. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.
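Spark 2.x exposes this SSE directly as KMeansModel.computeCost. A minimal sketch of the Elbow Criterion, assuming a DataFrame df with a features column like the one built below:
from pyspark.ml.clustering import KMeans
# Fit one model per candidate k and record the within-cluster SSE.
# computeCost returns the sum of squared distances to the nearest centroid.
for k in range(2, 7):
    model = KMeans().setK(k).setSeed(1).fit(df)
    print(k, model.computeCost(df))
# Choose the k where the SSE curve bends (the "elbow"), not its minimum.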
Let's assume we have the following sample data and k-means model:
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType, IntegerType
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),),
        (Vectors.dense([10.0, 1.5]),), (Vectors.dense([11.0, 0.0]),)]
df = spark.createDataFrame(data, ["features"])
n_centres = 2
kmeans = KMeans().setK(n_centres).setSeed(1)
kmModel = kmeans.fit(df)
df_pred = kmModel.transform(df)
df_pred.show()
+----------+----------+
| features|prediction|
+----------+----------+
| [0.0,0.0]| 1|
| [1.0,1.0]| 1|
| [9.0,8.0]| 0|
| [8.0,9.0]| 0|
|[10.0,1.5]| 0|
|[11.0,0.0]| 0|
+----------+----------+
Now, let's add a column containing each center's coordinates:
l_clusters = kmModel.clusterCenters()
# Convert the list of centers to a dict: cluster index -> list of float coordinates
d_clusters = {int(i): [float(c) for c in l_clusters[i]]
              for i in range(len(l_clusters))}
# Create a dataframe containing the centers and their coordinates
df_centers = spark.sparkContext.parallelize(
    [(k, v) for k, v in d_clusters.items()]).toDF(['prediction', 'center'])
df_pred = df_pred.withColumn('prediction',F.col('prediction').cast(IntegerType()))
df_pred = df_pred.join(df_centers,on='prediction',how='left')
df_pred.show()
+----------+----------+------------+
|prediction| features| center|
+----------+----------+------------+
| 0| [8.0,9.0]|[9.5, 4.625]|
| 0|[10.0,1.5]|[9.5, 4.625]|
| 0| [9.0,8.0]|[9.5, 4.625]|
| 0|[11.0,0.0]|[9.5, 4.625]|
| 1| [1.0,1.0]| [0.5, 0.5]|
| 1| [0.0,0.0]| [0.5, 0.5]|
+----------+----------+------------+
Finally, we can use a udf to compute the squared Euclidean distance between the features column and the center coordinates:
# squared_distance returns the squared Euclidean distance between two vectors
get_dist = F.udf(lambda features, center:
                 float(features.squared_distance(center)), FloatType())
df_pred = df_pred.withColumn('dist', get_dist(F.col('features'), F.col('center')))
df_pred.show()
+----------+----------+------------+---------+
|prediction| features| center| dist|
+----------+----------+------------+---------+
| 0|[11.0,0.0]|[9.5, 4.625]|23.640625|
| 0| [9.0,8.0]|[9.5, 4.625]|11.640625|
| 0| [8.0,9.0]|[9.5, 4.625]|21.390625|
| 0|[10.0,1.5]|[9.5, 4.625]|10.015625|
| 1| [1.0,1.0]| [0.5, 0.5]| 0.5|
| 1| [0.0,0.0]| [0.5, 0.5]| 0.5|
+----------+----------+------------+---------+
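To answer the original question about typical vs. borderline points, one possible follow-up is to rank each point within its cluster by that distance (a sketch; the euclidean_dist and rank_in_cluster column names are my own):
from pyspark.sql import Window
# 'dist' holds a *squared* distance; take the square root for the
# Euclidean distance itself.
df_pred = df_pred.withColumn('euclidean_dist', F.sqrt(F.col('dist')))
# Rank points within each cluster by distance to their centroid:
# the lowest ranks are the most typical points, the highest may
# sit between clusters.
w = Window.partitionBy('prediction').orderBy('euclidean_dist')
df_pred.withColumn('rank_in_cluster', F.rank().over(w)).show()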