Easiest way to represent Euclidean Distance in scala

Question

I am writing a data mining algorithm in Scala and I want to write the Euclidean Distance function for a given test and several train instances. I have an Array[Array[Double]] with test and train instances. I have a method which loops through each test instance against all training instances and calculates distances between the two (picking one test and train instance per iteration) and returns a Double.

Say, for example, I have the following data points:

testInstance = Array(Array(3.2, 2.1, 4.3, 2.8))
trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))

I have a method stub (highlighting the distance function) which returns neighbours around a given test instance:

def predictClass(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Double] = {

    for(testInstance <- testPoints)
    {
        for(trainInstance <- trainPoints) 
        {
            for(i <- 0 to k) 
            {
                distance = euclideanDistanceBetween(testInstance, trainInstance) //need help in defining this function
            }
        }
    }    
    return distance
}

I know how to write a generic Euclidean Distance formula as:

math.sqrt(math.pow((x1 - y1), 2) + math.pow((x2 - y2), 2))

I have some pseudo steps as to what I want the method to do with a basic definition of the function:

def distanceBetween(testInstance: Array[Double], trainInstance: Array[Double]): Double = {
  // subtract each element of trainInstance with testInstance
  // for example, 
  // iteration 1 will do [Array(3.9, 4.1, 6.2, 7.3) - Array(3.2, 2.1, 4.3, 2.8)]
  // i.e. sqrt(3.9-3.2)^2+(4.1-2.1)^2+(6.2-4.3)^2+(7.3-2.8)^2
  // return result
  // iteration 2 will do [Array(4.5, 6.1, 8.3, 3.8) - Array(3.2, 2.1, 4.3, 2.8)]
  // i.e. sqrt(4.5-3.2)^2+(6.1-2.1)^2+(8.3-4.3)^2+(3.8-2.8)^2
  // return result, and so on......
  }

How can I write this in code?

dhg · Accepted Answer

So the formula you put in only works for two-dimensional vectors. You have four dimensions, but you should probably write your function to be flexible on this. So check out this formula.

So what you really want to say is:

for each position i:
  subtract the ith element of Y from the ith element of X
  square it
add all of those up
square root the whole thing

To make this more functional-programming style it will be more like:

square root the:
  sum of:
    zip X and Y into pairs
    for each pair, square the difference

So that would look like:

import math._

def distance(xs: Array[Double], ys: Array[Double]) = {
  sqrt((xs zip ys).map { case (x,y) => pow(y - x, 2) }.sum)
}

val testInstances = Array(Array(5.0, 4.8, 7.5, 10.0), Array(3.2, 2.1, 4.3, 2.8))
val trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))

distance(testInstances.head, trainPoints.head)
// 3.2680269276736382

As for predicting the class, you can make that more functional too, but it's unclear what the Double is that you are intending to return. It seems like you would want to predict the class for each test instance? Maybe choosing the class c corresponding to the nearest training point?

def findNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]]): Array[Int] = {
  testPoints.map { testInstance =>
    trainPoints.zipWithIndex.map { case (trainInstance, c) =>
      c -> distance(testInstance, trainInstance)
    }.minBy(_._2)._1
  }
}    

findNearestClasses(testInstances, trainPoints)
// Array(2, 0)

Or maybe you want the k-nearest neighbors:

def findKNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Int] = {
  testPoints.map { testInstance =>
    val distances = 
      trainPoints.zipWithIndex.map { case (trainInstance, c) =>
        c -> distance(testInstance, trainInstance)
      }
    val classes = distances.sortBy(_._2).take(k).map(_._1)
    val classCounts = classes.groupBy(identity).mapValues(_.size)
    classCounts.maxBy(_._2)._1
  }
}    

findKNearestClasses(testInstances, trainPoints)
// Array(2, 1)

Easiest way to represent Euclidean Distance in scala

Tags:

arrays

scala

Sharadha Jayaraman

1 Answers

dhg

Recent Activity

Donate For Us

Easiest way to represent Euclidean Distance in scala

Tags:

arrays

scala

Sharadha Jayaraman

1 Answers

dhg

Related questions

Recent Activity

Donate For Us