I am writing a data mining algorithm in Scala and I want to write the Euclidean Distance function for a given test and several train instances. I have an Array[Array[Double]]
with test and train instances. I have a method which loops through each test instance against all training instances and calculates distances between the two (picking one test and train instance per iteration) and returns a Double
.
Say, for example, I have the following data points:
testInstance = Array(Array(3.2, 2.1, 4.3, 2.8))
trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))
I have a method stub (highlighting the distance function) which returns neighbours around a given test instance:
def predictClass(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Double] = {
for(testInstance <- testPoints)
{
for(trainInstance <- trainPoints)
{
for(i <- 0 to k)
{
distance = euclideanDistanceBetween(testInstance, trainInstance) //need help in defining this function
}
}
}
return distance
}
I know how to write a generic Euclidean Distance formula as:
math.sqrt(math.pow((x1 - y1), 2) + math.pow((x2 - y2), 2))
I have some pseudo steps as to what I want the method to do with a basic definition of the function:
def distanceBetween(testInstance: Array[Double], trainInstance: Array[Double]): Double = {
// subtract each element of trainInstance with testInstance
// for example,
// iteration 1 will do [Array(3.9, 4.1, 6.2, 7.3) - Array(3.2, 2.1, 4.3, 2.8)]
// i.e. sqrt(3.9-3.2)^2+(4.1-2.1)^2+(6.2-4.3)^2+(7.3-2.8)^2
// return result
// iteration 2 will do [Array(4.5, 6.1, 8.3, 3.8) - Array(3.2, 2.1, 4.3, 2.8)]
// i.e. sqrt(4.5-3.2)^2+(6.1-2.1)^2+(8.3-4.3)^2+(3.8-2.8)^2
// return result, and so on......
}
How can I write this in code?
So the formula you put in only works for two-dimensional vectors. You have four dimensions, but you should probably write your function to be flexible on this. So check out this formula.
So what you really want to say is:
for each position i:
subtract the ith element of Y from the ith element of X
square it
add all of those up
square root the whole thing
To make this more functional-programming style it will be more like:
square root the:
sum of:
zip X and Y into pairs
for each pair, square the difference
So that would look like:
import math._
def distance(xs: Array[Double], ys: Array[Double]) = {
sqrt((xs zip ys).map { case (x,y) => pow(y - x, 2) }.sum)
}
val testInstances = Array(Array(5.0, 4.8, 7.5, 10.0), Array(3.2, 2.1, 4.3, 2.8))
val trainPoints = Array(Array(3.9, 4.1, 6.2, 7.3), Array(4.5, 6.1, 8.3, 3.8), Array(5.2, 4.6, 7.4, 9.8), Array(5.1, 7.1, 4.4, 6.9))
distance(testInstances.head, trainPoints.head)
// 3.2680269276736382
As for predicting the class, you can make that more functional too, but it's unclear what the Double is that you are intending to return. It seems like you would want to predict the class for each test instance? Maybe choosing the class c
corresponding to the nearest training point?
def findNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]]): Array[Int] = {
testPoints.map { testInstance =>
trainPoints.zipWithIndex.map { case (trainInstance, c) =>
c -> distance(testInstance, trainInstance)
}.minBy(_._2)._1
}
}
findNearestClasses(testInstances, trainPoints)
// Array(2, 0)
Or maybe you want the k
-nearest neighbors:
def findKNearestClasses(testPoints: Array[Array[Double]], trainPoints: Array[Array[Double]], k: Int): Array[Int] = {
testPoints.map { testInstance =>
val distances =
trainPoints.zipWithIndex.map { case (trainInstance, c) =>
c -> distance(testInstance, trainInstance)
}
val classes = distances.sortBy(_._2).take(k).map(_._1)
val classCounts = classes.groupBy(identity).mapValues(_.size)
classCounts.maxBy(_._2)._1
}
}
findKNearestClasses(testInstances, trainPoints)
// Array(2, 1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With