Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

2D Orthogonal projection of vector onto line with numpy yields wrong result

I have 350 document scores that, when I plot them, have this shape:

docScores = [(0, 68.62998962), (1, 60.21374512), (2, 54.72480392), 
             (3, 50.71389389), (4, 49.39723969), ...,  
             (345, 28.3756237), (346, 28.37126923), 
             (347, 28.36397934), (348, 28.35762787), (349, 28.34219933)]

I posted the complete array here on pastebin (it corresponds to the dataPoints list on the code below).

Score distribution

Now, I originally needed to find the elbow point of this L-shape curve, which I found thanks to this post.

Now, on the following plot, the red vector p represents the elbow point. I would like to find the point x=(?,?) (the yellow star) on the vector b which corresponds to the orthogonal projection of p onto b.

enter image description here

The red point on the plot is the one I obtain (which is obviously wrong). I obtain it doing the following:

b_hat = b / np.linalg.norm(b)    #unit vector of b
proj_p_onto_b = p.dot(b_hat)*b_hat
red_point = proj_p_onto_b + s

Now, if the projection of p onto b is defined by the its starting and ending point, namely s and x (the yellow star), it follows that proj_p_onto_b = x - s, therefore x = proj_p_onto_b + s ?

Did I make a mistake here ?

EDIT : In answer to @cxw, here is the code for computing the elbow point :

def findElbowPoint(self, rawDocScores):
    dataPoints = zip(range(0, len(rawDocScores)), rawDocScores)
    s = np.array(dataPoints[0])
    l = np.array(dataPoints[len(dataPoints)-1])
    b_vect = l-s
    b_hat = b_vect/np.linalg.norm(b_vect)
    distances = []
    for scoreVec in dataPoints[1:]:
        p = np.array(scoreVec) - s
        proj = p.dot(b_hat)*b_hat
        d = abs(np.linalg.norm(p - proj)) # orthgonal distance between b and the L-curve
        distances.append((scoreVec[0], scoreVec[1], proj, d))

    elbow_x = max(distances, key=itemgetter(3))[0]
    elbow_y = max(distances, key=itemgetter(3))[1]
    proj = max(distances, key=itemgetter(3))[2]
    max_distance = max(distances, key=itemgetter(3))[3]

    red_point = proj + s

EDIT : Here is the code for the plot :

>>> l_curve_x_values = [x[0] for x in docScores]
>>> l_curve_y_values = [x[1] for x in docScores]
>>> b_line_x_values = [x[0] for x in docScores]
>>> b_line_y_values = np.linspace(s[1], l[1], len(docScores))
>>> p_line_x_values = l_curve_x_values[:elbow_x]
>>> p_line_y_values = np.linspace(s[1], elbow_y, elbow_x)
>>> plt.plot(l_curve_x_values, l_curve_y_values, b_line_x_values, b_line_y_values, p_line_x_values, p_line_y_values)
>>> red_point = proj + s
>>> plt.plot(red_point[0], red_point[1], 'ro')
>>> plt.show()
like image 287
Floran Gmehlin Avatar asked Oct 06 '16 11:10

Floran Gmehlin


2 Answers

If you are using the plot to visually determine if the solution looks correct, you must plot the data using the same scale on each axis, i.e. use plt.axis('equal'). If the axes do not have equal scales, the angles between lines are distorted in the plot.

like image 141
Warren Weckesser Avatar answered Sep 28 '22 13:09

Warren Weckesser


First of all, is the point at ~(50, 37) p or s+p? If p, that might be your problem right there! If the Y component of your p variable is positive, you won't get the results you expect when you do the dot product.

Assuming that point is s+p, if a bit of Post-It scribbling is correct,

p_len = np.linalg.norm(p)
p_hat = p / p_len
red_len = p_hat.dot(b_hat) * p_len   # red_len = |x-s|
    # because p_hat . b_hat = 1 * 1 * cos(angle) = |x-s| / |p|
red_point = s + red_len * b_hat

Not tested! YMMV. Hope this helps.

like image 44
cxw Avatar answered Sep 28 '22 11:09

cxw