As part of the k-means algorithm, I am trying to update the cluster each item belongs to as in the below query. Problem is, I cannot seem to be able to reference table d inside the nested query.
UPDATE algorithms.km_crimes d SET cluster_id = c.id
FROM (SELECT id FROM algorithms.km_cluster_centres c
ORDER BY |/ (POW(d.latitude-c.latitude,2)+POW(d.longitude-c.longitude,2))
ASC LIMIT 1) AS c
WHERE d.cluster_id IS DISTINCT FROM c.id;
Can anyone suggest how to restructure the query? I've tried too many modifications to count
Based on the MySQL example which you are converting, you don't need to change this first query at all.
The algorithm doesn't care how many times a cluster_id is reassigned in each iteration; it just needs to stop when none of the cluster centers are moved. And luckily, the second query is much easier to fix.
This appears to work:
CREATE TABLE km_data (id serial, cluster_id int, lat double precision, lng double precision);
CREATE TABLE km_clusters (id serial, lat double precision, lng double precision);
CREATE OR REPLACE FUNCTION kmeans(k int) RETURNS VOID LANGUAGE plpgsql AS $$
BEGIN
TRUNCATE km_clusters;
INSERT INTO km_clusters (lat, lng)
SELECT lat, lng FROM km_data
ORDER BY random() LIMIT k;
LOOP
UPDATE km_data d SET cluster_id = (
SELECT id FROM km_clusters c
ORDER BY |/(POW(d.lat-c.lat,2)+POW(d.lng-c.lng,2)) LIMIT 1
);
UPDATE km_clusters c
SET lat=d.lat, lng=d.lng
FROM (
SELECT
cluster_id,
AVG(lat) AS lat,
AVG(lng) AS lng
FROM km_data
GROUP BY cluster_id
) d
WHERE
c.id=d.cluster_id AND
ABS(c.lat-d.lat) < 0.001 AND
ABS(c.lng-d.lng) < 0.001;
EXIT WHEN NOT FOUND;
END LOOP;
END $$;
You can tune the numbers in the final WHERE clause if you want more precision, though this looks like a pretty imprecise algorithm to begin with.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With