Action: Cluster points based on distance and label, using connected components.
Problem: The constant back-and-forth between NetworkX's node attribute storage and the Pandas DataFrame.
Tried: Different functions such as scikit-learn's NearestNeighbors, but they lead to the same back-and-forth movement of data.
Question: Is there a simpler way to perform this connected-components operation?
Example
import numpy as np
import pandas as pd
import dask.dataframe as dd
import networkx as nx
from scipy import spatial

# generate example dataframe
pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'z': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'label': [1, 2, 1, 2, 1]},
                   index=[1, 2, 3, 4, 5])
df = dd.from_pandas(pdf, npartitions=2)

object_id = 0
def cluster(df, object_id=object_id):
    # create kdtree
    tree = spatial.cKDTree(df[['x', 'y', 'z']])
    # get neighbours within distance for every point, store in dataframe as edges
    edges = pd.DataFrame({'src': [], 'tgt': []}, dtype=int)
    for source, target in enumerate(tree.query_ball_tree(tree, r=2)):
        target.remove(source)
        if target:
            edges = edges.append(pd.DataFrame({'src': [source] * len(target), 'tgt': target}),
                                 ignore_index=True)
    # create graph for points using edges from the ball-tree query
    G = nx.from_pandas_dataframe(edges, 'src', 'tgt')
    # copy the point attributes onto the graph nodes
    for i in sorted(G.nodes()):
        G.node[i]['label'] = df.label[i]
        G.node[i]['x'] = df.x[i]
        G.node[i]['y'] = df.y[i]
        G.node[i]['z'] = df.z[i]
    # remove edges between points of different classes
    G.remove_edges_from([(u, v) for (u, v) in G.edges_iter()
                         if G.node[u]['label'] != G.node[v]['label']])
    # find connected components, create dataframe and assign object id
    components = list(nx.connected_component_subgraphs(G))
    df_objects = pd.DataFrame()
    for c in components:
        df_object = pd.DataFrame([[i[0], i[1]['x'], i[1]['y'], i[1]['z'], i[1]['label']]
                                  for i in c.nodes(data=True)],
                                 columns=['point_id', 'x', 'y', 'z', 'label']).set_index('point_id')
        df_object['object_id'] = object_id
        df_objects = df_objects.append(df_object)
        object_id += 1
    return df_objects

meta = pd.DataFrame(np.empty(0, dtype=[('x', float), ('y', float), ('z', float),
                                       ('label', int), ('object_id', int)]))
df.apply(cluster, axis=1, meta=meta).head(10)
You can use DBSCAN from scikit-learn. With min_samples=1 it essentially finds connected components. It can use different algorithms for the nearest-neighbor computation, configured through the algorithm parameter (kd-tree is one of the options).
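As a minimal sketch of that idea, applied to the pdf example frame from the question with a search radius of 2 (label handling is covered by the next suggestion):

from sklearn.cluster import DBSCAN

# min_samples=1 makes every point a core point, so each DBSCAN cluster is
# exactly one connected component of the "within eps" neighborhood graph.
# algorithm="kd_tree" selects the kd-tree backend for the neighbor search.
db = DBSCAN(eps=2.0, min_samples=1, algorithm="kd_tree").fit(pdf[["x", "y", "z"]])
pdf["cluster"] = db.labels_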
My other suggestion is to do the computation separately for different labels. This simplifies the implementation and allows for parallelization.
These two suggestions can be implemented as follows:
from sklearn.cluster import DBSCAN

def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y", ...]])
    return df.assign(cluster=db.labels_)

df = df.groupby("label", group_keys=False).apply(add_cluster, distance)
It should work for both Pandas and Dask dataframes. Note that the cluster id starts from 0 for each label, i.e. a cluster is uniquely identified by the tuple (label, cluster).
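For a Dask dataframe the only extra piece is a meta argument describing the output schema of the grouped apply. A rough sketch, building on the question's pdf example frame; the distance of 2.0 and the dtypes in meta are assumptions:

import dask.dataframe as dd
from sklearn.cluster import DBSCAN

# ddf: the Dask version of the question's example frame (pdf)
ddf = dd.from_pandas(pdf, npartitions=2)

def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y", "z"]])
    return df.assign(cluster=db.labels_)

# meta describes the schema of the frame add_cluster returns, so Dask can
# build the task graph without running the function first.
meta = {"x": "f8", "y": "f8", "z": "f8", "label": "i8", "cluster": "i8"}

clustered = ddf.groupby("label").apply(add_cluster, 2.0, meta=meta)
print(clustered.compute())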
Here is a complete example with artificial data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

plt.rc("figure", dpi=100)
plt.style.use("ggplot")

# create fake data
centers = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
XY, labels = make_blobs(n_samples=100, centers=centers, cluster_std=0.2, random_state=0)
inp = (
    pd.DataFrame(XY, columns=["x", "y"])
    .assign(label=labels)
    .replace({"label": {2: 0, 3: 1}})
)

def add_cluster(df, distance):
    db = DBSCAN(eps=distance, min_samples=1).fit(df[["x", "y"]])
    return df.assign(cluster=db.labels_)

out = inp.groupby("label", group_keys=False).apply(add_cluster, 0.5)

# visualize: marker shape encodes the label, color encodes the cluster
label_marker = ["o", "s"]
ax = plt.gca()
ax.set_aspect('equal')
for (label, cluster), group in out.groupby(["label", "cluster"]):
    plt.scatter(group.x, group.y, marker=label_marker[label])
The resulting dataframe contains the original x, y, and label columns plus the new cluster column.
The plot of the clusters looks as follows. Labels are indicated by the marker shape and clusters by the color.
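If, like the object_id in the question, you want a single global identifier instead of the (label, cluster) pair, one option (my addition, not part of the answer's code) is pandas' GroupBy.ngroup on the final frame:

# Number every unique (label, cluster) pair 0, 1, 2, ... to get one global
# object id (sketch for the pandas result `out` from the example above).
out["object_id"] = out.groupby(["label", "cluster"]).ngroup()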