Runtime warning in sklearn KMeans

Question

I am running k-means with sklearn but keep getting runtime warnings. Can you please explain what's happening? Below is sample code for reproducibility:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

col1 = np.random.normal(loc=0, scale=1, size=1000)
col2 = np.random.normal(loc=1, scale=1, size=1000)
col3 = np.random.normal(loc=2, scale=4, size=1000)
col4 = np.random.normal(loc=3, scale=3, size=1000)

df = pd.DataFrame(list(zip(col1, col2, col3, col4)))

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=0)
kmeans.fit(df)

The warnings are:

miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: divide by zero encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: divide by zero encountered in matmul
  current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: overflow encountered in matmul
  current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: invalid value encountered in matmul
  current_pot = closest_dist_sq @ sample_weight

alien_jedi · Accepted Answer

When input features differ widely in scale, k-means clustering can run into numerical instability.

For example, if one feature has values in the thousands and another has values between 0 and 1, the large feature can "dominate" the internal distance calculations of k-means, producing very large or very small numbers during the matrix multiplications. Scaling the data can fix such issues. Sklearn comes with a built-in StandardScaler that standardizes each feature to zero mean and unit variance, so that all features contribute more equally to the distance calculations.
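
To make the "dominate" point concrete, here is a minimal sketch on made-up data (the two features, their ranges, and the helper `feature_share` are illustrative assumptions, not taken from the question). It measures how much of the total squared distance from the mean each feature contributes, before and after StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one feature in the thousands, one between 0 and 1.
big = rng.normal(loc=5000, scale=1000, size=500)
small = rng.uniform(0, 1, size=500)
X = np.column_stack([big, small])


def feature_share(X):
    """Fraction of the total squared distance to the mean contributed by each feature."""
    sq = (X - X.mean(axis=0)) ** 2
    per_feature = sq.sum(axis=0)
    return per_feature / per_feature.sum()


raw_share = feature_share(X)
scaled_share = feature_share(StandardScaler().fit_transform(X))

print("raw:   ", raw_share)     # the large feature accounts for nearly everything
print("scaled:", scaled_share)  # both features contribute equally
```

On the raw data the small feature is effectively invisible to a Euclidean distance, whereas after standardization each feature contributes the same share.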

You can import as follows:

from sklearn.preprocessing import StandardScaler

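Then fit k-means on the scaled array rather than on the raw DataFrame (the sample code in the question computes scaled_df but still calls kmeans.fit(df)):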

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

kmeans = KMeans(...)
kmeans.fit(scaled_df)
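
Put together, a self-contained version might look like the sketch below. It reuses the distributions and KMeans parameters from the question; the seeded generator, the explicit `n_init=10`, and the warning-recording block are additions made here for reproducibility and checking, not part of the original code.

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # seeded so the run is repeatable (assumption)

# Same four normal columns as in the question.
X = np.column_stack([
    rng.normal(loc=0, scale=1, size=1000),
    rng.normal(loc=1, scale=1, size=1000),
    rng.normal(loc=2, scale=4, size=1000),
    rng.normal(loc=3, scale=3, size=1000),
])

# Standardize every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Record any warnings raised while fitting on the scaled data.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kmeans = KMeans(n_clusters=2, init="k-means++", max_iter=300,
                    n_init=10, random_state=0)
    labels = kmeans.fit_predict(X_scaled)

runtime_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]
print("RuntimeWarnings during fit:", len(runtime_warnings))
print("cluster sizes:", np.bincount(labels))
```

Printing the number of recorded RuntimeWarnings makes it easy to compare a fit on `X` with a fit on `X_scaled` in your own environment.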

Tags: python, k-means, scikit-learn

useryk
