Runtime warning in sklearn KMeans

I am running k-means using sklearn but keep getting runtime warnings. Can you please explain what's happening? Below is sample code to reproduce the issue:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

col1 = np.random.normal(loc=0, scale=1, size=1000)
col2 = np.random.normal(loc=1, scale=1, size=1000)
col3 = np.random.normal(loc=2, scale=4, size=1000)
col4 = np.random.normal(loc=3, scale=3, size=1000)

df = pd.DataFrame(list(zip(col1, col2, col3, col4)))

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=0)
kmeans.fit(df)

The warnings are:

miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: divide by zero encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul
  ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: divide by zero encountered in matmul
  current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: overflow encountered in matmul
  current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: invalid value encountered in matmul
  current_pot = closest_dist_sq @ sample_weight
asked May 17 '26 19:05 by useryk

1 Answer

When the input features differ widely in scale, k-means clustering can run into numerical instability.

For example, if one feature has values in the thousands and another lies between 0 and 1, the larger feature can "dominate" the distance calculations inside k-means, leading to very large or very small numbers during the matrix multiplications. Scaling the data can fix such issues. Sklearn comes with a built-in StandardScaler that standardizes each feature to zero mean and unit variance, so that all features contribute more equally to the distance calculations.
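To make the "dominate" point concrete, here is a small sketch on made-up data (the two columns, the seed, and the helper feature_share are assumptions for illustration only, not taken from the question). It measures how much of the total squared distance to the mean each feature contributes, before and after standardizing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one feature in the thousands, one between 0 and 1.
large = rng.uniform(1000, 5000, size=200)
small = rng.uniform(0, 1, size=200)
X = np.column_stack([large, small])


def feature_share(data):
    """Fraction of the total squared distance to the mean owed to each column."""
    sq = (data - data.mean(axis=0)) ** 2
    return sq.sum(axis=0) / sq.sum()


raw_share = feature_share(X)

# After standardizing, every column has zero mean and unit variance.
scaled = StandardScaler().fit_transform(X)
scaled_share = feature_share(scaled)

print("share per feature, raw:   ", raw_share)
print("share per feature, scaled:", scaled_share)
```

On the raw data almost all of the distance comes from the first column; after scaling, both columns contribute equally.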

You can import as follows:

from sklearn.preprocessing import StandardScaler

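Then fit k-means on the scaled array rather than on the original df (the code in the question computes scaled_df but still passes df to fit):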

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

kmeans = KMeans(...)
kmeans.fit(scaled_df)
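If you'd rather not keep track of a separate scaled array, one option is to chain the two steps. This is only a sketch under some assumptions: it uses sklearn's make_pipeline, rebuilds data like the question's as a plain NumPy array with a fixed seed, reuses the question's KMeans arguments, and sets n_init=10 explicitly (my choice, not something from the question). Whether it silences the warnings in your environment is something you'd have to check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability

# Four columns with the same means and spreads as in the question.
X = np.column_stack([
    rng.normal(loc=0, scale=1, size=1000),
    rng.normal(loc=1, scale=1, size=1000),
    rng.normal(loc=2, scale=4, size=1000),
    rng.normal(loc=3, scale=3, size=1000),
])

# The scaler runs first, so KMeans only ever sees standardized data.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, init="k-means++", max_iter=300, n_init=10, random_state=0),
)
pipeline.fit(X)

# make_pipeline names each step after its lowercased class name.
km = pipeline.named_steps["kmeans"]
labels = km.labels_
centers = km.cluster_centers_  # expressed in the scaled feature space

print(np.bincount(labels))
print(centers)
```

With a pipeline, predict on new data applies the same scaling automatically, which avoids the mistake of fitting on one array and predicting on another.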
answered May 20 '26 08:05 by alien_jedi