I am running k-means with sklearn but keep getting runtime warnings. Can you please explain what's happening? Below is sample code to reproduce the problem:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
col1 = np.random.normal(loc=0, scale=1, size=1000)
col2 = np.random.normal(loc=1, scale=1, size=1000)
col3 = np.random.normal(loc=2, scale=4, size=1000)
col4 = np.random.normal(loc=3, scale=3, size=1000)
df = pd.DataFrame(list(zip(col1, col2, col3, col4)))
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=0)
kmeans.fit(df)
The warnings are:
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: divide by zero encountered in matmul
ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: overflow encountered in matmul
ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/utils/extmath.py:203: RuntimeWarning: invalid value encountered in matmul
ret = a @ b
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: divide by zero encountered in matmul
current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: overflow encountered in matmul
current_pot = closest_dist_sq @ sample_weight
miniconda3/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py:237: RuntimeWarning: invalid value encountered in matmul
current_pot = closest_dist_sq @ sample_weight
When the input features differ widely in scale, k-means clustering can become numerically unstable.
For example, if one feature has values in the thousands and another has values between 0 and 1, the larger feature can "dominate" the distance calculations inside k-means and produce very large or very small numbers during matrix multiplication. Scaling the data can fix such issues. Sklearn comes with a built-in StandardScaler, which standardizes each feature to zero mean and unit variance so that all features contribute more equally to the distance calculations.
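To make the "dominate" point concrete, here is a minimal sketch on made-up data (the two columns and the helper `share_of_sq_distance` are hypothetical, not part of the question). It measures how much of the total squared pairwise distance each feature contributes, before and after standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one feature in the thousands, one between 0 and 1.
X = np.column_stack([
    rng.uniform(1000, 5000, size=200),
    rng.uniform(0, 1, size=200),
])

def share_of_sq_distance(data):
    """Fraction of the summed squared pairwise differences owed to each column."""
    diffs = data[:, None, :] - data[None, :, :]      # shape (n, n, n_features)
    per_feature = (diffs ** 2).sum(axis=(0, 1))      # one total per feature
    return per_feature / per_feature.sum()

raw_share = share_of_sq_distance(X)

# StandardScaler gives every column zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_share = share_of_sq_distance(X_scaled)

print("raw share per feature:   ", raw_share)
print("scaled share per feature:", scaled_share)
```

On the raw data nearly all of the squared distance comes from the first column. After standardization both columns contribute equally, because each has unit variance.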
You can import as follows:
from sklearn.preprocessing import StandardScaler
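And you can use it as follows. Note that the sample code in the question computes scaled_df but then fits on the unscaled df; fit on scaled_df instead: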
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
kmeans = KMeans(...)
kmeans.fit(scaled_df)
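One way to avoid accidentally fitting on the unscaled data is to bundle both steps into a pipeline. The following is only a sketch under some assumptions: it uses a NumPy array built from the same four normal distributions as the question rather than the original DataFrame, and `n_init=10` is a value chosen here, not something the question specifies. Whether the warnings disappear may also depend on your environment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Same four distributions as in the question, stacked as columns.
X = np.column_stack([
    rng.normal(loc=0, scale=1, size=1000),
    rng.normal(loc=1, scale=1, size=1000),
    rng.normal(loc=2, scale=4, size=1000),
    rng.normal(loc=3, scale=3, size=1000),
])

# The pipeline scales first, then clusters the scaled values,
# so the unscaled data never reaches KMeans.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, init="k-means++", max_iter=300, n_init=10, random_state=0),
)
labels = model.fit_predict(X)

print(np.bincount(labels))  # number of points assigned to each cluster
```

New data passed to `model.predict(...)` goes through the same fitted scaler before it is assigned to a cluster.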