I have gone through the documents about n_features
and centers
parameters in make_blobs
function in SciKit
. However, every explanation I've seen doesn't sound so clear to me since I am new to SciKit
and Mathematics. I am wondering what do these two parameters: n_features
, centers
do in make_blobs
function as below.
make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)
Thank you in advance.
make_blobs() is used for generating isotropic Gaussian blobs for clustering. the param cluster_std is the standard deviation of the clusters.
Now, as you mentioned, n_features determined how many columns or features the generated datasets will have. In machine learning, features correspond to numerical characteristics data.
The make_blobs() function can be used to generate blobs of points with a Gaussian distribution. You can control how many blobs to generate and the number of samples to generate, as well as a host of other properties.
The make_blobs
function is a part of sklearn.datasets.samples_generator
. All methods in the package, help us to generate data samples or datasets. In machine learning, which scikit-learn all about, datasets are used to evaluate performance of machine learning models. This is an example on how to evaluate a KNN classifier:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_blobs(n_features=2, centers=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))
Now, as you mentioned, n_features
determined how many columns or features the generated datasets will have. In machine learning, features correspond to numerical characteristics data. For example, in Iris Dataset, there are 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width) so there are 4 numerical columns in the dataset. So by increasing n_features
in make_blobs
, we are adding more features hence increase the complexity of generated dataset.
As for the centers
, it is easier to understand by visualizing the generated dataset. I use matplotlib
to help us on that:
from sklearn.datasets.samples_generator import make_blobs
import matplot
# plot 1
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.savefig('centers_1.png')
plt.title('centers = 1')
# plot 2
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')
# plot 3
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')
plt.show()
If you run the code above you can easily see that centers
corresponds to number of classes generated in the data. It uses centers as a term because samples that belong to same class, tend to gather close to a center (coordinate).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With