 

What does "n_features" and "centers" parameters mean in make_blobs in SciKit?

Tags:

scikit-learn

I have gone through the documentation for the n_features and centers parameters of the make_blobs function in scikit-learn, but none of the explanations I've found is clear to me, since I am new to scikit-learn and to the underlying mathematics. What do the two parameters n_features and centers do in the make_blobs call below?

make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)

Thank you in advance.

asked Aug 06 '18 by Backrub32




1 Answer

The make_blobs function is part of sklearn.datasets (it used to live in sklearn.datasets.samples_generator, which has since been removed). The functions in that module generate synthetic data samples or datasets. In machine learning, which is what scikit-learn is all about, such datasets are used to evaluate the performance of machine learning models. This is an example of how to evaluate a KNN classifier:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# generate a synthetic dataset with 2 features and 3 classes
X, y = make_blobs(n_features=2, centers=3)
# hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))

Now, as you mentioned, n_features determines how many columns, or features, the generated dataset will have. In machine learning, features correspond to the numerical characteristics of the data. For example, the Iris Dataset has 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width), so there are 4 numerical columns in the dataset. By increasing n_features in make_blobs, we add more features and hence increase the complexity of the generated dataset.
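As a quick sanity check (a minimal sketch; the random_state value here is arbitrary), you can inspect the shapes of the arrays that make_blobs returns: X has one column per feature, and y holds one class label per sample.

from sklearn.datasets import make_blobs

# 50 samples, 4 features -> X has 50 rows and 4 columns
X, y = make_blobs(n_samples=50, n_features=4, centers=2, random_state=75)
print(X.shape)  # (50, 4)
print(y.shape)  # (50,)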

As for centers, it is easier to understand by visualizing the generated dataset. I use matplotlib to help with that:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# plot 1: a single cluster
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 1')
plt.savefig('centers_1.png')

# plot 2: two clusters
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')

# plot 3: three clusters
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')

plt.show()

[Scatter plots of the generated data for centers = 1, centers = 2 and centers = 3]

If you run the code above, you can easily see that centers corresponds to the number of classes generated in the data. The term centers is used because samples that belong to the same class tend to gather close to a center (coordinate).
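You can also confirm this without plotting (a small sketch along the same lines): the label array y contains exactly centers distinct values, one per generated cluster.

import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, centers=3)
print(np.unique(y))  # [0 1 2] -- one label per center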

answered Sep 18 '22 by Yohanes Gultom