 

What does "n_features" and "centers" parameters mean in make_blobs in SciKit?

Tags:

scikit-learn

I have gone through the documentation for the n_features and centers parameters of the make_blobs function in scikit-learn, but none of the explanations I've found is clear to me, since I am new to scikit-learn and to the underlying mathematics. What do the two parameters n_features and centers do in the make_blobs call below?

make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)

Thank you in advance.

asked Aug 06 '18 by Backrub32




1 Answer

The make_blobs function is part of sklearn.datasets (it used to live in sklearn.datasets.samples_generator, which has since been removed). The functions in that module generate synthetic data samples or datasets. In machine learning, which is what scikit-learn is all about, such datasets are used to evaluate the performance of machine learning models. This is an example of how to evaluate a KNN classifier:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# generate a synthetic dataset with 2 features and 3 classes
X, y = make_blobs(n_features=2, centers=3)
# hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))

Now, as you mentioned, n_features determines how many columns, or features, the generated dataset will have. In machine learning, features correspond to the numerical characteristics of the data. For example, the Iris Dataset has 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width), so there are 4 numerical columns in the dataset. By increasing n_features in make_blobs, we add more features and hence increase the complexity of the generated dataset.
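As a quick sanity check (a minimal sketch; the random_state value here is arbitrary), you can inspect the shapes of the arrays that make_blobs returns: X has one column per feature, and y holds one class label per sample.

from sklearn.datasets import make_blobs

# 50 samples, 4 features -> X has 50 rows and 4 columns
X, y = make_blobs(n_samples=50, n_features=4, centers=2, random_state=75)
print(X.shape)  # (50, 4)
print(y.shape)  # (50,)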

As for centers, it is easier to understand by visualizing the generated dataset. I use matplotlib to help with that:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# plot 1: a single cluster
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 1')
plt.savefig('centers_1.png')

# plot 2: two clusters
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')

# plot 3: three clusters
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')

plt.show()

[Scatter plots of the generated data for centers = 1, centers = 2 and centers = 3]

If you run the code above, you can easily see that centers corresponds to the number of classes generated in the data. The term centers is used because samples that belong to the same class tend to gather close to a center (coordinate).
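You can also confirm this without plotting (a small sketch along the same lines): the label array y contains exactly centers distinct values, one per generated cluster.

import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, centers=3)
print(np.unique(y))  # [0 1 2] -- one label per center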

answered Sep 18 '22 by Yohanes Gultom