In sklearn.datasets.make_classification, how is the class y calculated? Let's say I run this:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           random_state=0)
What formula is used to come up with the y's from the X's? The documentation touches on this when it talks about the informative features:
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined in order to add covariance. The clusters are then placed on the vertices of the hypercube.
Thanks,
G
y is not calculated from X. Every row in X is generated as a member of one of the n_classes classes, and y simply records which class that row belongs to. If flip_y is greater than zero, a fraction of these labels is then randomly flipped to create noise in the labeling.
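A quick way to see this (assuming scikit-learn is installed): with flip_y=0 no labels are flipped, so each of the two classes keeps exactly its generated share of the points — the default weights split 1000 samples 500/500.

```python
from sklearn.datasets import make_classification
import numpy as np

# With flip_y=0.0 the labels are exactly the class assignments used
# during generation; nothing is computed from the feature values.
X0, y0 = make_classification(n_samples=1000, n_features=2, n_informative=2,
                             n_redundant=0, n_classes=2,
                             n_clusters_per_class=1, flip_y=0.0,
                             random_state=0)
print(np.bincount(y0))  # [500 500]

# With flip_y=0.2, about 20% of the labels are reassigned to a random
# class, so the labels no longer match the generating clusters exactly.
Xn, yn = make_classification(n_samples=1000, n_features=2, n_informative=2,
                             n_redundant=0, n_classes=2,
                             n_clusters_per_class=1, flip_y=0.2,
                             random_state=0)
print(np.bincount(yn))
```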
Edit: giving an example
For example, assume you want 2 classes, 1 informative feature, and 4 data points in total. Suppose the two class centroids are generated randomly and happen to land at 1.0 and 3.0. Every data point generated around the first centroid (1.0) gets the label y=0, and every data point generated around the second centroid (3.0) gets the label y=1. The X1 values for the first class might happen to be 1.2 and 0.7, and for the second class 2.8 and 3.1. You now have 4 data points, and you know which class each was generated for, so your final data will be:
Y X1
0 1.2
0 0.7
1 2.8
1 3.1
As you can see, nothing is calculated; each class label is simply assigned as the data points are randomly generated around that class's centroid.
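The toy example above can be sketched in a few lines. This is not sklearn's actual implementation — the centroids (1.0 and 3.0), the jitter scale, and the point count are the hypothetical values from the example:

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = [1.0, 3.0]     # hypothetical class centers from the example
points_per_class = 2

X, y = [], []
for label, c in enumerate(centroids):
    for _ in range(points_per_class):
        X.append(c + rng.normal(scale=0.3))  # sample a point near the centroid
        y.append(label)                      # label = which centroid produced it

for label, x in zip(y, X):
    print(label, round(x, 1))
```

The label never depends on the sampled value itself, only on which centroid the point was drawn around — which is the whole point of the answer.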