I'm trying to follow this example, using my own data, to perform linear discriminant analysis and principal component analysis with scikit-learn. My data looks like:
id,mois,prot,fat,ash,sodium,carb,cal,brand
14069,27.82,21.43,44.87,5.11,1.77,0.77,4.93,a
14053,28.49,21.26,43.89,5.34,1.79,1.02,4.84,a
14025,28.35,19.99,45.78,5.08,1.63,0.8,4.95,a
14016,30.55,20.15,43.13,4.79,1.61,1.38,4.74,a
14005,30.49,21.28,41.65,4.82,1.64,1.76,4.67,a
14075,31.14,20.23,42.31,4.92,1.65,1.4,4.67,a
14082,31.21,20.97,41.34,4.71,1.58,1.77,4.63,a
14097,28.76,21.41,41.6,5.28,1.75,2.95,4.72,a
14117,28.22,20.48,45.1,5.02,1.71,1.18,4.93,a
14133,27.72,21.19,45.29,5.16,1.66,0.64,4.95,a
...
brand is the target variable.
Following the example linked above, I've started with this code:
# Import libraries
import pylab as pl
%pylab inline
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.lda import LDA
import pandas as pd
# Set up the data for the example
pizza_raw = pd.read_csv(r"C:\mypath\pizza.csv")  # raw string so backslashes aren't treated as escapes
pizza_target = pizza_raw["brand"]
# select all but the last column as data
pizza_data = pizza_raw.iloc[:, :-1]  # .ix is deprecated; use .iloc for positional slicing
pizza_names = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]
# Principal Components
pca = PCA(n_components=2)
X_r = pca.fit(pizza_data).transform(pizza_data)
# Linear Discriminant Analysis
lda = LDA(n_components=2)
X_r2= lda.fit(pizza_data, pizza_target).transform(pizza_data)
# Percentage of variance explained for each components
print('PCA explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
All of the above works as expected (I think). The next step in the example is to plot the data (the example uses the IRIS data set). The example plotting code looks like this:
pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('PCA of IRIS dataset')
pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('LDA of IRIS dataset')
pl.show()
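As an aside, newer scikit-learn releases moved the LDA class from `sklearn.lda` to `sklearn.discriminant_analysis`. A minimal sketch of the same fit/transform steps, using synthetic data as a stand-in for the pizza features (shapes and class labels here are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# Synthetic stand-in for the pizza features: 30 samples, 7 columns, 3 brands
X = rng.rand(30, 7)
y = np.repeat(["a", "b", "c"], 10)

# Project to 2 components with PCA (unsupervised) and LDA (supervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(pca.explained_variance_ratio_)  # one ratio per kept component
print(X_pca.shape, X_lda.shape)
```

With 3 classes, LDA can produce at most 2 discriminant components (n_classes - 1), which is why `n_components=2` works here.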
Two questions then:
Something like this should work:
color_marker = [(c, m) for c in "rgbc" for m in "123"]  # 12 combinations, assuming 12 types of pizza
for (c, m), target_name in zip(color_marker, pizza_names):
    pl.scatter(X_r[pizza_target == target_name, 0], X_r[pizza_target == target_name, 1],
               c=c, marker=m, label=target_name)
Plotting the LDA should work the same way, just using X_r2 instead of X_r. Also check the documentation of pl.scatter (especially the parts about colors and markers). Hope that helps.
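To make the LDA-plot suggestion concrete, here is a self-contained sketch of that loop. The projected data and targets are synthetic stand-ins for the question's X_r2 and pizza_target, and the output filename is just illustrative; the Agg backend lets it run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the 2-D LDA projection and the brand labels
rng = np.random.RandomState(0)
pizza_names = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]
X_r2 = rng.rand(120, 2)
pizza_target = np.repeat(pizza_names, 10)

# 4 colors x 3 markers = 12 distinct (color, marker) pairs
color_marker = [(c, m) for c in "rgbc" for m in "123"]

fig = plt.figure()
for (c, m), target_name in zip(color_marker, pizza_names):
    mask = pizza_target == target_name  # boolean mask for one brand
    plt.scatter(X_r2[mask, 0], X_r2[mask, 1], c=c, marker=m, label=target_name)
plt.legend()
plt.title("LDA of pizza dataset")
fig.savefig("lda_pizza.png")
```

Each iteration draws one brand's points as its own scatter series, which is what makes the per-brand legend entries work.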