
How do I work with external data in scikit to perform PCA/LDA?

I'm trying to follow this example, using my own data, to perform linear discriminant analysis and principal component analysis with scikit-learn. My data looks like:

id,mois,prot,fat,ash,sodium,carb,cal,brand
14069,27.82,21.43,44.87,5.11,1.77,0.77,4.93,a
14053,28.49,21.26,43.89,5.34,1.79,1.02,4.84,a
14025,28.35,19.99,45.78,5.08,1.63,0.8,4.95,a
14016,30.55,20.15,43.13,4.79,1.61,1.38,4.74,a
14005,30.49,21.28,41.65,4.82,1.64,1.76,4.67,a
14075,31.14,20.23,42.31,4.92,1.65,1.4,4.67,a
14082,31.21,20.97,41.34,4.71,1.58,1.77,4.63,a
14097,28.76,21.41,41.6,5.28,1.75,2.95,4.72,a
14117,28.22,20.48,45.1,5.02,1.71,1.18,4.93,a
14133,27.72,21.19,45.29,5.16,1.66,0.64,4.95,a
...

brand is the target variable.

Following the example linked above, I've started with this code:

# Import libraries
%matplotlib inline
import pylab as pl
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import pandas as pd

# Set up the data for the example
pizza_raw        = pd.read_csv(r"C:\mypath\pizza.csv")  # raw string so backslashes aren't escapes
pizza_target     = pizza_raw["brand"]

# select the feature columns (drop the id column and the brand target)
pizza_data       = pizza_raw.iloc[:, 1:-1]
pizza_names      = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]

# Principal Components
pca = PCA(n_components=2)
X_r = pca.fit(pizza_data).transform(pizza_data)

# Linear Discriminant Analysis
lda = LDA(n_components=2)
X_r2= lda.fit(pizza_data, pizza_target).transform(pizza_data)

# Percentage of variance explained for each component
print('PCA explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

All of the above works as expected (I think). The next step in the example is to plot the data (the example uses the IRIS data set). The example code looks like this:

pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('PCA of IRIS dataset')

pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('LDA of IRIS dataset')

pl.show()

Two questions then:

  1. Is my approach to fitting my data to the tutorial correct so far?
  2. How do I adapt the example plot code to produce the same PCA and LDA plots for my data?
Clay asked Mar 21 '26
1 Answer

Something like this should work:

# 4 colors x 3 markers = 12 (color, marker) pairs, one per pizza brand
color_marker = [(c, m) for c in "rgbc" for m in "123"]
for (color, marker), target_name in zip(color_marker, pizza_names):
    mask = (pizza_target == target_name).to_numpy()
    pl.scatter(X_r[mask, 0], X_r[mask, 1],
               c=color, marker=marker, label=target_name)

Plotting the LDA works the same way, just with X_r2 in place of X_r. Also check the documentation of pl.scatter (especially the parts about colors and markers). Hope that helps.
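For completeness, here is a runnable end-to-end sketch of both plots. Since I don't have your pizza.csv, it builds a small synthetic frame with three made-up brands in its place; it also uses the modern LinearDiscriminantAnalysis import (sklearn.lda was removed in later scikit-learn versions):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Synthetic stand-in for pizza.csv: 3 brands, 4 numeric features, 30 rows each
rng = np.random.default_rng(0)
brands = ["a", "b", "c"]
frames = []
for k, b in enumerate(brands):
    df = pd.DataFrame(rng.normal(loc=3.0 * k, scale=1.0, size=(30, 4)),
                      columns=["mois", "prot", "fat", "ash"])
    df["brand"] = b
    frames.append(df)
pizza_raw = pd.concat(frames, ignore_index=True)

pizza_target = pizza_raw["brand"]
pizza_data = pizza_raw.iloc[:, :-1]  # all feature columns

X_r = PCA(n_components=2).fit_transform(pizza_data)
X_r2 = LDA(n_components=2).fit(pizza_data, pizza_target).transform(pizza_data)

def brand_scatter(X, title):
    """Scatter the first two components, one color per brand."""
    plt.figure()
    for color, name in zip("rgb", brands):
        mask = (pizza_target == name).to_numpy()
        plt.scatter(X[mask, 0], X[mask, 1], c=color, label=name)
    plt.legend()
    plt.title(title)

brand_scatter(X_r, "PCA of pizza dataset")
brand_scatter(X_r2, "LDA of pizza dataset")
plt.savefig("lda_pizza.png")  # or plt.show() interactively
```

With your real data you would just swap the synthetic frame for pd.read_csv and extend the color/marker pairing to cover all 12 brands, as in the loop above.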

[Figure: Pizza PCA scatter plot]

Matt answered Mar 23 '26