
Using mca package in Python

I am trying to use the mca package to do multiple correspondence analysis in Python.

I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find principal components for those data) and then later I would be able to use the principal components that I found to transform unseen data.

Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what the cryptically named properties and methods do (e.g. .E, .L, .K, .k).

So far, if I have a DataFrame with a single column containing strings (assume this is the only column in the DF), I would do something like

import pandas as pd
import mca
ca = mca.MCA(pd.get_dummies(df, drop_first=True))

From what I can gather,

ca.fs_r(1)

is the transformation of the data in df and

ca.L

is supposed to be the eigenvalues (although I get a vector of 1s that is one element fewer than my number of features?).

Now, if I had some more data with the same features, let's say df_new, and assuming I've already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?

asked Jan 30 '18 12:01 by Dan



2 Answers

Another option is the prince library, which provides easy-to-use implementations of methods such as:

  1. Multiple correspondence analysis (MCA)
  2. Principal component analysis (PCA)
  3. Multiple factor analysis (MFA)

Install it with:

pip install --user prince

Using MCA is fairly simple and takes only a couple of steps, much like sklearn's PCA. We first build our dataframe.

import pandas as pd 
import prince

X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']

print(X.head())

mca = prince.MCA()

# outputs
>>     Color   Size   Action    Age Inflated
   0  YELLOW  SMALL  STRETCH  ADULT        T
   1  YELLOW  SMALL  STRETCH  CHILD        F
   2  YELLOW  SMALL      DIP  ADULT        F
   3  YELLOW  SMALL      DIP  CHILD        F
   4  YELLOW  LARGE  STRETCH  ADULT        T

Then call the fit and transform methods.

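mca = mca.fit(X)  # learn the factors from the training data
# Row coordinates of X, the counterpart of ca.fs_r(1). For *another* test set,
# call mca.transform(df_new), the counterpart of ca.fs_r_sup(df_new).
coords = mca.transform(X)  # keep the fitted model in mca; store the result separately
print(coords)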

# outputs
>>         0             1
0   0.705387  8.373126e-15
1  -0.386586  8.336230e-15
2  -0.386586  6.335675e-15
3  -0.852014  6.726393e-15
4   0.783539 -6.333333e-01
5   0.783539 -6.333333e-01
6  -0.308434 -6.333333e-01
7  -0.308434 -6.333333e-01
8  -0.773862 -6.333333e-01
9   0.783539  6.333333e-01
10  0.783539  6.333333e-01
11 -0.308434  6.333333e-01
12 -0.308434  6.333333e-01
13 -0.773862  6.333333e-01
14  0.861691 -5.893240e-15
15  0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15
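
If the dummy encoding is done by hand, as in the question, the new data has to end up with exactly the same dummy columns as the training data before it can be projected. A minimal sketch of one way to do that with pandas follows; the frames df and df_new and their values are made up for illustration, and the full indicator matrix is used (no drop_first) on the assumption that MCA expects one column per category.

```python
import pandas as pd

# Hypothetical training and new data with a single categorical column.
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
df_new = pd.DataFrame({"colour": ["blue", "red", "purple"]})

# Full indicator matrix for the training data.
train_dummies = pd.get_dummies(df)

# Encode the new data, then force the training column layout:
# categories never seen in training ("purple") are dropped,
# training categories absent from df_new ("green") are filled with 0.
new_dummies = pd.get_dummies(df_new).reindex(
    columns=train_dummies.columns, fill_value=0
).astype(int)

print(new_dummies)
```

Note that a row whose only category was never seen in training becomes all zeros, so it has no row profile to project; such rows need to be handled separately.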

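You can also plot the result, since prince integrates with matplotlib. The plot_coordinates call below is the one used in this answer; other prince releases may expose a different plotting method.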

ax = mca.plot_coordinates(
     X=X,
     ax=None,
     figsize=(6, 6),
     show_row_points=True,
     row_points_size=10,
     show_row_labels=False,
     show_column_points=True,
     column_points_size=30,
     show_column_labels=False,
     legend_n_cols=1
     )

ax.get_figure().savefig('images/mca_coordinates.svg')

[Image: MCA coordinates plot]

answered Sep 21 '22 20:09 by Axois


The documentation of the mca package is not very clear in that regard. However, there are a few clues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.

  1. The package author refers to new data as supplementary data, which is the terminology used in the following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of Measurement and Statistics, 651-657.
  2. The package has only two functions that accept new data as the parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter computes the column factor scores.
  3. The usage guide demonstrates this with a new data frame that was not used in the analysis itself.
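
To make "projecting supplementary rows" concrete, here is a rough numpy sketch of the idea described by Abdi & Valentin: run a plain correspondence analysis on the indicator matrix, then map each new row's profile through the column coordinates and divide by the singular values. This is not the mca package's implementation (it applies no Benzécri correction, for example); the function names fit_ca and project_rows and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_ca(Z):
    """Plain correspondence analysis of an indicator matrix Z (rows x dummy columns)."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    keep = s > 1e-10                     # drop the trivial zero dimensions
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    F = (U / np.sqrt(r)[:, None]) * s    # row principal coordinates
    G = (Vt.T / np.sqrt(c)[:, None]) * s # column principal coordinates
    return F, G, s

def project_rows(Z_new, G, s):
    """Project supplementary rows: row profiles times column coordinates, rescaled."""
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)
    return profiles @ G / s

# Toy indicator matrix: columns are colour_red, colour_blue, size_S, size_L.
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

F, G, s = fit_ca(Z)

# A new observation with the same dummy layout (red, size L).
Z_new = np.array([[1, 0, 0, 1]], dtype=float)
print(project_rows(Z_new, G, s))
```

A useful sanity check is that projecting the training rows themselves reproduces their fitted coordinates, i.e. project_rows(Z, G, s) matches F up to floating-point error.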
answered Sep 20 '22 20:09 by Jan Trienes