I am trying to use the mca package to do multiple correspondence analysis in Python.
I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find principal components for those data) and then later use those principal components to transform unseen data. Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (i.e. .E, .L, .K, .k, etc.).
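For reference, this is the two-step fit/transform workflow I have in mind, shown here with sklearn's PCA on made-up numeric data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))   # data used to find the components
X_new = rng.normal(size=(10, 5))      # unseen data with the same features

pca = PCA(n_components=2)
pca.fit(X_train)              # learn principal components from X_train
Z_new = pca.transform(X_new)  # project unseen data onto those components
print(Z_new.shape)            # (10, 2)
```

I am looking for the equivalent two-step pattern for MCA.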
So far, if I have a DataFrame with a column containing strings (assume this is the only column in the DF), I would do something like

import pandas as pd
import mca

ca = mca.MCA(pd.get_dummies(df, drop_first=True))
From what I can gather, ca.fs_r(1) is the transformation of the data in df, and ca.L is supposed to contain the eigenvalues (although I get a vector of 1s that is one element shorter than my number of features?).

Now, if I had some more data with the same features, say df_new, and assuming I've already converted this correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
Put in very simple terms, Multiple Correspondence Analysis (MCA) is to qualitative data, as Principal Component Analysis (PCA) is to quantitative data.
In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space.
A correspondence analysis uses a contingency table, a table of frequencies that shows how the categories of the variables are distributed. The data in the table undergo a series of transformations relative to the surrounding data to produce relational data, which is then plotted to show those relationships visually.
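As a small illustration (the data here are made up, not from the post), a contingency table for two categorical columns can be built with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'Color':    ['YELLOW', 'YELLOW', 'PURPLE', 'PURPLE', 'YELLOW'],
    'Inflated': ['T', 'F', 'F', 'T', 'T'],
})

# Frequency of each (Color, Inflated) combination
table = pd.crosstab(df['Color'], df['Inflated'])
print(table)
# Inflated  F  T
# Color
# PURPLE    1  1
# YELLOW    1  2
```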
Another option is to use the library prince, which provides easy-to-use implementations of tools such as MCA.
You can begin first by installing with:
pip install --user prince
Using MCA is fairly simple and can be done in a couple of steps (just like the sklearn PCA method). We first build our DataFrame.
import pandas as pd
import prince
X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
print(X.head())
# outputs
>>     Color   Size   Action    Age Inflated
0   YELLOW  SMALL  STRETCH  ADULT        T
1   YELLOW  SMALL  STRETCH  CHILD        F
2   YELLOW  SMALL      DIP  ADULT        F
3   YELLOW  SMALL      DIP  CHILD        F
4   YELLOW  LARGE  STRETCH  ADULT        T

mca = prince.MCA()
Then call the fit and transform methods.
mca = mca.fit(X)           # learn the factors from X
coords = mca.transform(X)  # row coordinates; call this on new data too (cf. ca.fs_r_sup(df_new))
print(coords)
# outputs
>> 0 1
0 0.705387 8.373126e-15
1 -0.386586 8.336230e-15
2 -0.386586 6.335675e-15
3 -0.852014 6.726393e-15
4 0.783539 -6.333333e-01
5 0.783539 -6.333333e-01
6 -0.308434 -6.333333e-01
7 -0.308434 -6.333333e-01
8 -0.773862 -6.333333e-01
9 0.783539 6.333333e-01
10 0.783539 6.333333e-01
11 -0.308434 6.333333e-01
12 -0.308434 6.333333e-01
13 -0.773862 6.333333e-01
14 0.861691 -5.893240e-15
15 0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15
You can even plot the coordinates, since prince builds on the matplotlib library.
ax = mca.plot_coordinates(
    X=X,
    ax=None,
    figsize=(6, 6),
    show_row_points=True,
    row_points_size=10,
    show_row_labels=False,
    show_column_points=True,
    column_points_size=30,
    show_column_labels=False,
    legend_n_cols=1
)
ax.get_figure().savefig('images/mca_coordinates.svg')
The documentation of the mca package is not very clear in that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis: the docstrings refer to the supplementary data argument as DF in both fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter is used to find the column factor scores.
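To see what such a supplementary projection does, here is a from-scratch NumPy sketch of MCA on an indicator matrix and of the transition formula used to project new rows. This is my own minimal implementation for illustration, not the mca package's code, and it assumes a full one-hot encoding (no drop_first):

```python
import numpy as np
import pandas as pd

def mca_fit(Z):
    """Correspondence analysis of an indicator (one-hot) matrix Z."""
    P = Z.to_numpy(dtype=float) / Z.to_numpy().sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    keep = s > 1e-10                       # drop numerically null axes
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    F = U * s / np.sqrt(r)[:, None]        # row factor scores (cf. ca.fs_r(1))
    G = Vt.T * s / np.sqrt(c)[:, None]     # column factor scores
    return F, G, s

def mca_project_rows(Z_new, G, s):
    """Project supplementary rows via the transition formula f = profile @ G / s."""
    profiles = Z_new.to_numpy(dtype=float)
    profiles /= profiles.sum(axis=1, keepdims=True)
    return profiles @ G / s

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'green'],
                   'size':  ['S', 'L', 'L', 'S', 'S', 'L']})
Z = pd.get_dummies(df)                     # full indicator matrix
F, G, s = mca_fit(Z)
# Projecting the training rows as "supplementary" reproduces their factor scores
print(np.allclose(F, mca_project_rows(Z, G, s)))  # True
```

New data encoded with the same dummy columns can be passed to mca_project_rows in the same way, which is the role ca.fs_r_sup(df_new) plays in the mca package.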