VIF function returns all 'inf' values

Question

I'm trying to deal with a multicollinearity problem using the variance_inflation_factor() function from statsmodels.

But after running the function, I found that every score it returned was infinite.

Here's my code:

from rdkit import Chem
import pandas as pd
import numpy as np
from numpy import array

data = pd.read_csv('Descriptors_raw.csv')
class_ = pd.read_csv('class_file.csv')
class_tot = pd.read_csv('class_total.csv')

mols_A1 = Chem.SDMolSupplier('finaldata_A1.sdf')
mols_A2 = Chem.SDMolSupplier('finaldata_A2.sdf')
mols_B = Chem.SDMolSupplier('finaldata_B.sdf')
mols_C = Chem.SDMolSupplier('finaldata_C.sdf')

mols = []
mols.extend(mols_A1)
mols.extend(mols_A2)
mols.extend(mols_B)
mols.extend(mols_C)

mols_df = pd.DataFrame(mols)
mols = pd.concat([mols_df, class_tot, data], axis=1)

mols = mols.dropna(axis=0, thresh=1400)
mols.groupby('target_name_quarter').mean()
fill_mean_func = lambda g: g.fillna(g.mean())
mols = mols.groupby('target_name_quarter').apply(fill_mean_func)
molfiles = mols.loc[:, :'target_quarter']
descriptors = mols.loc[:, 'nAcid':'Zagreb']

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
fitted = scaler.fit(descriptors)
descriptors_scaled = scaler.transform(descriptors)
descriptors_scaled = pd.DataFrame(descriptors_scaled, columns=descriptors.columns, index = list(descriptors.index.values))

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

descriptors_del_lowvar = variance_threshold_selector(descriptors_scaled, 0.01)
mols = pd.concat([molfiles, descriptors_del_lowvar.loc[:, 'nAcid':'Zagreb']], axis=1)

mols.loc[:, 'nAcid':'Zagreb'].corr()

import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline
sns.pairplot(mols[['apol', 'nAtom', 'nHeavyAtom', 'nH', 'nAcid']])

vif = pd.DataFrame()
des = mols.loc[:, 'nAcid':'Zagreb'] 
vif["VIF factor"] = [variance_inflation_factor(des.values, i) for i in range(des.shape[1])]
vif["features"] = des.columns
print(vif)

I applied MinMaxScaler() before eliminating the low-variance features so that all the variables would be in the same range. print(vif) returns a DataFrame in which every value is infinite, and I cannot figure out why.

Thank you in advance :)

ellkay666 · Accepted Answer

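An infinite VIF indicates perfect multicollinearity: the variable can be predicted exactly from the other independent variables, for example because it is perfectly correlated with one of them. The VIF of a variable is 1/(1-R2), where R2 comes from regressing that variable on all the others. With a perfect fit we get R2 = 1, so 1/(1-R2) becomes a division by zero and the result is infinity. To solve this problem, drop one of the variables that is causing the perfect multicollinearity from the dataset and compute the VIFs again.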
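To make the 1/(1-R2) explanation concrete, here is a minimal sketch that computes the VIF by hand with NumPy on made-up data. It is not the statsmodels implementation (this version adds an intercept to each regression), and the helper name vif_scores, the tol cutoff and the columns a, b, c, d are all hypothetical, chosen only for illustration. Column c is built as a + b, so a, b and c should all come out as inf, while the unrelated column d stays finite.

```python
import numpy as np


def vif_scores(X, tol=1e-12):
    """Hand-rolled VIF = 1 / (1 - R^2) for every column of a 2-D array.

    Hypothetical helper for illustration only. Each column is regressed on
    all the other columns plus an intercept. If 1 - R^2 is below `tol`, the
    fit is treated as perfect and the VIF is reported as inf.
    Assumes no column is constant (a constant column has zero variance).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for i in range(n_cols):
        y = X[:, i]
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        centered = y - y.mean()
        unexplained = (resid @ resid) / (centered @ centered)  # this is 1 - R^2
        scores[i] = np.inf if unexplained < tol else 1.0 / unexplained
    return scores


# Made-up data: c is an exact linear combination of a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
d = rng.normal(size=200)
c = a + b

vif_before = vif_scores(np.column_stack([a, b, c, d]))  # columns: a, b, c, d
vif_after = vif_scores(np.column_stack([a, b, d]))      # c dropped

print("with c:   ", vif_before)
print("without c:", vif_after)
```

Note that every variable taking part in the exact relationship gets an infinite VIF, not just one of them, so dropping a single column (here c) is enough to bring the rest back to finite values. In the question's data a similar relationship may exist among descriptors such as nAtom, nHeavyAtom and nH, but that is an assumption that would need to be checked on the actual file.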

Tags:

python

dataframe

infinite

feature-selection

Myon