Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading SKLearn cancer dataset into Pandas DataFrame

I'm trying to load a sklearn.dataset, and missing a column, according to the keys (target_names, target & DESCR). I have tried various methods to include the last column, but with errors.

 import numpy as np
 import pandas as pd
 from sklearn.datasets import load_breast_cancer

 cancer = load_breast_cancer()
 print cancer.keys()

the keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']

 data = pd.DataFrame(cancer.data, columns=[cancer.feature_names])
 print data.describe()

with the code above, it only returns 30 column, when I need 31 columns. What is the best way load scikit-learn datasets into pandas DataFrame.

like image 286
pythonhunter Avatar asked Jun 03 '17 04:06

pythonhunter


Video Answer


3 Answers

Another option, but a one-liner, to create the dataframe including the features and target variables is:

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns= np.append(cancer['feature_names'], ['target']))
like image 155
jedge Avatar answered Oct 17 '22 23:10

jedge


If you want to have a target column you will need to add it because it's not in cancer.data. cancer.target has the column with 0 or 1, and cancer.target_names has the label. I hope the following is what you want:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print cancer.keys()

data = pd.DataFrame(cancer.data, columns=[cancer.feature_names])
print data.describe()

data = data.assign(target=pd.Series(cancer.target))
print data.describe()

# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print data.shape # data.describe() won't show the "target" column here because I converted its value to string.
like image 5
Y. Luo Avatar answered Oct 17 '22 23:10

Y. Luo


This works too, also using pd.Series.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print cancer.keys()

data = pd.DataFrame(cancer.data, columns=[cancer.feature_names])
data['Target'] = pd.Series(data=cancer.target, index=data.index)

print data.keys()
print data.shape
like image 3
pythonhunter Avatar answered Oct 18 '22 00:10

pythonhunter