
mean() of column in pandas DataFrame returning inf: how can I solve this?

I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.

In the example below, I load an example dataset from UCI, remove the rows with missing data (thanks to the help from a previous question), and now I would like to normalize the data.

For many datasets, I just used:

valores = (valores - valores.mean()) / (valores.std())

But for this particular dataset the approach above doesn't work. The problem is that mean() returns inf for one of the columns, perhaps due to a precision issue. See the example below:

import pandas as pd

bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)

# Drop rows that contain '?' in any column that was not parsed as int64
for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']

valores = bcw.iloc[:, 1:10]
# mean() returns inf for this column
print(valores.iloc[:, 5].mean())

My question is how to deal with this. It seems that I need to change the dtype of this column, but I don't know how to do it.

Augusto Ribas asked Dec 24 '22 18:12


1 Answer

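I'm not so familiar with pandas, but it works if you convert the column to a NumPy array of floats first (use the builtin float as the dtype, since the np.float alias has been removed from recent NumPy releases). Try:

import numpy as np

np.asarray(valores.iloc[:,5], dtype=float).mean()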
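If you would rather stay in pandas, something like the following should also work. The likely cause is that the '?' entries make pandas read the column as strings (object dtype), and filtering those rows out afterwards does not change the dtype; older pandas versions then appear to concatenate the strings when summing, and the result overflows to inf once it is converted to a number. This is only a minimal sketch on a small made-up frame (`raw` and its column names `a` and `b` are hypothetical stand-ins, not the UCI data), using pd.to_numeric with errors="coerce" so that '?' becomes NaN:

```python
import pandas as pd

# Hypothetical stand-in for the UCI file: column "a" was read as strings
# because it contains a '?' placeholder for a missing value.
raw = pd.DataFrame({"a": ["1", "10", "?", "3"], "b": [5, 1, 2, 4]})

# Convert every column to a numeric dtype; values that cannot be parsed
# (such as '?') become NaN, and dropna() then removes those rows.
clean = raw.apply(pd.to_numeric, errors="coerce").dropna()

# With numeric dtypes, mean() and std() behave as expected,
# so the usual z-score normalization can be applied.
normalized = (clean - clean.mean()) / clean.std()

print(clean.dtypes)
print(normalized)
```

With the real file, passing na_values='?' to pd.read_csv and then calling dropna() should give numeric columns from the start, though I have not checked that against this dataset.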
Dave answered Feb 06 '23 17:02