How to fix the "NaN or infinity" issue for a sparse matrix in Python?

Tags: python, nan, scikit-learn

I'm totally new to Python. I started from some code I found online and have been modifying it. I'm building a term-document matrix and want to add some extra features to it before training a logistic regression model.

I've checked my data in R and found no problems, but when I run the logistic regression I get the error "ValueError: Array contains NaN or infinity." The error does not occur when I leave out my own features. My features are in the file "toPython.txt".

Note the two calls to assert_all_finite: both print "None", so neither check raises an error!

Below is the code I use and the output I get:

# Imports inferred from the names used below (Python 2, scikit-learn of that era).
import numpy as np
import pandas as p
from scipy import sparse
from scipy.sparse import csr_matrix, hstack
from sklearn import cross_validation
from sklearn import linear_model as lm
from sklearn.feature_extraction.text import TfidfVectorizer

def _assert_all_finite(X):
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Array contains NaN or infinity.")

def assert_all_finite(X):
    _assert_all_finite(X.data if sparse.issparse(X) else X)

def main():
    print "loading data.."
    traindata = list(np.array(p.read_table('data/train.tsv'))[:,2])
    testdata = list(np.array(p.read_table('data/test.tsv'))[:,2])
    y = np.array(p.read_table('data/train.tsv'))[:,-1]

    tfv = TfidfVectorizer(min_df=12, max_features=None, strip_accents='unicode',
                          analyzer='word', stop_words='english', lowercase=True,
                          token_pattern=r'\w{1,}', ngram_range=(1, 1),
                          use_idf=1, smooth_idf=1, sublinear_tf=1)

    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                               C=1, fit_intercept=True, intercept_scaling=1.0,
                               class_weight=None, random_state=None)

    X_all = traindata + testdata
    lentrain = len(traindata)

    # Extra features: keep only the non-NaN entries in a sparse matrix.
    f = np.array(p.read_table('data/toPython.txt'))
    indices = np.nonzero(~np.isnan(f))
    b = csr_matrix((f[indices], indices), shape=f.shape, dtype='float')

    print b.get_shape
    print assert_all_finite(b)  # first check: prints None
    print "fitting pipeline"
    tfv.fit(X_all)
    print "transforming data"
    X_all = tfv.transform(X_all)
    print X_all.get_shape

    # Append the extra features to the TF-IDF matrix.
    X_all = hstack([X_all, b], format='csr')
    print X_all.get_shape

    print assert_all_finite(X_all)  # second check: prints None

    X = X_all[:lentrain]
    print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))

main()

And the output is:

loading data..
<bound method csr_matrix.get_shape of <10566x40 sparse matrix of type '<type 'numpy.float64'>'
    with 422640 stored elements in Compressed Sparse Row format>>
**None**
fitting pipeline
transforming data
<bound method csr_matrix.get_shape of <10566x13913 sparse matrix of type '<type 'numpy.float64'>'
    with 1450834 stored elements in Compressed Sparse Row format>>
<bound method csr_matrix.get_shape of <10566x13953 sparse matrix of type '<type 'numpy.float64'>'
    with 1873474 stored elements in Compressed Sparse Row format>>
**None**
3 Fold CV Score:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 523, in runfile
    execfile(filename, namespace)
  File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 100, in <module>
    main()
  File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 97, in main
    print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1064, in _cross_val_score
    score = scorer(estimator, X_test, y_test)
  File "C:\Python27\lib\site-packages\sklearn\metrics\scorer.py", line 141, in __call__
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 403, in roc_auc_score
    fpr, tpr, tresholds = roc_curve(y_true, y_score)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 672, in roc_curve
    fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label)
  File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 504, in _binary_clf_curve
    y_true, y_score = check_arrays(y_true, y_score)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas? Thank you!!

asked Sep 22 '13 19:09

Stergios

2 Answers

The following worked well for me, assuming sm is a sparse matrix (mine was a CSR matrix; if you know how this behaves for other formats, please comment).

First, manually replace the NaNs with appropriate numbers in the data vector:

In [4]: np.isnan(sm.data).any()
Out[4]: True

In [5]: sm.data.shape
Out[5]: (553555,)

In [6]: sm.data = np.nan_to_num(sm.data)

In [7]: np.isnan(sm.data).any()
Out[7]: False

In [8]: sm.data.shape
Out[8]: (553555,)

There are no NaN values any more, but the matrix still stores the zeros that replaced them as explicit entries.

Next, remove those explicitly stored zeros from the sparse matrix:

In [9]: sm.eliminate_zeros()

In [10]: sm.data.shape
Out[10]: (551391,)

The matrix now stores fewer entries, so it actually got smaller, yay!
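The two steps above can be wrapped in one small function. This is only a sketch under some assumptions: `clean_sparse` is a hypothetical helper name (it is not part of SciPy), the toy matrix is made up for illustration, and the input is converted to CSR first because not every sparse format keeps its values in a flat `.data` array (LIL and DOK, for example, store them differently).

```python
import numpy as np
from scipy import sparse


def clean_sparse(matrix):
    """Return a CSR copy with NaN/inf stored values replaced and zeros dropped.

    Hypothetical helper for illustration; not a SciPy function.
    """
    # Convert to CSR (copying) so that .data is a flat array of stored values.
    cleaned = sparse.csr_matrix(matrix, dtype=float, copy=True)
    # nan_to_num maps NaN to 0.0 and +/-inf to very large finite floats.
    cleaned.data = np.nan_to_num(cleaned.data)
    # Drop the entries that are now explicitly stored zeros.
    cleaned.eliminate_zeros()
    return cleaned


# Toy input: NaN and inf are "nonzero", so they end up as stored entries.
dense = np.array([[1.0, np.nan, 0.0],
                  [0.0, 2.0, np.inf]])
sm = sparse.csr_matrix(dense)
cleaned = clean_sparse(sm)

print(sm.nnz, cleaned.nnz)
print(np.isfinite(cleaned.data).all())
```

Working on a copy leaves the original matrix untouched, which makes it easy to compare the number of stored entries before and after the cleanup.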

answered Sep 18 '22 11:09

NirIzr

I usually use this function:

x = np.nan_to_num(x)

It replaces NaN with zero and infinities with very large finite numbers.
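As a quick illustration of what that call does with its default arguments, here is a minimal sketch on a made-up dense array (the values are assumptions chosen only for the example). Note that this operates on an ordinary NumPy array; for a SciPy sparse matrix it would typically be applied to the stored values, i.e. the `.data` attribute, as in the other answer.

```python
import numpy as np

# Made-up example values: one NaN, one +inf, one -inf.
x = np.array([1.0, np.nan, np.inf, -np.inf])
cleaned = np.nan_to_num(x)

# NaN becomes 0.0; +inf and -inf become the largest and most negative
# finite float64 values, respectively.
print(cleaned)
print(np.isfinite(cleaned).all())
```

Whether zero is a sensible replacement depends on the feature: it silently hides missing values, so it may be worth checking where the NaNs come from before overwriting them.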

answered Sep 20 '22 11:09

lucky6qi
