scikit-learn spectral clustering: unable to find NaN lurking in data

I'm running spectral coclustering on this dataset of Jeopardy questions, and I've hit a frustrating issue with the data. Note that I'm only clustering the values in the 'question' column.

When I run biclustering on the dataset, I get a "divide by zero" RuntimeWarning followed by a ValueError:

/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:38: RuntimeWarning: divide by zero encountered in true_divide
  row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:286: RuntimeWarning: invalid value encountered in multiply
  z = np.vstack((row_diag[:, np.newaxis] * u,
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The error suggests there is a NaN or infinite value lurking in my data (which is just the single column of questions). It's all text data, and I've already tried most of the NumPy and Pandas functions for detecting NaNs and inf, as well as many solutions from Stack Overflow, and none of them turn up anything (see the sanity checks after the code below).

Just to make sure my code isn't at fault: the same code works perfectly on the twenty newsgroups dataset.

Here's the code on Kaggle if you want to run it and see for yourself. However, just in case SO's policies prohibit this, here's the code in a nutshell:

import re
from time import time

import pandas as pd
from nltk.corpus import stopwords
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer

dat = pd.read_csv('../input/jarchive_cleaned.csv')

qlist = []

def cleanhtml(raw_html):
    # strip HTML tags from the raw question text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

for row in dat.iterrows():
    txt = row[1]['text'].lower()
    txt = cleanhtml(txt)
    txt = re.sub(r'[^a-z ]', '', txt)   # keep only lowercase letters and spaces
    txt = re.sub(r'  ', ' ', txt)       # collapse double spaces
    # txt = ' '.join([stem(w) for w in txt.split(" ")])
    qlist.append([txt, row[1]['answer'], row[1]['category']])

print(qlist[:10])

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))   # one cluster per category

mtx = tv.fit_transform(queslst)

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()
cocluster.fit(mtx)
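To give an idea, the sanity checks I ran look roughly like this (a rough sketch, not the exact notebook code; 'text' is the column used above), and they come back clean:

import numpy as np

# the raw question text has no missing values
print(dat['text'].isna().sum())        # 0

# every stored value in the TF-IDF matrix is finite
print(np.isfinite(mtx.data).all())     # True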

1 Answer

Some strings, e.g. 'down out', consist entirely of stop words, so TfidfVectorizer() produces an all-zero row for them. That is what starts the chain of errors: the all-zero row causes the divide-by-zero warning, which puts inf/NaN values into the mtx sparse matrix, and those in turn trigger the final ValueError.
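This is easy to demonstrate in isolation (a minimal sketch; it uses sklearn's built-in English stop word list instead of the NLTK one from the question, but 'down' and 'out' are stop words in both):

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(stop_words='english')
m = tv.fit_transform(['down out', 'a question with some real content words'])
print(m[0].nnz)   # 0 -> every token of 'down out' was dropped, so its row is all zeros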

As a workaround, you can either remove these strings from the input, or drop the all-zero rows from the mtx matrix after it has been created by TfidfVectorizer.fit_transform(). The latter is a bit tricky because of the sparse matrix operations involved.

I went with the second option, since I didn't dig into the original task. It looks like this:

import numpy as np
from time import time

from nltk.corpus import stopwords
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')

queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))

mtx = tv.fit_transform(queslst)

# collect the indices of the all-zero rows (documents whose tokens were all
# filtered out, so TF-IDF leaves them with no features)
indices = []
for i, mx in enumerate(mtx):
    if np.sum(mx, axis=1) == 0:
        indices.append(i)

# boolean mask that keeps every row except the all-zero ones
mask = np.ones(mtx.shape[0], dtype=bool)
mask[indices] = False
mtx = mtx[mask]

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)

t = time()

cocluster.fit(mtx)

Finally, it works. I hope it helps, good luck!
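As a side note, the same zero-row filter can also be built without the Python loop over the sparse matrix (an equivalent sketch):

import numpy as np

# rows whose TF-IDF values sum to 0 have no features; keep only the others
row_sums = np.asarray(mtx.sum(axis=1)).ravel()
mtx = mtx[row_sums > 0]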
