I'm running spectral coclustering on a dataset of Jeopardy questions, and I've hit a frustrating issue with the data. Note that I'm only clustering the values in the 'question' column.
When I run the biclustering on the dataset, I get divide-by-zero warnings followed by a ValueError:
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:38: RuntimeWarning: divide by zero encountered in true_divide
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/bicluster.py:286: RuntimeWarning: invalid value encountered in multiply
z = np.vstack((row_diag[:, np.newaxis] * u,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The error suggests there is a NaN or infinite value lurking in my data (which is just the single column of questions). It's all text data, and I've already tried most of the NumPy and Pandas functions for finding NaNs and infs, as well as many solutions from Stack Overflow, and I couldn't find any.
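For reference, the checks I ran were roughly along these lines (just a sketch of the kind of thing I tried, not the exact notebook code; dat and mtx are defined in the code further down):

import numpy as np

print(dat['text'].isnull().sum())            # 0: no missing strings in the column
print(dat['text'].map(type).eq(str).all())   # True: every entry really is a string
print(np.isfinite(mtx.data).all())           # True: every stored tf-idf value is finite

All of them came back clean.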
To make sure my code itself isn't at fault: the exact same pipeline works perfectly on the twenty newsgroups dataset.
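That sanity check looked roughly like this (again a sketch, not the exact notebook code):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering

# same vectorizer + coclusterer, different corpus: no warnings, no ValueError
news = fetch_20newsgroups(subset='train')
news_mtx = TfidfVectorizer(stop_words='english', strip_accents='ascii').fit_transform(news.data)
SpectralCoclustering(n_clusters=20, svd_method='arpack', random_state=0).fit(news_mtx)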
Here's the code on Kaggle if you want to run it and see for yourself. However, just in case SO's policies prohibit this, here's the code in a nutshell:
import re
import pandas as pd
from time import time
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralCoclustering

dat = pd.read_csv('../input/jarchive_cleaned.csv')
qlist = []

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

for row in dat.iterrows():
    txt = row[1]['text'].lower()
    txt = cleanhtml(txt)
    txt = re.sub(r'[^a-z ]', '', txt)
    txt = re.sub(r' +', ' ', txt)   # collapse repeated spaces
    # txt = ' '.join([stem(w) for w in txt.split(" ")])
    qlist.append([txt, row[1]['answer'], row[1]['category']])

print(qlist[:10])
swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')
queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))
mtx = tv.fit_transform(queslst)
cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)
t = time()
cocluster.fit(mtx)
Some strings, e.g. 'down out', consist entirely of stop words, so TfidfVectorizer() turns them into all-zero rows. Those zero rows trigger the divide-by-zero warning when SpectralCoclustering scales the matrix internally (the row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))) line from the warning), and the resulting inf/NaN values are what cause the second error, the ValueError.
There are two ways around this: remove those sequences before vectorizing, or remove the all-zero rows from the mtx sparse matrix after TfidfVectorizer.fit_transform() has created it, which is a bit tricky because of the sparse matrix operations. Since I didn't dig deeper into the original task, I went with the second option, as follows:
import numpy as np

swords = set(stopwords.words('english'))
tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')
queslst = [q for (q, a, c) in qlist]
qlen = len(set([c for (q, a, c) in qlist]))
mtx = tv.fit_transform(queslst)

# collect the indices of the all-zero rows, then mask them out
indices = []
for i, mx in enumerate(mtx):
    if np.sum(mx, axis=1) == 0:
        indices.append(i)

mask = np.ones(mtx.shape[0], dtype=bool)
mask[indices] = False
mtx = mtx[mask]

cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)
t = time()
cocluster.fit(mtx)
Finally, it works. I hope it helps, good luck!
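For completeness, the other workaround (dropping the offending strings before vectorization) could look roughly like this. It is only a sketch: it reuses the names from above and relies on the vectorizer's build_analyzer(), which applies the same tokenization and stop-word filtering as fit_transform(), so the filter and the vectorizer agree on which questions end up empty.

tv = TfidfVectorizer(stop_words=swords, strip_accents='ascii')
analyze = tv.build_analyzer()  # same preprocessing + stop-word logic as fit_transform()

# keep only the questions that still contain at least one usable token
queslst = [q for (q, a, c) in qlist if len(analyze(q)) > 0]

mtx = tv.fit_transform(queslst)  # no all-zero rows left
cocluster = SpectralCoclustering(n_clusters=qlen, svd_method='arpack', random_state=0)
cocluster.fit(mtx)

Either way the idea is the same: SpectralCoclustering cannot handle rows (or columns) that sum to zero, so they have to be gone before fit() is called.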