I have an input file which contains floating-point numbers to 4 decimal places, e.g.:
13359 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0007 ...
(the first field is the id).
My class uses the loadVectorsFromFile method, which multiplies each number by 10000 and casts it to int(). On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I keep seeing the error "Linkage Z contains negative values".
I seriously think this is a bug, because I have already verified that none of the vectors contains a negative value. Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative-distance error?
=====
# Imports needed by these methods:
import os
from os.path import join as pjoin
from collections import defaultdict
from scipy.cluster import hierarchy

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distances; we use 4 decimal places, so *10000."""
    vectors = {}
    self.winfo("Each vector is limited to %d entries" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = [float(x) for x in l[1:limit + 1]]
            else:
                scores = [float(x) for x in l[1:]]
            if inflate:
                vectors[l[0]] = [int(x * 10000) for x in scores]  # int might save space
            else:
                vectors[l[0]] = scores
    if assertAllPositive:
        # Assert that no vector has a negative value
        for dirID, l in vectors.items():
            if any(x < 0 for x in l):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors
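For reference, here is a minimal standalone sketch of the same parse-and-inflate step on one input line (the sample values are taken from the example at the top; this is an illustration, not part of the class):

line = "13359\t0.0000\t0.0000\t0.0001\t0.0001\t0.0002\t0.0003\t0.0007"
fields = line.split('\t')
scores = [float(x) for x in fields[1:]]
# Note that int() truncates toward zero, so a product like 2.9999999999999996
# would silently become 2; int(round(x * 10000)) avoids that off-by-one.
inflated = [int(x * 10000) for x in scores]
print(fields[0], inflated)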
def main(self, inputDir, outputDir, limit=0,
         inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vectors from a file and starts clustering.
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)
    vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
    for threshold in [float(x) / 30 for x in range(20, 30)]:
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w', encoding='utf-8') as outf:
                for clusterNo, cluster in clusters.items():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outf.write("%s\t%s\n" % (feature, group))
                    outf.write("\n")
def _hclustering(self, threshold, vectors):
    """Function which you should call to vary the threshold.
    vectors: { featureID: [tfidf score, tfidf score, ...] }
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(list(vectors.values()), threshold, metric='cosine')
        except ValueError as e:
            self.werror("_hclustering: %s" % str(e))
            return False
        for i, featureID in enumerate(vectors.keys()):
            clusters[results[i]].append(featureID)
        return clusters
    return False
=====
This is because of floating-point inaccuracy: some of the distances between your vectors, instead of being 0, are for example -0.000000000000000002. Use numpy.clip() to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr, 0, 1, dmatr) and you should be OK.
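If it helps, here is a minimal sketch of how that fix can be wired into code like the above (the sample vectors are made up). Since fclusterdata computes the distances internally, you cannot clip them there; instead, compute the condensed distance matrix yourself with pdist, clip it, and feed it to linkage/fcluster, which is what fclusterdata does under the hood (with method='single' and criterion='inconsistent' by default):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster import hierarchy

# Made-up vectors; parallel rows should have a cosine distance of exactly 0,
# but floating-point rounding can yield a tiny negative value instead.
X = np.array([[0, 0, 1, 1, 2, 3, 7],
              [0, 0, 2, 2, 4, 6, 14],
              [7, 3, 2, 1, 1, 0, 0]], dtype=float)

d = pdist(X, metric='cosine')   # condensed distance matrix
d = np.clip(d, 0, 1)            # clamp rounding noise like -2e-16 back to 0

# Equivalent to fclusterdata(X, t, metric='cosine'), but on cleaned distances:
Z = hierarchy.linkage(d, method='single')
labels = hierarchy.fcluster(Z, t=0.7, criterion='inconsistent')
print(labels)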