pdist for theano tensor

Tags:

python

scipy

matlab

theano

I have a theano symbolic matrix

x = T.fmatrix('input')

x will later be populated by n vectors of dimension d (at training time).

I would like to have the Theano equivalent of pdist (scipy.spatial.distance.pdist), something like

D = theano.pdist( x )

How can I achieve this?

Calling scipy.spatial.distance.pdist on x directly does not work as x at this stage is only symbolic...

Update: I would very much like to mimic pdist's "compact" behavior, that is, to compute only about half of the n x n entries of the distance matrix.


asked Sep 17 '14 09:09

Shai

1 Answer

pdist from scipy is a collection of different distance functions, and no single Theano equivalent covers all of them at once. However, each specific distance is a closed-form mathematical expression, so it can be written down as a Theano expression and then compiled.

Take as an example the Minkowski p-norm distance (copy-and-pasteable):

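import theano
import theano.tensor as T

X = T.fmatrix('X')  # nX x d
Y = T.fmatrix('Y')  # nY x d
P = T.scalar('P')   # order p of the Minkowski norm
# Broadcast to a 3D tensor of shape (nX, nY, d) holding all pairwise differences
translation_vectors = X.reshape((X.shape[0], 1, -1)) - Y.reshape((1, Y.shape[0], -1))
# Sum |difference|^p over the feature axis, then take the p-th root
minkowski_distances = (abs(translation_vectors) ** P).sum(2) ** (1. / P)
f_minkowski = theano.function([X, Y, P], minkowski_distances)

Note that Python's built-in abs calls the __abs__ method of its argument, so abs applied to a symbolic variable also yields a Theano expression. We can now compare this to pdist: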

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
d = 20 # dimension
nX = 10
nY = 30
x = rng.randn(nX, d).astype(np.float32)
y = rng.randn(nY, d).astype(np.float32)

ps = [1., 3., 2.]

for p in ps:
    # Keep only the strict upper triangle, in the same order as pdist's output
    d_theano = f_minkowski(x, x, p)[np.triu_indices(nX, 1)]
    d_scipy = pdist(x, p=p, metric='minkowski')
    print("Testing p=%1.2f, discrepancy %1.3e" % (p, np.sqrt(((d_theano - d_scipy) ** 2).sum())))

This yields

Testing p=1.00, discrepancy 1.322e-06
Testing p=3.00, discrepancy 4.277e-07
Testing p=2.00, discrepancy 4.789e-07

As you can see, the two agree up to float32 rounding, but the function f_minkowski is slightly more general, since it compares the rows of two possibly different arrays. If the same array is passed twice, f_minkowski returns the full square matrix, whereas pdist returns a condensed vector without redundant entries. If this behaviour is desired, it can also be implemented fully dynamically, but I will stick to the general case here.

One specialization is worth noting, though. For p=2, the binomial expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> simplifies the calculation and saves a lot of memory. The general Minkowski distance, as implemented above, creates a 3D array of shape (nX, nY, d) (the price of avoiding for-loops and cumulative summing), which can be prohibitive depending on the dimension d (and on nX, nY). For p=2 we can instead write
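The condensed layout mentioned above can be recovered from a full square distance matrix by reading off its strict upper triangle row by row. Below is a minimal NumPy-only sketch of that step. The name condensed_from_square is a hypothetical helper, not part of SciPy or Theano, and it only reorders entries: the full n x n matrix is still computed first, so this does not by itself halve the work.

```python
import numpy as np

def condensed_from_square(D):
    """Return the strict upper triangle of a square matrix D, row by row.

    Hypothetical helper (not a SciPy/Theano function). The ordering is
    (0,1), (0,2), ..., (0,n-1), (1,2), ..., (n-2,n-1), which matches the
    condensed vector returned by pdist.
    """
    D = np.asarray(D)
    n = D.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # index pairs with row < col
    return D[rows, cols]

# Four points on a line, so the distances are easy to check by hand.
points = np.array([[0.0], [1.0], [3.0], [6.0]])
full = np.abs(points - points.T)          # 4 x 4 matrix of |x_i - x_j|
condensed = condensed_from_square(full)   # n * (n - 1) / 2 = 6 entries
print(condensed)  # [1. 3. 6. 2. 5. 3.]
```

This is the same indexing used in the comparison loop above, where np.triu_indices(nX, 1) is applied to the compiled function's output.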

squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (Y ** 2).sum(1).reshape((1, Y.shape[0])) - 2 * X.dot(Y.T)
f_euclidean = theano.function([X, Y], T.sqrt(squared_euclidean_distances))

which only uses O(nX * nY) space instead of O(nX * nY * d). We check for correspondence, this time on the general problem with two different arrays:

d_eucl = f_euclidean(x, y)
d_minkowski2 = f_minkowski(x, y, 2.)
print("Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy %1.3e" % ((d_eucl - d_minkowski2) ** 2).sum())

yielding

Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy 1.464e-11
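For readers without a working Theano install, the same memory trade-off can be reproduced in plain NumPy. The sketch below is an illustration under my own assumptions, not the compiled Theano functions from above: the array sizes and variable names are made up for the example, and the clipping step before the square root is an added safeguard against rounding that the Theano version does not include.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(5, 3)   # 5 points of dimension 3 (illustrative sizes)
y = rng.randn(7, 3)   # 7 points of dimension 3

# Broadcasting version: builds a (5, 7, 3) intermediate array.
diff = x[:, None, :] - y[None, :, :]
d_broadcast = np.sqrt((diff ** 2).sum(axis=2))

# Expansion version: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>,
# which only needs (5, 7) intermediates.
sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x.dot(y.T)
# Rounding can leave tiny negative values; clip them before the square root.
d_expanded = np.sqrt(np.maximum(sq, 0.0))

print(np.allclose(d_broadcast, d_expanded))
```

The clipping matters mostly when the two inputs share rows (for example when the same array is passed twice), because the expansion can then produce values slightly below zero where the true distance is exactly zero.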


answered Oct 19 '22 23:10

eickenberg
