I'm new to TensorFlow 2.0, and haven't done much except designing and training some artificial neural networks from boilerplate code. I'm trying to solve an exercise for newcomers to the new TensorFlow. I created some code, but it doesn't work. Below is the problem definition:
Assuming we have a tensor M of rational numbers with shape (a, b, c) and a scalar p ∈ (0, 1) (memory factor), let's create a function that will return a tensor N with shape (a, b, c). Moving along axis c, each element of N should be increased by the value of its predecessor multiplied by p.
Assuming we have the tensor:
T = [x1, x2, x3, x4]
with shape (1, 1, 4), we would like to get the vector:
[x1, x2 + x1·p, x3 + (x2 + x1·p)·p, x4 + (x3 + (x2 + x1·p)·p)·p]
The solution should be created in TensorFlow 2.0 and should be focused on delivering the shortest execution time on CPU. The created graph should allow the derivative to be calculated efficiently both on tensor M and on value p.
This is the code I have created so far:
import tensorflow as tf

@tf.function
def vectorize_predec(t, p):
    last_elem = 0
    result = []
    for el in t:
        result.append(el + (p * last_elem))
        last_elem = el + (p * last_elem)
    return result

p = tf.Variable(0.5, dtype='double')
m = tf.constant([[0, 1, 2, 3, 4],
                 [1, 3, 5, 7, 10],
                 [1, 1, 1, -1, 0]])

vectorize_predec(m, p)
But it throws a TypeError.
I looked around the documentation and I've seen functions like cumsum and polyeval, but I'm not sure they fit my needs. To my understanding, I need to write my own custom function annotated with @tf.function. I'm also not sure how to handle 3-dimensional tensors properly according to the problem definition (adding the predecessor should happen along the last ("c") axis).
I've seen in the documentation (here: https://www.tensorflow.org/tutorials/customization/performance) that there are ways to measure the size of the produced graph. However, I'm not sure how a "graph" allows the derivative to be calculated efficiently both on tensor M and on value p. ELI5 answers appreciated, or at least some materials I can read to educate myself better.
Thanks a lot!
I'll give you a few different ways to implement this. I think the most obvious solution is to use tf.scan:
import tensorflow as tf
def apply_momentum_scan(m, p, axis=0):
    # Put axis first
    axis = tf.convert_to_tensor(axis, dtype=tf.int32)
    perm = tf.concat([[axis], tf.range(axis), tf.range(axis + 1, tf.rank(m))], axis=0)
    m_t = tf.transpose(m, perm)
    # Do computation
    res_t = tf.scan(lambda a, x: a * p + x, m_t)
    # Undo transpose
    perm_t = tf.concat([tf.range(1, axis + 1), [0], tf.range(axis + 1, tf.rank(m))], axis=0)
    return tf.transpose(res_t, perm_t)
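To get an intuition for what that tf.scan call does, here is a minimal standalone sketch (not part of the solution itself) on a 1-D tensor: the accumulator carries the previous result, which is scaled by p and added to the current element, which is exactly the recurrence from the problem definition.

import tensorflow as tf

p = tf.constant(0.5)
t = tf.constant([1.0, 2.0, 3.0, 4.0])
# With no initializer, tf.scan uses the first element as the initial accumulator
out = tf.scan(lambda acc, x: acc * p + x, t)
print(out.numpy())  # [1.    2.5   4.25  6.125] -> x1, x2 + x1*p, x3 + (x2 + x1*p)*p, ...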
However, you can also implement this as a particular matrix product, if you build a matrix of exponential factors:
import tensorflow as tf
def apply_momentum_matmul(m, p, axis=0):
    # Put axis first and reshape
    m = tf.convert_to_tensor(m)
    p = tf.convert_to_tensor(p)
    axis = tf.convert_to_tensor(axis, dtype=tf.int32)
    perm = tf.concat([[axis], tf.range(axis), tf.range(axis + 1, tf.rank(m))], axis=0)
    m_t = tf.transpose(m, perm)
    shape_t = tf.shape(m_t)
    m_tr = tf.reshape(m_t, [shape_t[0], -1])
    # Build factors matrix
    r = tf.range(tf.shape(m_tr)[0])
    p_tr = tf.linalg.band_part(p ** tf.dtypes.cast(tf.expand_dims(r, 1) - r, p.dtype), -1, 0)
    # Do computation
    res_tr = p_tr @ m_tr
    # Reshape back and undo transpose
    res_t = tf.reshape(res_tr, shape_t)
    perm_t = tf.concat([tf.range(1, axis + 1), [0], tf.range(axis + 1, tf.rank(m))], axis=0)
    return tf.transpose(res_t, perm_t)
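To see what that factors matrix looks like, here is a small illustration (just for intuition, not part of the solution) with 4 elements and p = 0.5: it is a lower-triangular matrix whose entry (i, j) is p**(i - j), so row i of the product mixes element j with weight p**(i - j).

import tensorflow as tf

p = tf.constant(0.5)
r = tf.range(4)
# Lower-triangular matrix of powers p**(i - j)
p_tr = tf.linalg.band_part(p ** tf.dtypes.cast(tf.expand_dims(r, 1) - r, p.dtype), -1, 0)
print(p_tr.numpy())
# [[1.    0.    0.    0.   ]
#  [0.5   1.    0.    0.   ]
#  [0.25  0.5   1.    0.   ]
#  [0.125 0.25  0.5   1.   ]]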
This can also be rewritten with tf.tensordot to avoid the initial transpose (which is expensive in TensorFlow):
import tensorflow as tf
def apply_momentum_tensordot(m, p, axis=0):
    m = tf.convert_to_tensor(m)
    # Build factors matrix for the chosen axis
    r = tf.range(tf.shape(m)[axis])
    p_mat = tf.linalg.band_part(p ** tf.dtypes.cast(tf.expand_dims(r, 1) - r, p.dtype), -1, 0)
    # Do computation
    res_t = tf.linalg.tensordot(m, p_mat, axes=[[axis], [1]])
    # Move the contracted dimension back into place
    last_dim = tf.rank(res_t) - 1
    perm_t = tf.concat([tf.range(axis), [last_dim], tf.range(axis, last_dim)], axis=0)
    return tf.transpose(res_t, perm_t)
The three functions would be used in a similar way:
import tensorflow as tf
p = tf.Variable(0.5, dtype=tf.float32)
m = tf.constant([[0, 1, 2, 3, 4],
                 [1, 3, 5, 7, 10],
                 [1, 1, 1, -1, 0]], tf.float32)
# apply_momentum is one of the functions above
print(apply_momentum(m, p, axis=0).numpy())
# [[ 0.     1.     2.     3.     4.   ]
#  [ 1.     3.5    6.     8.5   12.   ]
#  [ 1.5    2.75   4.     3.25   6.   ]]
print(apply_momentum(m, p, axis=1).numpy())
# [[ 0.      1.      2.5     4.25    6.125 ]
#  [ 1.      3.5     6.75   10.375  15.1875]
#  [ 1.      1.5     1.75   -0.125  -0.0625]]
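As a quick sanity check against the recurrence: along axis=0 the second column accumulates as 1, then 3 + 1·0.5 = 3.5, then 1 + 3.5·0.5 = 2.75, which matches the output above.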
Using a matrix product is asymptotically more complex, but it can be faster than scanning. Here is a small benchmark:
import tensorflow as tf
import numpy as np
# Make test data
tf.random.set_seed(0)
p = tf.constant(0.5, dtype=tf.float32)
m = tf.random.uniform([100, 30, 50], dtype=tf.float32)
# Axis 0
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_matmul(m, p, 0).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_tensordot(m, p, 0).numpy()))
# True
%timeit apply_momentum_scan(m, p, 0)
# 11.5 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 0)
# 1.36 ms ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 0)
# 1.62 ms ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Axis 1
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_matmul(m, p, 1).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_tensordot(m, p, 1).numpy()))
# True
%timeit apply_momentum_scan(m, p, 1)
# 4.27 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 1)
# 1.27 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 1)
# 1.2 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Axis 2
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_matmul(m, p, 2).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_tensordot(m, p, 2).numpy()))
# True
%timeit apply_momentum_scan(m, p, 2)
# 6.29 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_momentum_matmul(m, p, 2)
# 1.41 ms ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit apply_momentum_tensordot(m, p, 2)
# 1.05 ms ± 26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So the matrix product seems to win. Let's see if this scales:
import tensorflow as tf
import numpy as np
# Make test data
tf.random.set_seed(0)
p = tf.constant(0.5, dtype=tf.float32)
m = tf.random.uniform([1000, 300, 500], dtype=tf.float32)
# Axis 0
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_matmul(m, p, 0).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 0).numpy(), apply_momentum_tensordot(m, p, 0).numpy()))
# True
%timeit apply_momentum_scan(m, p, 0)
# 784 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 0)
# 1.13 s ± 76.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 0)
# 1.3 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Axis 1
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_matmul(m, p, 1).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 1).numpy(), apply_momentum_tensordot(m, p, 1).numpy()))
# True
%timeit apply_momentum_scan(m, p, 1)
# 852 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 1)
# 659 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 1)
# 741 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Axis 2
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_matmul(m, p, 2).numpy()))
# True
print(np.allclose(apply_momentum_scan(m, p, 2).numpy(), apply_momentum_tensordot(m, p, 2).numpy()))
# True
%timeit apply_momentum_scan(m, p, 2)
# 1.06 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_matmul(m, p, 2)
# 924 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit apply_momentum_tensordot(m, p, 2)
# 483 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Well, now it's not so clear anymore. Scanning is still not super fast, but the matrix products are sometimes slower. As you can imagine, if you go to even bigger tensors, the complexity of the matrix products will dominate the timings.
So, if you want the fastest solution and know your tensors are not going to get huge, use one of the matrix product implementations. If you're fine with okay speed but want to make sure you don't run out of memory (the matrix solutions also take much more of it) and want predictable timing, you can use the scanning solution.
Note: the benchmarks above were carried out on CPU; results may vary significantly on GPU.
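Regarding the gradient part of your question: all of the implementations above are built only from differentiable TensorFlow ops, so you can get the derivative with respect to both m and p with tf.GradientTape (and wrap the whole thing in tf.function if you want a graph). Here is a minimal sketch, assuming apply_momentum_tensordot is defined as above and using a reduced sum as a stand-in loss:

import tensorflow as tf

p = tf.Variable(0.5, dtype=tf.float32)
m = tf.Variable(tf.random.uniform([3, 4, 5], dtype=tf.float32))

with tf.GradientTape() as tape:
    out = apply_momentum_tensordot(m, p, axis=2)
    loss = tf.reduce_sum(out)  # placeholder loss, just for illustration

# Gradients with respect to both the tensor m and the scalar p
grad_m, grad_p = tape.gradient(loss, [m, p])
print(grad_m.shape, grad_p.shape)  # (3, 4, 5) ()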