 

Conv1D(filters=N, kernel_size=K) versus Dense(output_dim=N) layer

I have an input tensor T of size [batch_size=B, sequence_length=L, dim=K]. Is applying a 1D convolution of N filters and kernel size K the same as applying a dense layer with output dimension of N?

For example in Keras:

Conv1D(filters=N, kernel_size=K)

vs

Dense(units=N)

Note that for Conv1D, I reshape the tensor T to [batch_size*sequence_length, dim=K, 1] to perform the convolution.

Both result in 20,480 learnable weights + 256 bias terms. Yet for me, Conv1D learns much faster initially. I don't see how Dense() is any different in this case, and I'd like to use Dense() to get the lower VRAM consumption and to avoid reshaping the tensor.
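For reference, a minimal Keras sketch of the two setups. The sizes K=80 and N=256 below are hypothetical, chosen only because 80·256 = 20,480 matches the parameter count above:

import tensorflow as tf

K, N = 80, 256  # hypothetical input dim and number of outputs

conv = tf.keras.Sequential([
    tf.keras.Input(shape=(K, 1)),                      # input reshaped to [B*L, K, 1]
    tf.keras.layers.Conv1D(filters=N, kernel_size=K),  # kernel spans the whole frame
])
dense = tf.keras.Sequential([
    tf.keras.Input(shape=(K,)),                        # Dense acts on the last axis of [B, L, K]
    tf.keras.layers.Dense(units=N),
])

print(conv.count_params(), dense.count_params())  # 20736 20736 (20,480 weights + 256 biases each)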


Follow up clarification:

The two answers provided two different ways to perform the 1D convolution. How are the following two methods different?

Method 1:

- Reshape input to [batch_size * frames, frame_len]
- convolve with Conv1D(filters=num_basis, kernel_size=frame_len)
- Reshape the output of the convolution layer to [batch_size, frames, num_basis]

Method 2:

- Convolve with Conv1D(filters=num_basis, kernel_size=1) on Input=[batch_size, frames, frame_len]. No input reshaping.
- No need to reshape output, it's already [batch_size, frames, num_basis]

My understanding is that it's the same operation (they have the same number of parameters). However, I'm getting faster convergence with Method 1.
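For concreteness, a rough sketch of the two wirings with hypothetical sizes (not the actual model), showing that both yield [batch_size, frames, num_basis] and have the same number of parameters:

import tensorflow as tf

batch_size, frames, frame_len, num_basis = 4, 16, 80, 256  # hypothetical sizes
x = tf.random.normal([batch_size, frames, frame_len])

# Method 1: fold the frames into the batch, kernel spanning the whole frame.
m1 = tf.reshape(x, [batch_size * frames, frame_len, 1])
m1 = tf.keras.layers.Conv1D(filters=num_basis, kernel_size=frame_len)(m1)
m1 = tf.reshape(m1, [batch_size, frames, num_basis])

# Method 2: no reshape, kernel of size 1 sliding over the frames axis.
m2 = tf.keras.layers.Conv1D(filters=num_basis, kernel_size=1)(x)

print(m1.shape, m2.shape)  # (4, 16, 256) (4, 16, 256)
# Both layers have frame_len*num_basis + num_basis parameters.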

asked Jan 22 '19 by Artash




1 Answer

To achieve the same behaviour as a Dense layer with a Conv1D layer, you need to make sure that every output neuron of the Conv1D is connected to every input neuron.

For an input of size [batch_size, L, K], your Conv1D needs a kernel of size L and as many filters as you want output neurons. To understand why, let's go back to the definition of a 1D convolution, or temporal convolution.

The Conv1D layer's parameters consist of a set of learnable filters. Every filter is usually small temporally and extends through the full depth of the input volume. For example, in your problem, a typical filter might have size 5xK (i.e. 5 steps of your sequence, and K because your input has depth K). During the forward pass, we slide (more precisely, convolve) each filter across the steps of the input sequence and compute dot products between the entries of the filter and the input at each position. As we slide the filter, we produce a 1-dimensional activation map that gives the responses of that filter at every position.
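To make the sliding dot product concrete, here is a tiny numpy illustration (the sizes are arbitrary, not taken from the question):

import numpy as np

L, K, ksize = 10, 2, 5
x = np.random.randn(L, K)        # one sequence of length L with depth K
w = np.random.randn(ksize, K)    # one filter of size 5xK

# "Valid" convolution (cross-correlation, as deep-learning libraries define it):
# one dot product per position the filter can occupy along the sequence.
activation_map = np.array([np.sum(x[t:t + ksize] * w)
                           for t in range(L - ksize + 1)])
print(activation_map.shape)      # (6,) i.e. L - kernel_size + 1 positions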

Now, if your filters are of size LxK, you can easily see that there is only one possible spatial position (since the filter is the same size as the sequence), and its value is the dot product between the full input volume and the LxK weights of each filter. The different filters composing your Conv1D then behave exactly like the units of a Dense layer: they are fully connected to your input.

You can verify this behaviour with the following code:

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder / tf.layers)
import numpy as np

l = 10  # sequence length
k = 2   # input depth
n = 5   # number of filters / units

x = tf.placeholder(tf.float32, [None, l, k])
# Conv1D whose kernel spans the whole sequence (kernel_size=l).
c = tf.layers.conv1d(inputs=x, strides=1, filters=n, kernel_size=l, kernel_initializer=tf.ones_initializer())
# Dense layer on the flattened input [batch, l*k].
d = tf.layers.dense(inputs=tf.reshape(x, [-1, l*k]), units=n, kernel_initializer=tf.ones_initializer())

batch_size = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    r_conv, r_dense = sess.run([c, d], {x: np.random.normal(size=[batch_size, l, k])})

print(r_conv.shape, r_dense.shape)
# (10, 1, 5) (10, 5)

print(np.allclose(r_conv.reshape([batch_size, -1]), r_dense.reshape([batch_size, -1])))
# True

For the same initialization, the outputs are indeed equal.
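For readers on TensorFlow 2.x (where tf.placeholder and tf.layers are gone), a rough sketch of the same check using tf.keras and eager execution might look like this:

import numpy as np
import tensorflow as tf  # 2.x, eager mode

l, k, n, batch_size = 10, 2, 5, 10

conv = tf.keras.layers.Conv1D(filters=n, kernel_size=l, kernel_initializer="ones")
dense = tf.keras.layers.Dense(units=n, kernel_initializer="ones")

x = np.random.normal(size=[batch_size, l, k]).astype(np.float32)

r_conv = conv(x)                                     # shape (10, 1, 5)
r_dense = dense(tf.reshape(x, [batch_size, l * k]))  # shape (10, 5)

print(np.allclose(r_conv.numpy().reshape([batch_size, -1]), r_dense.numpy()))
# True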

Regarding speed, I suppose one of the main reasons the Conv1D version was faster and used more VRAM is your reshape: you were virtually increasing your batch size, improving parallelization at the cost of memory.


Edit after follow up clarification:

Maybe I misunderstood your question. Method 1 and Method 2 are the same, but they are not the same as applying a Dense layer to the input [B, LxK].

Here, each output is connected to the full dimension K of a single frame, and the same weights are then reused for every time step of the sequence, meaning that both methods are fully connected to the frame but not to the sequence. This is indeed equivalent to a Dense layer applied to [BxL, K].

You can verify this behaviour with the following code:

l = 10  # sequence length (frames)
k = 2   # frame length / depth
n = 5   # number of filters (num_basis)

x = tf.placeholder(tf.float32, [None, l, k])
# Method 2: kernel_size=1 on the un-reshaped input [batch, l, k].
c2 = tf.layers.conv1d(inputs=x, strides=1, filters=n, kernel_size=1, kernel_initializer=tf.ones_initializer())
# Method 1: fold the sequence into the batch and use a kernel spanning the whole frame.
c3 = tf.layers.conv1d(inputs=tf.reshape(x, [-1, k, 1]), strides=1, filters=n, kernel_size=k, kernel_initializer=tf.ones_initializer())
# Dense layer applied frame-wise on [batch*l, k].
d2 = tf.layers.dense(inputs=tf.reshape(x, [-1, k]), units=n, kernel_initializer=tf.ones_initializer())

batch_size = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    r_d2, r_c2, r_c3 = sess.run([d2, c2, c3], {x: np.random.normal(size=[batch_size, l, k])})
    r_d2 = r_d2.reshape([batch_size, l, n])
    r_c3 = r_c3.reshape([batch_size, l, n])

print(r_d2.shape, r_c2.shape, r_c3.shape)
# (10, 10, 5) (10, 10, 5) (10, 10, 5)

print(np.allclose(r_d2, r_c2))
# True
print(np.allclose(r_d2, r_c3))
# True
print(np.allclose(r_c2, r_c3))
# True

Concerning speed, it must be because Method 1 needs only one dot product per (reshaped) sample to compute its result, whereas Method 2 needs L of them per sample, plus other operations.

answered Sep 30 '22 by Olivier Dehaene