 

Linear vs nonlinear neural network?

I'm new to machine learning and neural networks. I know how to build a nonlinear classification model, but my current problem has a continuous output. I've been searching for information on neural network regression, but all I encounter is information on linear regression - nothing about nonlinear cases. Which is odd, because why would someone use neural networks to solve a simple linear regression anyway? Isn't that like killing a fly with a nuclear bomb?

So my question is this: what makes a neural network nonlinear? (Hidden layers? Nonlinear activation function?) Or do I have a completely wrong understanding of the word "linear" - can a linear regression NN accurately model datasets that are more complex than y=aX+b? Is the word "linear" used just as the opposite of "logistic"?

(I'm planning to use TensorFlow, but the TensorFlow Linear Model Tutorial uses a binary classification problem as an example, so that doesn't help me either.)

asked Dec 20 '16 by nirnroot


2 Answers

For starters, a neural network can model (approximate) any function, not just linear functions. Have a look at this: http://neuralnetworksanddeeplearning.com/chap4.html.

A neural network has non-linear activation layers, and that is what gives it its non-linear element.

The function relating the input to the output is determined by the neural network and the amount of training it gets. If you supply two variables that have a linear relationship, your network will learn it as long as you don't overfit. Similarly, a sufficiently complex neural network can learn any function.
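
As a minimal sketch of that (my own illustration, using TensorFlow's Keras API since you mention TensorFlow; the layer sizes and the sine target are arbitrary choices), a small network with non-linear hidden activations and a single linear output unit can be fit to a clearly non-linear regression target:

#%% Minimal non-linear regression sketch (illustration only, not from the original answer)
import numpy as np
import tensorflow as tf

# Non-linear 1-D target: y = sin(x) plus a little noise
x = np.linspace(-3, 3, 400).reshape(-1, 1).astype("float32")
y = (np.sin(x) + 0.1 * np.random.randn(*x.shape)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),   # non-linear hidden layer
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                       # linear output unit -> continuous regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)
print(model.predict(x[:5]))   # predictions follow the sine curve, not a straight line

Swap the relu activations for None (i.e. purely linear layers in this standard feed-forward stack) and the fit collapses back to a straight line.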

answered Oct 04 '22 by martianwars

WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.

Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.

First let me point out that we can talk about linearity in parameters or linearity in the variables.

The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).

For example, notice that the predicted values of the following regression are considered linear predictions despite the non-linear transformation of the input, because the output is a linear combination of the parameters (even though this model is non-linear in its variables):

$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2$    (1)

Now, for simplicity, let us consider a single-neuron, single-layer neural network with input $x$, weight $w$, bias $b$, and transfer (activation) function $f$:

$a = f(w x + b)$    (2)

If the transfer function is linear, then:

$a = w x + b$    (3)

As you have probably already noticed, this is just a linear regression. Even if we were to add multiple inputs $x_i$ and multiple neurons $j$, each with a linear activation function, we would merely have an ensemble of linear regressions (each linear in its parameters, so this simple neural network is still linear):

$a_j = \sum_i w_{ij} x_i + b_j$    (4)

Now, going back to (3), let's add a second neuron layer on top, so that we have a network with two layers of one neuron each (both with linear activation functions):

$a_1 = w_1 x + b_1$    (5)   (first layer)

$a_2 = w_2 a_1 + b_2$    (6)   (second layer)

Now notice:

$a_2 = w_2 (w_1 x + b_1) + b_2$    (7)

which reduces to:

$a_2 = (w_2 w_1) x + (w_2 b_1 + b_2) = W x + B$    (8)

where $W = w_2 w_1$ and $B = w_2 b_1 + b_2$.

This means that our two-layer network (a single neuron in each layer) is not linear in its parameters, despite every activation function in the network being linear; however, it is still linear in the variables. Thus, once training has finished, the model will be linear in both variables and parameters. Both of these points are important, because you cannot replicate this simple two-layer network with a single regression and still capture all the effects of the model. Further, let me state this clearly: if you use a model with multiple layers, there is no guarantee that the output will be non-linear in its variables (if you use a simple MLP with linear activation functions, your picture is still going to be a line).
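
As a quick numeric sanity check (toy values of my own, not part of the derivation above), stacking two linear layers really does collapse to a single line in the input, even though the combined slope $w_2 w_1$ is a non-linear (product) function of the parameters:

# Toy check of (5)-(8): two stacked linear layers collapse to a single line in x,
# with slope W = w2*w1 and intercept B = w2*b1 + b2.
w1, b1 = 0.5, 1.0    # first layer
w2, b2 = -2.0, 3.0   # second layer

def two_layer(x):
    return w2 * (w1 * x + b1) + b2

W, B = w2 * w1, w2 * b1 + b2
for x in (-1.0, 0.0, 2.0):
    assert two_layer(x) == W * x + B   # identical outputs: linear in x

print(W, B)   # -1.0 1.0 -> the collapsed model is just the line -x + 1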

That being said, let's take a look at the following statement from @Pawu regarding this answer:

The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.

While you could argue that what @Pawu is saying is technically true, I think they are implying:

The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.

I would argue that this modified statement is wrong and can easily be demonstrated to be incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to certain network architectures, you cannot introduce non-linearities without activation functions, but that is an arbitrary restriction and it does not generalize to all network models.

Let me make this concrete. First, take the simple XOR problem. This is a basic classification problem where you are attempting to establish a boundary between data points arranged like so:

[figure: the XOR problem - (0,0) and (1,1) in one class, (0,1) and (1,0) in the other]

The kicker about this problem is that it is not linearly separable, meaning no single straight line can perfectly separate the two classes. If you read about it anywhere on the internet, I am sure you will be told that this problem cannot be solved by a neural network using only linear activation functions (notice that nothing is said about the architecture). That statement is only true in an extremely limited context and is wrong in general.
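
(As a quick side check of my own, the "not linearly separable" part is easy to confirm numerically: no rule of the form sign(w1*x1 + w2*x2 + b) classifies all four XOR points correctly.)

# Brute-force sketch (my own illustration, coarse grid of candidate weights):
# no single linear threshold gets all four XOR points right.
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

found = False
for w1, w2, b in itertools.product(np.linspace(-2, 2, 41), repeat=3):
    pred = ((w1 * X[:, 0] + w2 * X[:, 1] + b) > 0).astype(int)
    if np.array_equal(pred, y):
        found = True
        break

print(found)   # False: XOR is not linearly separable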

Allow me to demonstrate. Below is a very simple hand-written neural network. It takes randomly generated weights between -1 and 1 and an "xor_network" function that defines the architecture (notice: no sigmoid, hardlims, etc., only linear transformations of the form mX or mX + b), and it trains using standard backward propagation:

#%% Packages
import numpy as np

#%% Data
data = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]])
np.random.shuffle(data)
train_data = data[:, :2]
target_data = data[:, 2]

#%% XOR architecture
class XOR_class():

    def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
        self.train_data = train_data
        self.target_data = target_data
        self.alpha = alpha
        self.epochs = epochs

        # Random weights
        self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b0 = np.random.uniform(low=-1, high=1, size=(1))
        self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b2 = np.random.uniform(low=-1, high=1, size=(1))

    # xor network (linear transfer functions only)
    def xor_network(self, X0):
        n0 = np.dot(X0, self.W0) + self.b0
        X1 = n0*X0  # network connection: the (linear) layer output multiplies the input
        a = np.dot(X1, self.W2) + self.b2
        return(a, X1)

    # Training the xor network
    def train(self):
        for epoch in range(self.epochs):
            for i in range(len(self.train_data)):
                # Forward Propagation:
                X0 = self.train_data[i]
                a, X1 = self.xor_network(X0)

                # Backward Propagation:
                e = self.target_data[i] - a
                s_2 = -2*e

                # Update Weights:
                self.W0 = self.W0 - (self.alpha*s_2*X0)
                self.b0 = self.b0 - (self.alpha*s_2)
                self.W2 = self.W2 - (self.alpha*s_2*X1)
                self.b2 = self.b2 - (self.alpha*s_2)

        # Restart training if we get lost in the parameter space.
        if np.isnan(a) or (a > 1) or (a < -1):
            print('Bad initialization, reinitializing.')
            self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
            self.b0 = np.random.uniform(low=-1, high=1, size=(1))
            self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
            self.b2 = np.random.uniform(low=-1, high=1, size=(1))
            self.train()

    # Predicting using the trained weights.
    def predict(self, test_data):
        for i in test_data:
            a, X1 = self.xor_network(i)
            # I cut off decimals past 12 for convenience, not necessary.
            print(f'input: {i} - output: {np.round(a, 12)}')

Now let's take a look at the output:

#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:, :2]
xor.predict(test_data)

input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]

And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right: classification with purely linear activation functions, no sigmoid needed)...

The only catch here is that I cut off all decimals past 12, but let's be honest, 7.3 x 10^-16 is basically 0.

Now, to be fair, I am doing a little trick: I am using the network connections to get the non-linear result (the hidden layer's linear output n0 is multiplied back into the input, X1 = n0*X0), but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.
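
To spell the trick out (my own expansion; sympy is my addition here and is not part of the original code), multiplying the first layer's output back into the input creates squared and cross terms, and the x1*x2 cross-term is exactly what XOR needs:

# Symbolic sketch of the architecture above (sympy used purely for illustration):
# a = W2 . ((W0 . X + b0) * X) + b2 expands into x1**2, x2**2 and x1*x2 terms.
import sympy as sp

x1, x2, w01, w02, b0, w21, w22, b2 = sp.symbols('x1 x2 w01 w02 b0 w21 w22 b2')

n0 = w01*x1 + w02*x2 + b0            # first layer: linear in the inputs
X1 = [n0*x1, n0*x2]                  # the "connection trick": layer output times input
a = w21*X1[0] + w22*X1[1] + b2       # second layer: linear combination

print(sp.expand(a))
# -> contains x1**2, x2**2 and an x1*x2 cross-term; on {0,1} inputs x**2 == x,
#    so the model can represent XOR = x1 + x2 - 2*x1*x2.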

Thus, the answer to your question "what makes a neural network non-linear?" is: non-linearity in the parameters or, obviously, non-linearity in the variables.

This non-linearity in the parameters/variables comes about in two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.

For an example of non-linearity coming about through the activation function, suppose our input space, weights, and biases are all constrained to be strictly positive (for simplicity). Now, using (2) (single layer, single neuron) with a non-linear activation function such as $f(n) = n^2$, we have the following:

$a = (w x + b)^2$    (9)

which reduces to:

$a = \alpha x^2 + \beta x + \gamma$    (10)

where $\alpha = w^2$, $\beta = 2 w b$, and $\gamma = b^2$.

Now, ignoring whatever issues this neural network has, it should be clear that, at the very least, it is non-linear in both the parameters and the variables, and that this non-linearity has been introduced solely by the choice of activation function.
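
A tiny numeric check of that (again with toy values of my own): with a non-linear activation, even a single neuron's input-output mapping is no longer a straight line.

# Toy check: a single neuron with a non-linear activation is not a line in x
# (three strictly positive inputs give two different secant slopes).
import numpy as np

w, b = 2.0, 0.5
f = np.tanh                      # any non-linear activation will do here
xs = np.array([0.5, 1.0, 2.0])
ys = f(w * xs + b)

slope1 = (ys[1] - ys[0]) / (xs[1] - xs[0])
slope2 = (ys[2] - ys[1]) / (xs[2] - xs[1])
print(np.isclose(slope1, slope2))   # False: the mapping is not linear in x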

Finally, yes, neural networks can model complex structure in data that cannot be captured by linear models (see the XOR example above).

EDIT:

As pointed out by @hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks non-linear (they are non-linear in the variables), but that the non-linearity they introduce is degenerate without parameter non-linearity. For example, a single-layer MLP with sigmoid activation functions will produce outputs that are non-linear in the variables, in the sense that the output is not proportional to the input, but in reality this is just an array of generalized linear models. This should be especially obvious if we transformed the targets by the appropriate link function, in which case the activation functions effectively become linear. Now, this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters, we have a whole different beast than regression.
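
To make the GLM point concrete (a small check with toy values of my own): applying the logit link to a single sigmoid neuron's output recovers something that is exactly linear in the input.

# Toy check: a single sigmoid neuron is a generalized linear model --
# applying the logit link to its output recovers w*x + b exactly.
import numpy as np

w, b = 1.7, -0.4
x = np.linspace(-2, 2, 9)
p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid activation
logit = np.log(p / (1.0 - p))            # link function
print(np.allclose(logit, w * x + b))     # True: linear in x after the link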

At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.

answered Oct 04 '22 by itwasthekix