Shallow neural networks describe piecewise linear functions and are good at expressing complex relationships between multidimensional inputs and outputs. Piecewise linear functions are functions composed of multiple linear segments (i.e. the slope of the function changes at certain points), where each segment corresponds to a section of values in the domain. For example:
$$f(x) = \begin{cases} 2x + 1 & \text{if } x < 0 \\ -3x + 2 & \text{if } x \ge 0 \end{cases}$$
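To make this concrete, here is a minimal Python sketch of the piecewise linear function above (the function name `f` is just for illustration):

```python
import numpy as np

def f(x):
    # 2x + 1 on the segment x < 0, -3x + 2 on the segment x >= 0
    return np.where(x < 0, 2 * x + 1, -3 * x + 2)

print(f(np.array([-2.0, -0.5, 0.0, 1.0])))  # [-3.  0.  2. -1.]
```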
How do shallow neural networks work?
Let's take an example of a neural network $y = f[x, \phi]$. This neural network is a function that takes an input $x$ and has parameters $\phi$. When we expand the equation we get

$$y = \phi_0 + \phi_1 \, a[\theta_{10} + \theta_{11} x] + \phi_2 \, a[\theta_{20} + \theta_{21} x] + \phi_3 \, a[\theta_{30} + \theta_{31} x]$$
If you analyse the above equation carefully, you can split it into multiple components: the linear functions ($\theta_{10} + \theta_{11} x$, $\theta_{20} + \theta_{21} x$, $\theta_{30} + \theta_{31} x$), an activation function $a[\cdot]$, the weights $\phi_1, \phi_2, \phi_3$ and the bias $\phi_0$.
The linear functions passed through the activation are what we call hidden units, and we can break them down to

$$h_1 = a[\theta_{10} + \theta_{11} x], \quad h_2 = a[\theta_{20} + \theta_{21} x], \quad h_3 = a[\theta_{30} + \theta_{31} x]$$
Once we are done calculating them, we then multiply them by the weights and sum it all up with the bias, like so:
$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3$$
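Here is a minimal Python sketch of this three-hidden-unit computation. All parameter values are made up purely to show the mechanics, and `a` is a placeholder for the activation function introduced in the next section:

```python
import numpy as np

# Hypothetical parameter values, chosen only to illustrate the computation
theta_10, theta_11 = 0.0, 1.0
theta_20, theta_21 = -1.0, 2.0
theta_30, theta_31 = 0.5, -1.0
phi_0, phi_1, phi_2, phi_3 = 0.1, 1.0, -0.5, 2.0

def a(z):
    # Placeholder activation (ReLU, introduced in the next section)
    return np.maximum(0.0, z)

x = 1.5
h1 = a(theta_10 + theta_11 * x)  # first hidden unit
h2 = a(theta_20 + theta_21 * x)  # second hidden unit
h3 = a(theta_30 + theta_31 * x)  # third hidden unit
y = phi_0 + phi_1 * h1 + phi_2 * h2 + phi_3 * h3
print(y)  # ~0.6 with these made-up parameters
```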
The activation function $a[\cdot]$ is there to add non-linearity to the neural network. Without it, the neural network is just a big linear equation. There are many examples of activation functions, but the most common one, and the one we shall use, is called the rectified linear unit, or ReLU. It's a function that returns zero (0) if the input is negative, otherwise it returns the input unchanged, i.e.
$$\text{ReLU}(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \ge 0 \end{cases}$$
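In NumPy this is a one-liner; a minimal sketch:

```python
import numpy as np

def relu(z):
    # Returns 0 for negative inputs and z otherwise, applied elementwise
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.1, 0.0, 3.0])))  # [0. 0. 0. 3.]
```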
Activation functions accomplish this non-linearity by determining which neurons in the neural network are active or inactive based on the input. Take an example where we have the following parameters.
In the above example we have shown a neural network with only 3 hidden units; however, a shallow neural network can have very many hidden units depending on the complexity of the function it's trying to describe. We can generalize this and consider the case with $D$ hidden units, where the $d$-th hidden unit is:
$$h_d = a[\theta_{d0} + \theta_{d1} x]$$
And when combined linearly, we arrive at the output layer equation, written here with a subscript $j$ indexing the output (with a single output there is just $y_1$):
$$y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd} h_d$$
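A minimal vectorized sketch of this $D$-hidden-unit network for a single input and a single output (parameter shapes and names are assumptions for illustration):

```python
import numpy as np

def shallow_net_1d(x, theta, phi):
    """Shallow network with D hidden units, one input and one output.

    theta: shape (D, 2), row d holds (theta_d0, theta_d1)
    phi:   shape (D + 1,), holds (phi_0, phi_1, ..., phi_D)
    """
    h = np.maximum(0.0, theta[:, 0] + theta[:, 1] * x)  # ReLU hidden units
    return phi[0] + phi[1:] @ h                          # weighted sum plus bias

D = 5
rng = np.random.default_rng(0)
theta = rng.normal(size=(D, 2))
phi = rng.normal(size=D + 1)
print(shallow_net_1d(0.7, theta, phi))
```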
In summary, the Universal Approximation Theorem proves that for any continuous function, there exists a shallow network that can approximate this function to any specified precision.
Multivariate Inputs and Outputs
The above example that we have been working with only has one input and one output. But in the general case you will find that you have a neural network that takes in multiple inputs and predicts multiple outputs. In that case the functional representation changes. Let's take an example of a neural network that has 3 inputs $x = [x_1, x_2, x_3]$ and 2 outputs $y = [y_1, y_2]$; in that case we can represent the hidden units as follows:

$$h_d = a[\theta_{d0} + \theta_{d1} x_1 + \theta_{d2} x_2 + \theta_{d3} x_3]$$
From these equations, we can derive a general case equation where the input $x = [x_1, x_2, ..., x_{D_i}] \in \mathbb{R}^{D_i}$ and the output $y = [y_1, y_2, ..., y_{D_o}] \in \mathbb{R}^{D_o}$, where $D_i$ is the number of input features (dimensions) and $D_o$ is the number of outputs. In this case each hidden unit computes:

$$h_d = a\!\left[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di} x_i\right]$$

and each output is a linear combination of the hidden units:

$$y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd} h_d$$
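Putting the general case together, here is a minimal vectorized sketch; the parameter layout (separate bias vectors and weight matrices) and all names are assumptions for illustration:

```python
import numpy as np

def shallow_net(x, theta0, theta, phi0, phi):
    """General shallow network: Di inputs, D hidden units, Do outputs.

    x:      shape (Di,)    input vector
    theta0: shape (D,)     hidden-unit biases theta_d0
    theta:  shape (D, Di)  hidden-unit weights theta_di
    phi0:   shape (Do,)    output biases phi_j0
    phi:    shape (Do, D)  output weights phi_jd
    """
    h = np.maximum(0.0, theta0 + theta @ x)  # h_d = ReLU(theta_d0 + sum_i theta_di x_i)
    return phi0 + phi @ h                    # y_j = phi_j0 + sum_d phi_jd h_d

# Example matching the section above: Di = 3 inputs, Do = 2 outputs, D = 4 hidden units
Di, D, Do = 3, 4, 2
rng = np.random.default_rng(1)
y = shallow_net(np.array([0.2, -1.0, 0.5]),
                rng.normal(size=D), rng.normal(size=(D, Di)),
                rng.normal(size=Do), rng.normal(size=(Do, D)))
print(y.shape)  # (2,)
```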