Shallow Neural Networks

Shallow neural networks describe piecewise linear functions and are good at expressing complex relationships between multi-dimensional inputs and outputs. Piecewise linear functions are functions composed of multiple linear segments (i.e. the slope of the function changes at certain points), and each segment corresponds to a section of values in the domain. For example:

f(x) = \begin{cases} 2x + 1 & \text{if } x < 0 \\ -3x + 2 & \text{if } x \geq 0 \end{cases}

How do shallow neural networks work?

Let's take the example of a neural network y = f[x, \theta, \phi]. This neural network is a function that takes an input x and has parameters \theta and \phi. When we expand the equation we get

y = \phi_0 + \phi_1 a(\theta_{10} + \theta_{11}x) + \phi_2 a(\theta_{20} + \theta_{21}x) + \phi_3 a(\theta_{30} + \theta_{31}x)

If you analyse the above equation carefully, you can split it into several components: the linear functions (\theta_{10} + \theta_{11}x, \theta_{20} + \theta_{21}x, \theta_{30} + \theta_{31}x), an activation function a[\cdot], the weights \phi_1, \phi_2, \phi_3, and the bias \phi_0.

The activated linear functions are what we call hidden units, and we can break them down as

h_1 = a(\theta_{10} + \theta_{11}x)
h_2 = a(\theta_{20} + \theta_{21}x)
h_3 = a(\theta_{30} + \theta_{31}x)

Once we have computed them, we multiply each hidden unit by its weight and sum everything up together with the bias:

y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3

The activation function a[\cdot] is there to add non-linearity to the neural network. Without it, the neural network is just one big linear equation. There are many activation functions, but the most common one, and the one we shall use, is the rectified linear unit, or ReLU. It is a function that returns zero (0) if the input is negative and otherwise returns the input unchanged, i.e.

\text{ReLU}(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases}
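To make these pieces concrete, here is a minimal Python sketch of ReLU and the three-hidden-unit network above. The function names (relu, shallow_network) and the use of plain Python floats are my own choices for illustration.

```python
def relu(z):
    """Rectified linear unit: returns 0 for negative inputs, the input itself otherwise."""
    return max(0.0, z)

def shallow_network(x, theta, phi, activation=relu):
    """Shallow network with three hidden units.

    theta: three (theta_d0, theta_d1) pairs, one per hidden unit.
    phi:   [phi_0, phi_1, phi_2, phi_3] -- output bias followed by the weights.
    """
    h = [activation(t0 + t1 * x) for (t0, t1) in theta]          # hidden units h_1..h_3
    return phi[0] + sum(p * h_d for p, h_d in zip(phi[1:], h))   # phi_0 + sum of phi_d * h_d
```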

Activation functions accomplish this non-linearity by determining which hidden units in the neural network are active or inactive based on the input. Take an example where we have the following parameters.

| Parameter | Value |
| --- | --- |
| \theta_{10} | -1 |
| \theta_{11} | 1 |
| \theta_{20} | 0 |
| \theta_{21} | 1 |
| \theta_{30} | 1 |
| \theta_{31} | 1 |
| \phi_0 | 0 |
| \phi_1 | 2 |
| \phi_2 | -1 |
| \phi_3 | 1 |

The neural network equation becomes

h_1 = \text{ReLU}(-1 + 1 \cdot x)
h_2 = \text{ReLU}(0 + 1 \cdot x)
h_3 = \text{ReLU}(1 + 1 \cdot x)
y = 0 + 2 \cdot h_1 - 1 \cdot h_2 + 1 \cdot h_3

Case 1: x = -2

h_1 = \text{ReLU}(-1 + (-2)) = \text{ReLU}(-3) = 0
h_2 = \text{ReLU}(0 + (-2)) = \text{ReLU}(-2) = 0
h_3 = \text{ReLU}(1 + (-2)) = \text{ReLU}(-1) = 0
y = 0 + 2 \cdot 0 - 1 \cdot 0 + 1 \cdot 0 = 0

Output for x = -2 is 0


Case 2: x = 0

h_1 = \text{ReLU}(-1 + 0) = \text{ReLU}(-1) = 0
h_2 = \text{ReLU}(0 + 0) = \text{ReLU}(0) = 0
h_3 = \text{ReLU}(1 + 0) = \text{ReLU}(1) = 1
y = 0 + 2 \cdot 0 - 1 \cdot 0 + 1 \cdot 1 = 1

Output for x = 0 is 1


Case 3: x = 1

h_1 = \text{ReLU}(-1 + 1) = \text{ReLU}(0) = 0
h_2 = \text{ReLU}(0 + 1) = \text{ReLU}(1) = 1
h_3 = \text{ReLU}(1 + 1) = \text{ReLU}(2) = 2
y = 0 + 2 \cdot 0 - 1 \cdot 1 + 1 \cdot 2 = -1 + 2 = 1

Output for x = 1 is 1


Case 4: x = 3

h_1 = \text{ReLU}(-1 + 3) = \text{ReLU}(2) = 2
h_2 = \text{ReLU}(0 + 3) = \text{ReLU}(3) = 3
h_3 = \text{ReLU}(1 + 3) = \text{ReLU}(4) = 4
y = 0 + 2 \cdot 2 - 1 \cdot 3 + 1 \cdot 4 = 4 - 3 + 4 = 5

Output for x = 3 is 5

At small x, all the ReLUs are inactive (they output 0), so the output is 0. As x increases, the hidden units "turn on" at different points: h_3 activates at x = -1, h_2 at x = 0, and h_1 at x = 1, so the slope of y changes at each of these joints.
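As a quick numerical check, the following self-contained sketch plugs the parameter table into the same equations and reproduces the four outputs worked out above (the helper names are, again, just illustrative):

```python
def relu(z):
    return max(0.0, z)

# Parameters from the table: (theta_d0, theta_d1) for each hidden unit,
# then [phi_0, phi_1, phi_2, phi_3] for the output layer.
theta = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
phi = [0.0, 2.0, -1.0, 1.0]

def predict(x):
    h = [relu(t0 + t1 * x) for (t0, t1) in theta]
    return phi[0] + sum(p * h_d for p, h_d in zip(phi[1:], h))

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(f"x = {x:4.1f}  ->  y = {predict(x)}")   # prints 0.0, 1.0, 1.0, 5.0
```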

Interactive Graph Demo


Find the source code for this interactive graph on GitHub.


Universal Approximation Theorem

In the above example we have shown a neural network with only 3 hidden units; however, a shallow neural network can have many more hidden units, depending on the complexity of the function it is trying to describe. We can generalize this and consider the case with D hidden units, where the d-th hidden unit is:

h_d = a(\theta_{d0} + \theta_{d1}x)

When these are combined linearly, we get the output-layer equation

y = \phi_0 + \sum_{d=1}^{D} \phi_d h_d

In summary, the Universal Approximation Theorem states that for any continuous function (on a bounded input region), there exists a shallow network that can approximate that function to any specified precision.
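The theorem is an existence result, but the effect is easy to see numerically. The sketch below is my own illustration, not a proof: it fixes D ReLU joints evenly over the input range, fits only the output weights \phi by least squares, and approximates \sin(x) on [0, 2\pi]; the worst-case error shrinks as D grows.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_error(D, x_train, y_train, x_test, y_test):
    # Hidden units h_d(x) = ReLU(x - joint_d), i.e. theta_d1 = 1 and theta_d0 = -joint_d,
    # with the joints spaced evenly over the training range.
    joints = np.linspace(x_train.min(), x_train.max(), D)
    H_train = relu(x_train[:, None] - joints[None, :])
    H_test = relu(x_test[:, None] - joints[None, :])
    # Prepend a column of ones so the first fitted weight plays the role of phi_0.
    A_train = np.hstack([np.ones((len(x_train), 1)), H_train])
    A_test = np.hstack([np.ones((len(x_test), 1)), H_test])
    phi, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.max(np.abs(A_test @ phi - y_test))

x_train = np.linspace(0, 2 * np.pi, 500)
x_test = np.linspace(0, 2 * np.pi, 1000)
for D in [3, 10, 30, 100]:
    err = max_error(D, x_train, np.sin(x_train), x_test, np.sin(x_test))
    print(f"D = {D:3d}   max |sin(x) - network(x)| = {err:.4f}")
```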

Multivariate Inputs and Outputs

The example we have been working with so far has only one input and one output. In the general case, however, a neural network takes in multiple inputs and predicts multiple outputs, and the functional representation changes accordingly. Let's take the example of a neural network that has 3 inputs x = [x_1, x_2, x_3] and 2 outputs y = [y_1, y_2]. In that case we can represent the hidden units as follows:

h_1 = \text{ReLU}(\theta_{10} + \theta_{11}x_1 + \theta_{12}x_2 + \theta_{13}x_3)
h_2 = \text{ReLU}(\theta_{20} + \theta_{21}x_1 + \theta_{22}x_2 + \theta_{23}x_3)
h_3 = \text{ReLU}(\theta_{30} + \theta_{31}x_1 + \theta_{32}x_2 + \theta_{33}x_3)

and the output layer equations are defined as

y_1 = \phi_{10} + \phi_{11} h_1 + \phi_{12} h_2 + \phi_{13} h_3
y_2 = \phi_{20} + \phi_{21} h_1 + \phi_{22} h_2 + \phi_{23} h_3
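A direct transliteration of these equations into Python, with placeholder parameter values chosen only so the example runs, might look like this:

```python
def relu(z):
    return max(0.0, z)

# Placeholder parameter values, chosen only so the example runs.
theta = [[ 0.1,  1.0, -0.5,  0.2],   # theta_10, theta_11, theta_12, theta_13
         [-0.3,  0.4,  0.8, -1.0],   # theta_20, theta_21, theta_22, theta_23
         [ 0.0, -0.2,  0.6,  0.5]]   # theta_30, theta_31, theta_32, theta_33
phi = [[ 0.2,  1.0, -1.0,  0.5],     # phi_10, phi_11, phi_12, phi_13
       [-0.1,  0.3,  0.7, -0.4]]     # phi_20, phi_21, phi_22, phi_23

x1, x2, x3 = 1.0, 2.0, -1.0

h1 = relu(theta[0][0] + theta[0][1] * x1 + theta[0][2] * x2 + theta[0][3] * x3)
h2 = relu(theta[1][0] + theta[1][1] * x1 + theta[1][2] * x2 + theta[1][3] * x3)
h3 = relu(theta[2][0] + theta[2][1] * x1 + theta[2][2] * x2 + theta[2][3] * x3)

y1 = phi[0][0] + phi[0][1] * h1 + phi[0][2] * h2 + phi[0][3] * h3
y2 = phi[1][0] + phi[1][1] * h1 + phi[1][2] * h2 + phi[1][3] * h3
print(y1, y2)
```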

From these equations we can derive the general case, where the input x = [x_1, x_2, ..., x_{D_i}] \in \mathbb{R}^{D_i} and the output y = [y_1, y_2, ..., y_{D_o}] \in \mathbb{R}^{D_o}, with D_i the number of input features (dimensions) and D_o the number of outputs. In this case each hidden unit computes:

h_d = a \left( \theta_{d0} + \sum_{i=1}^{D_i} \theta_{di} x_i \right)

and the output layer equation is

y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd} h_d
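In code, the general case collapses to two matrix-vector products. The vectorized sketch below is one way to write it (folding the biases into the first column of Theta and Phi is my own convention); it works for any D_i, D, and D_o.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow_network(x, Theta, Phi, activation=relu):
    """General shallow network.

    x:     input vector of shape (D_i,)
    Theta: hidden-layer parameters of shape (D, D_i + 1); column 0 holds the theta_d0 terms
    Phi:   output-layer parameters of shape (D_o, D + 1); column 0 holds the phi_j0 terms
    Returns the output vector of shape (D_o,).
    """
    h = activation(Theta[:, 0] + Theta[:, 1:] @ x)   # h_d = a(theta_d0 + sum_i theta_di x_i)
    return Phi[:, 0] + Phi[:, 1:] @ h                # y_j = phi_j0 + sum_d phi_jd h_d

# Example: D_i = 4 inputs, D = 5 hidden units, D_o = 2 outputs, random parameters.
rng = np.random.default_rng(0)
Theta = rng.normal(size=(5, 5))   # (D, D_i + 1)
Phi = rng.normal(size=(2, 6))     # (D_o, D + 1)
print(shallow_network(rng.normal(size=4), Theta, Phi))
```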