
Neural Network Implementation

With the derivative of the cost function derived in the last chapter, we can now code the network.

We will use matrices to represent the inputs and the weights of the network.

[!NOTE] Important Note on Matrix Conventions: In the previous chapters (Matrix Calculus), we derived equations assuming Column Vectors (where input $x$ is $N \times 1$). However, in standard Python/NumPy implementations (like the one below), we typically use Batch Processing where inputs are Row Vectors (where input $X$ is $BatchSize \times Features$).

This means:

  • Input $X$ has shape (BatchSize, InputFeatures)
  • Weight $W$ has shape (InputFeatures, OutputFeatures)
  • Forward pass is $Z = X \cdot W$ (instead of $W \cdot x$)
  • This effectively transposes the standard mathematical notation. The code below follows this “Row Vector/Batch” convention.
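To make the convention concrete, here is a throwaway shape check (assuming nothing beyond NumPy; the names X_demo, W_demo and x_col are mine, purely for illustration):

import numpy as np

# Row-vector / batch convention: each row of X_demo is one sample
X_demo = np.array([[0, 0, 1],
                   [0, 1, 1]])       # shape (2, 3): BatchSize=2, InputFeatures=3
W_demo = np.random.random((3, 4))    # shape (3, 4): InputFeatures=3, OutputFeatures=4
print(X_demo.dot(W_demo).shape)      # Z = X.W -> (2, 4), one output row per sample

# Column-vector convention from the derivation chapters: x is (3, 1),
# the weight matrix is the transpose, and the forward pass is W.x
x_col = X_demo[0].reshape(3, 1)
print(W_demo.T.dot(x_col).shape)     # Z = W.x -> (4, 1)

With that convention in mind, here is our input matrix: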
x = np.array(
    [
        [0,0,1],
        [0,1,1],
        [1,0,1],
        [1,1,1]
    ])

This is a 4*3 matrix. Note that each row is an input. Let's take all four of these rows as our 'training set'.

y = np.array(
  [
      [0],
      [1],
      [0],
      [1]
  ])

Note that you can change the expected output and try to train the neural network on it.

This is a 4*1 matrix that represents the expected output: for input [0,0,1] the output is [0], for [0,1,1] the output is [1], and so on.

A neural network is implemented as a set of matrices representing the weights of the network.

Let’s create a two-layered network. Before that, note the forward-propagation formula of the neural network:

\[a^l = \sigma(z^l), \quad z^l = a^{l-1} \cdot w^l\]

So the output at layer $l$ is the sigmoid of the dot product of the previous layer’s output with the weight matrix of layer $l$ (written here in the row-vector/batch convention noted above).

Now let’s see how the matrix dot product works based on the shape of matrices.

[m*n].[n*x] = [m*x]
[m*x].[x*y] = [m*y]

We take $[m \times n]$ as the input matrix; this is a $[4 \times 3]$ matrix.

Similarly, the output $y$ is a $[4 \times 1]$ matrix, so we have $[m \times y] = [4 \times 1]$.

So we have

m=4
n=3
x=?
y=1
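The only unknown is $x$, the size of the hidden layer; it is not fixed by the data, and we will pick 4. A throwaway shape check (assuming import numpy as np and the 4*3 input x defined above; hidden, w1_check and w2_check are illustrative names):

hidden = 4                                   # x, the hidden layer size, is ours to choose
w1_check = np.random.random((3, hidden))     # [n*x] = [3*4]
w2_check = np.random.random((hidden, 1))     # [x*y] = [4*1]
print(np.dot(x, w1_check).shape)                      # [m*n].[n*x] -> (4, 4)
print(np.dot(np.dot(x, w1_check), w2_check).shape)    # [m*x].[x*y] -> (4, 1), the shape of y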

Let’s then create our two weight matrices with the above shapes; these represent the two layers of the neural network.

w1 = np.random.random((3,4))
w2 = np.random.random((4,1))

We could keep the weights in an array and loop through them, but for the time being let’s hard-code the two layers (a loop-based version is sketched after the forward pass below). Note that ‘np’ is the conventional alias for the NumPy array library in Python (import numpy as np).

We also need to code in our non-linearity. We will use the Sigmoid function here.

def sigmoid(x):
    return 1/(1+np.exp(-x))

# derivative of the sigmoid
def derv_sigmoid(x):
   return sigmoid(x)*(1-sigmoid(x))
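As a quick sanity check on derv_sigmoid, we can compare it against a numerical central-difference approximation; this is a throwaway verification (the names h, pts and numeric are mine), not part of the network:

# central difference (f(x+h) - f(x-h)) / (2h) should be close to the analytic derivative
h = 1e-5
pts = np.array([-2.0, 0.0, 0.5, 3.0])
numeric = (sigmoid(pts + h) - sigmoid(pts - h)) / (2 * h)
print(np.allclose(numeric, derv_sigmoid(pts)))   # expect True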

With this we can compute the activations of each layer, using our neural network forward-propagation equation.

a0 = x
a1 = sigmoid(np.dot(a0,w1))

a2 = sigmoid(np.dot(a1,w2))
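As mentioned earlier, the same forward pass can be written as a loop over a list of weight matrices. A minimal sketch (layer_weights and forward are illustrative names, not used in the training code below):

layer_weights = [w1, w2]

def forward(inp, weights_list):
    # feed the activation of each layer in as the input of the next
    activation = inp
    for w in weights_list:
        activation = sigmoid(np.dot(activation, w))
    return activation

print(np.allclose(forward(a0, layer_weights), a2))   # expect True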

a2 is the calculated output from the randomly initialized weights. So let’s calculate the error by subtracting it from the expected value and taking half the squared error:

\[C = \frac{1}{2} \|y-a^l\|^2\]
c0 = ((y-a2)**2)/2
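c0 above is element-wise, one value per training row; for monitoring training it helps to reduce it to a single number, e.g. by summing (a small illustrative addition, matching the norm in the cost formula):

loss = np.sum(c0)    # (1/2) * ||y - a2||^2 summed over the four training rows
print(loss)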

Now we need to use the back-propagation algorithm to calculate how each weight has influenced the error and reduce it proportionally.


We use the gradient to update the weights in all the layers, do a forward pass again, re-calculate the error and loss, then re-calculate the error gradient $\frac{\partial C}{\partial w}$, and repeat:

\[\begin{aligned} w^2 = w^2 - (\frac {\partial C}{\partial w^2} )*learningRate \\ \\ w^1 = w^1 - (\frac {\partial C}{\partial w^1} )*learningRate \end{aligned}\]

The gradients are given by the formulas derived in the previous chapter:

\[\delta^2 = (a^2-y)\]

\[\mathbf{ \frac {\partial C}{\partial w^2} = \delta^{2} * \sigma'(z^2) * (a^{1})^T } \quad \rightarrow \text{Eq (3)}\]

\[\mathbf{ \frac {\partial C}{\partial w^1} = \sigma'(z^1) * (a^{0})^T * \delta^{2} * w^2 * \sigma'(z^2) } \quad \rightarrow \text{Eq (5)}\]

A Two-layered Neural Network in Python

Below is a two-layered network. I have used the code from http://iamtrask.github.io/2015/07/12/basic-python-network/ as the basis, with minor changes to fit how we derived the equations.

import numpy as np
# seed random numbers to make calculation deterministic 
np.random.seed(1)

# pretty print numpy array
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# let us code our sigmoid function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# let us add a method that takes the derivative of x as well
def derv_sigmoid(x):
   return sigmoid(x)*(1-sigmoid(x))

#---------------------------------------------------------------

# Two layered NW, based on (1) and the equations we derived above
# (1) http://iamtrask.github.io/2015/07/12/basic-python-network/
#---------------------------------------------------------------

# set learning rate as 1 for this toy example
learningRate = 1

# input x, also used as the training set here
x = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])

# desired output for each of the training set above
y = np.array([[0,1,1,0]]).T

# Explanation - when the input has exactly two ones (not three), the output is one
"""
Input [0,0,1]  Output = 0
Input [0,1,1]  Output = 1
Input [1,0,1]  Output = 1
Input [1,1,1]  Output = 0
"""

# Randomly initialized weights
weight1 =  np.random.random((3,4)) 
weight2 =  np.random.random((4,1)) 

# Activation to layer 0 is taken as input x
a0 = x

iterations = 1000
for iter in range(0,iterations):

  # Forward pass - Straight Forward
  z1= np.dot(x,weight1)
  a1 = sigmoid(z1) 
  z2= np.dot(a1,weight2)
  a2 = sigmoid(z2) 
  if iter == 0:
    print("Initial Output \n",a2)

  # Backward Pass - Backpropagation 
  delta2  = (a2-y)
  #---------------------------------------------------------------
  # Calculating change of Cost/Loss wrt weight of 2nd/last layer
  # Eq (A) ---> dC_dw2 = delta2*derv_sigmoid(z2)*a1.T
  #---------------------------------------------------------------

  dC_dw2_1  = delta2*derv_sigmoid(z2) 
  dC_dw2  = a1.T.dot(dC_dw2_1)
  
  #---------------------------------------------------------------
  # Calculating change of Cost/Loss wrt weight of 1st layer
  # Eq (B) ---> dC_dw1 = derv_sigmoid(z1)*delta2*derv_sigmoid(z2)*weight2*a0.T
  #        i.e. dC_dw1 = a0.T.dot( dC_dw2_1.dot(weight2.T) * derv_sigmoid(z1) )
  #---------------------------------------------------------------

  dC_dw1 =  np.dot(dC_dw2_1, weight2.T) * derv_sigmoid(z1)
  dC_dw1 = a0.T.dot(dC_dw1)

  #---------------------------------------------------------------
  #Gradient descent
  #---------------------------------------------------------------
 
  weight2 = weight2 - learningRate*(dC_dw2)
  weight1 = weight1 - learningRate*(dC_dw1)


print("New output",a2)

#---------------------------------------------------------------
# Training is done, weight1 and weight2 are primed for output y
#---------------------------------------------------------------

# Let's test it out: two ones in the input and one zero, output should be one
x = np.array([[1,0,1]])
z1= np.dot(x,weight1)
a1 = sigmoid(z1) 
z2= np.dot(a1,weight2)
a2 = sigmoid(z2) 
print("Output after Training is \n",a2)

Output

Initial Output 
 [[ 0.758]
 [ 0.771]
 [ 0.791]
 [ 0.801]]
New output [[ 0.028]
 [ 0.925]
 [ 0.925]
 [ 0.090]]
Output after Training is 
 [[ 0.925]]

We have trained the network so that its output closely matches $y$, that is [0,1,1,0].
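To check all four training rows at once, we can push the full training matrix through the trained weights and threshold at 0.5 (a small illustrative check; x_all, a_out and the 0.5 threshold are my choices, not part of the original code):

x_all = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])   # the full training set again
a_out = sigmoid(np.dot(sigmoid(np.dot(x_all, weight1)), weight2))
print((a_out > 0.5).astype(int))    # expect [[0],[1],[1],[0]], matching y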

References

1. http://iamtrask.github.io/2015/07/12/basic-python-network/