Backpropagation from Scratch - An Intuitive Approach

Teaching Assistants : Alex Mathai, Rishav Sinha

In [1]:
# Backpropagation from Scratch
import numpy as np
import os
from collections import OrderedDict
from copy import deepcopy

Most beginners try to understand back-propagation by rote-learning formulae. This assignment is intended to convince you that such an approach is not required, as back-propagation is quite intuitive if taught correctly. In this assignment I aim to teach you how to think about losses and gradients, and how to derive back-propagation by yourself without having to rely on deep learning frameworks.

Simple Beginnings with A Simple Network

Consider the simple network below. There is a single neuron with two inputs $X_{2}$ and $X_{1}$ connected by two weights ($W_{2}$ and $W_{1}$).

As shown below, let $W_{1} = 0.3$ and $W_{2}=0.7$.

So the network below can be summarized as $\hat{Y} = W_{2}*X_{2} + W_{1}*X_{1}$

Let the ground truth be $Y_{g}$, which along with $\hat{Y}$ is fed to the squared error loss function.

So the $Loss = (1/2)*(Y_{g} - \hat{Y})^{2}$

In [2]:
%%html
<img src="Images/Simple_Network.jpg" width=300 height=300>

Concept 1 - Anti-Gradients are good for loss functions

To motivate this concept, let $X_{1} = -100$ and $X_{2}=1000$. Hence $\hat{Y} = 670$. Now let us say that $Y_{g} = 1000$.

It is apparent that, in order to reduce the loss, we need to increase $\hat{Y}$. Let us study how our loss changes as we vary the weights of the network for this input.

Observations about Varying $W_{1}$

  1. By increasing $W_{1}$, we decrease $\hat{Y}$ (since $X_{1}$ is negative), thus increasing $Loss$.
  2. As increasing $W_{1}$ increases $Loss$, we can say that $(\frac{\partial Loss}{\partial W_{1}}) > 0$ and so $(-\frac{\partial Loss}{\partial W_{1}}) < 0$.
  3. As our aim is to reduce loss, one subtask is to decrease $W_{1}$.

Observations about Varying $W_{2}$

  1. By increasing $W_{2}$, we increase $\hat{Y}$ (since $X_{2}$ is positive), thus decreasing $Loss$.
  2. As increasing $W_{2}$ decreases $Loss$, we can say that $(\frac{\partial Loss}{\partial W_{2}}) < 0$ and so $(-\frac{\partial Loss}{\partial W_{2}}) > 0$.
  3. As our aim is to reduce loss, one subtask is to increase $W_{2}$.

What do we want?

  1. From our observations above, we desire to decrease $W_{1}$ and increase $W_{2}$.
  2. These desired changes can be summarized in the following formula : $W_{i} = W_{i} -\alpha*\frac{\partial Loss}{\partial W_{i}}$ or $W_{i} = W_{i} + \alpha*(-\frac{\partial Loss}{\partial W_{i}})$
  3. Verify for yourself, that by moving each $W_{i}$ in this manner we manage to reduce loss.

Hence if we change any parameter ($W_{i}$) along the opposite direction of $\frac{\partial Loss}{\partial W_{i}}$, we will end up reducing the loss. This is called moving along the anti-gradient.
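
To see Concept 1 in action, here is a small illustrative sketch (not part of the graded exercises) that takes the values above ($W_{1}=0.3$, $W_{2}=0.7$, $X_{1}=-100$, $X_{2}=1000$, $Y_{g}=1000$), performs one anti-gradient step on both weights, and confirms that the loss drops. The gradient formula used in the comments, $\frac{\partial Loss}{\partial W_{i}} = (\hat{Y}-Y_{g})*X_{i}$, is derived in full later in this notebook; the variable names are just for illustration.

In [ ]:
# Illustrative sketch : one anti-gradient step on the simple network.
# Y_hat = W1*X1 + W2*X2 and Loss = 0.5*(Yg - Y_hat)^2, so dLoss/dWi = (Y_hat - Yg)*Xi.
W1, W2 = 0.3, 0.7
X1, X2, Yg = -100.0, 1000.0, 1000.0
alpha = 1e-6  # a small learning rate keeps the single step stable

def loss(w1, w2):
    y_hat = w1*X1 + w2*X2
    return 0.5*(Yg - y_hat)**2

grad_W1 = (W1*X1 + W2*X2 - Yg)*X1   # positive here, so the update decreases W1
grad_W2 = (W1*X1 + W2*X2 - Yg)*X2   # negative here, so the update increases W2

W1_new = W1 - alpha*grad_W1
W2_new = W2 - alpha*grad_W2

print(loss(W1, W2), ">", loss(W1_new, W2_new))  # the loss after the step is smaller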

In [ ]:
def apply_gradient_change(weight:np.ndarray,
                          gradient:np.ndarray,
                          alpha:float) -> np.ndarray:
    """ Takes the weight and applies the gradient change as mentioned above. It then
    returns the changed weight. 
    
    Max Marks : 1
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!

assert np.array_equal(apply_gradient_change(np.array([1.,2.,3.,4.]),np.array([1.,2.,3.,4.]),0.001),np.array([0.999,1.998,2.997,3.996]))

Adding more neurons & the Error Graph

Let us add a few more neurons to the simple network in order to introduce another concept. Consider the network below. We will refer to it as the Complex Network in the following discussions.

  1. $ \hat{Y} = W_{1}*Z_{1} + W_{2}*Z_{2}$ : Here Pred is the same as $\hat{Y}$

    • $\frac{\partial \hat{Y}}{\partial Z_{1}} = \frac{\partial (W_{1}*Z_{1} + W_{2}*Z_{2})}{\partial Z_{1}} = W_{1}$
    • $\frac{\partial \hat{Y}}{\partial Z_{2}} = \frac{\partial (W_{1}*Z_{1} + W_{2}*Z_{2})}{\partial Z_{2}} = W_{2}$
  2. $Z_{1} = W_{3}*Z_{3} + W_{4}*Z_{4}$

    • $\frac{\partial Z_{1}}{\partial Z_{3}} = W_{3}$
    • $\frac{\partial Z_{1}}{\partial Z_{4}} = W_{4}$
  3. $Z_{2} = W_{5}*Z_{4}$

    • $\frac{\partial Z_{2}}{\partial Z_{4}} = W_{5}$
  4. $Z_{4} = W_{7}*X_{2}$

    • $\frac{\partial Z_{4}}{\partial X_{2}} = W_{7}$
  5. $Z_{3} = W_{6}*X_{1}$

    • $\frac{\partial Z_{3}}{\partial X_{1}} = W_{6}$

$Loss = (1/2)*(Y_{g} - \hat{Y})^{2}$

Note that when we feed many points in a mini-batch at the same time, the mini-batch loss is the average of the losses calculated for the individual points.

$Loss = \frac{1}{2m}\sum_{j=1}^{m}(Y_{g}^{(j)} - \hat{Y}^{(j)})^{2}$, where $m$ is the number of points in the mini-batch and the superscript $(j)$ indexes the points.
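
As a tiny illustrative sketch (with made-up numbers, not part of any exercise), this is how the mini-batch loss above and its per-point derivative $\frac{\partial Loss}{\partial \hat{Y}}$, which we will use in the next section, look in numpy.

In [ ]:
# Illustrative sketch : mini-batch squared-error loss averaged over m points.
y_ground = np.array([1.0, 2.0, 3.0])
y_hat    = np.array([1.5, 1.5, 2.0])
m = len(y_ground)

loss = np.sum((y_ground - y_hat)**2) / (2*m)   # (1/2m) * sum_j (Yg_j - Yhat_j)^2
error_pred = (y_hat - y_ground) / m            # dLoss/dYhat_j for each point j

print(loss, error_pred)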

In [3]:
%%html
<img src="Images/Complex_Network.jpg" width=500 height=500>

As explained below, we observe a pattern when calculating $(\frac{\partial Loss}{\partial Z_{i}})$ for each $Z_{i}$. If you are not interested in the math, then at least take a look at the Simplified Meaning statements. We assume that $m$ data points are sent in a mini-batch.

1) $Error_{Pred} = \frac{\partial Loss}{\partial \hat{Y}} = (1/m)*(\hat{Y}-Y_{g})$

2) $Error_{Z_{1}} = \frac{\partial Loss}{\partial Z_{1}} = \frac{\partial Loss}{\partial \hat{Y}}*\frac{\partial \hat{Y}}{\partial Z_{1}} = \frac{\partial Loss}{\partial \hat{Y}}*W_{1}$

----> Simplified Meaning : $Error_{Z_{1}} = Error_{Pred} * W_{1}$

3) $Error_{Z_{2}} = \frac{\partial Loss}{\partial Z_{2}} = \frac{\partial Loss}{\partial \hat{Y}}*\frac{\partial \hat{Y}}{\partial Z_{2}} = \frac{\partial Loss}{\partial \hat{Y}}*W_{2}$

----> Simplified Meaning : $Error_{Z_{2}} = Error_{Pred} * W_{2}$

4) $Error_{Z_{3}} = \frac{\partial Loss}{\partial Z_{3}} = \frac{\partial Loss}{\partial \hat{Y}}*\frac{\partial \hat{Y}}{\partial Z_{3}} = \frac{\partial Loss}{\partial \hat{Y}}*( W_{1}*\frac{\partial Z_{1}}{\partial Z_{3}} + W_{2}*\frac{\partial Z_{2}}{\partial Z_{3}} ) = (\frac{\partial Loss}{\partial \hat{Y}}*W_{1})*\frac{\partial Z_{1}}{\partial Z_{3}} = (\frac{\partial Loss}{\partial Z_{1}})*W_{3}$ (the $W_{2}$ term vanishes because $Z_{2}$ does not depend on $Z_{3}$, so $\frac{\partial Z_{2}}{\partial Z_{3}} = 0$)

----> Simplified Meaning : $Error_{Z_{3}} = Error_{Z_{1}} * W_{3}$

5) $Error_{Z_{4}} = \frac{\partial Loss}{\partial Z_{4}} = \frac{\partial Loss}{\partial \hat{Y}}*\frac{\partial \hat{Y}}{\partial Z_{4}} = \frac{\partial Loss}{\partial \hat{Y}}*( W_{1}*\frac{\partial Z_{1}}{\partial Z_{4}} + W_{2}*\frac{\partial Z_{2}}{\partial Z_{4}} ) = (\frac{\partial Loss}{\partial \hat{Y}}*W_{1})*\frac{\partial Z_{1}}{\partial Z_{4}} + (\frac{\partial Loss}{\partial \hat{Y}}*W_{2})*\frac{\partial Z_{2}}{\partial Z_{4}} = (\frac{\partial Loss}{\partial Z_{1}})*W_{4} + (\frac{\partial Loss}{\partial Z_{2}})*W_{5}$

----> Simplified Meaning : $Error_{Z_{4}} = Error_{Z_{1}} * W_{4} + Error_{Z_{2}} * W_{5}$
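
Before moving on, here is a short sanity-check sketch (purely illustrative, with arbitrary scalar values and a single data point) that compares the $Error_{Z_{4}}$ obtained from the simplified equations above against a finite-difference estimate of $\frac{\partial Loss}{\partial Z_{4}}$.

In [ ]:
# Illustrative sketch : numeric check of Error_Z4 = Error_Z1*W4 + Error_Z2*W5.
W1, W2, W3, W4, W5 = 0.2, 0.5, 0.7, 0.4, 0.9
Z3, Z4, Yg = 1.3, -0.6, 2.0  # arbitrary values, single data point (m = 1)

def loss_from(z3, z4):
    z1 = W3*z3 + W4*z4
    z2 = W5*z4
    y_hat = W1*z1 + W2*z2
    return 0.5*(Yg - y_hat)**2

# Errors computed with the simplified equations above
Z1, Z2 = W3*Z3 + W4*Z4, W5*Z4
Y_hat = W1*Z1 + W2*Z2
Error_Pred = Y_hat - Yg
Error_Z1, Error_Z2 = Error_Pred*W1, Error_Pred*W2
Error_Z4 = Error_Z1*W4 + Error_Z2*W5

# Finite-difference estimate of dLoss/dZ4
eps = 1e-6
numeric = (loss_from(Z3, Z4 + eps) - loss_from(Z3, Z4 - eps)) / (2*eps)
print(Error_Z4, numeric)  # the two values should agree closely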

A pictorial representation of these errors is shown below. This graph representation of errors is known as an Error Graph. If you observe it carefully, it is the same network but with the edges reversed. Hence, it is said that the flow of errors is exactly opposite to the flow of computation.

In [4]:
%%html
<img src="Images/Error_Graph.jpg" width=500 height=500>

Concept 2 - Connected neurons are like family

If you have understood the error graph, it will be easy for you to grasp Concept 2. Concept 2 states that in order to find the error of a neuron, we only need to know the errors of the neurons it is connected to. This can be verified from all the error equations we calculated above.

Connected neurons are like family.

In [ ]:
def get_Errors_of_Complex_Network(y_hat:np.ndarray,
                                  y_ground:np.ndarray,
                                  All_weights:OrderedDict,
                                  All_Errors:OrderedDict) -> None:
    """ Calculates the errors for the Complex Network and stores these errors in "All_Errors". 
    
    Max Marks : 2
    
    Arguments:
        
        y_hat : Prediction Array with size = [batch_size]
        y_ground : Ground truth Array with size = [batch_size]
        All_weights : Dictionary with all stored weights. The keys and values are listed below.
                      All_weights["W1"] contains weight W1
                      All_weights["W2"] contains weight W2
                      All_weights["W3"] contains weight W3
                      All_weights["W4"] contains weight W4
                      All_weights["W5"] contains weight W5
                      All_weights["W6"] contains weight W6
                      All_weights["W7"] contains weight W7
        All_Errors : Dictionary with all stored errors. The keys and values are listed below.
                     All_Errors["Pred"] should contain the Error of Pred
                     All_Errors["Z1"] should contain the Error of Z1
                     All_Errors["Z2"] should contain the Error of Z2
                     All_Errors["Z3"] should contain the Error of Z3
                     All_Errors["Z4"] should contain the Error of Z4
    """
    batch_size = len(y_ground)
    
    Error_Pred = (y_hat - y_ground)/batch_size
    # Please calculate Error_Z1,Error_Z2,Error_Z3,Error_Z4
    # YOUR CODE HERE
    raise NotImplementedError()

    try:
        assert(Error_Pred.shape == (batch_size,))
        assert(Error_Z1.shape == (batch_size,))
        assert(Error_Z2.shape == (batch_size,))
        assert(Error_Z3.shape == (batch_size,))
        assert(Error_Z4.shape == (batch_size,))
    except :
        print("Error Pred : {}".format(Error_Pred.shape))
        print("Error Z1 : {}".format(Error_Z1.shape))
        print("Error Z2 : {}".format(Error_Z2.shape))
        print("Error Z3 : {}".format(Error_Z3.shape))
        print("Error Z4 : {}".format(Error_Z4.shape))
        raise ValueError("Something's wrong!")
    
    All_Errors["Pred"] = Error_Pred
    All_Errors["Z1"] = Error_Z1
    All_Errors["Z2"] = Error_Z2
    All_Errors["Z3"] = Error_Z3
    All_Errors["Z4"] = Error_Z4
    
In [ ]:
# Please run these hidden tests !!

Network Collapse & Adding Activation Functions

When we do not add an activation function, we can perform a procedure known as Network Collapse. Consider the complex network shown in the previous section. If we observe carefully, $\hat{Y}$ can be written directly as a function of $X_{1}$ and $X_{2}$.

For example, $\hat{Y} = {W_{1}}_{New}*X_{1} + {W_{2}}_{New}*X_{2}$, where ${W_{1}}_{New}$ is expressed as $W_{1}*W_{3}*W_{6}$. Can you find out ${W_{2}}_{New}$?
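
You can convince yourself of the collapse numerically. The sketch below (purely illustrative, not a graded answer) checks only the $X_{1}$ path: with $X_{2}=0$, the complex network should produce exactly $(W_{1}*W_{3}*W_{6})*X_{1}$. Finding ${W_{2}}_{New}$ is left to the exercise below; the weight values here are arbitrary.

In [ ]:
# Illustrative sketch : numeric check of the X1 part of the network collapse.
W = dict(W1=0.1, W2=0.9, W3=0.3, W4=0.4, W5=0.9, W6=0.1, W7=0.3)
X1, X2 = 5.0, 0.0  # X2 = 0 isolates the X1 path

Z3 = W["W6"]*X1
Z4 = W["W7"]*X2
Z1 = W["W3"]*Z3 + W["W4"]*Z4
Z2 = W["W5"]*Z4
Y_hat = W["W1"]*Z1 + W["W2"]*Z2

W1_New = W["W1"]*W["W3"]*W["W6"]
print(Y_hat, W1_New*X1)  # both should print the same value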

In [ ]:
def calculate_W1_New(All_weights:OrderedDict)->float:
    """ Calculates W1_New of the collapsed network.
    
    Arguments :
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                          All_weights["W1"] contains weight W1
                          All_weights["W2"] contains weight W2
                          All_weights["W3"] contains weight W3
                          All_weights["W4"] contains weight W4
                          All_weights["W5"] contains weight W5
                          All_weights["W6"] contains weight W6
                          All_weights["W7"] contains weight W7
    """
    return All_weights["W1"]*All_weights["W3"]*All_weights["W6"]
In [ ]:
def calculate_W2_New(All_weights:OrderedDict)->float:
    """ Calculates W2_New of the collapsed network.
    
    Max Marks : 1
    
    Arguments :
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                          All_weights["W1"] contains weight W1
                          All_weights["W2"] contains weight W2
                          All_weights["W3"] contains weight W3
                          All_weights["W4"] contains weight W4
                          All_weights["W5"] contains weight W5
                          All_weights["W6"] contains weight W6
                          All_weights["W7"] contains weight W7
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!
sample_weights = OrderedDict()
sample_weights["W1"] = 0.1
sample_weights["W2"] = 0.9
sample_weights["W3"] = 0.3
sample_weights["W4"] = 0.4
sample_weights["W5"] = 0.9
sample_weights["W6"] = 0.1
sample_weights["W7"] = 0.3

assert( calculate_W2_New(sample_weights) == 0.255)

We realize that, without adding an activation function, complex networks are indeed useless: they collapse back into a simple linear model! Hence let us now add an activation function. Right below is an example of a very minimal network, used to showcase what differences arise in backprop when we incorporate an activation function. In this example, we use the sigmoid function.

In [5]:
%%html
<img src="Images/Activation.jpg" width=500 height=500>

Let us list down all the computation equations.

  1. $\hat{Y} = \sigma{(Z_{2})}*W_{2}$
    • $\frac{\partial \hat{Y}}{\partial Z_{2}} = W_{2}*\frac{\partial (\sigma{(Z_{2})})}{\partial Z_{2}} = W_{2}*\sigma{(Z_{2})}*(1-\sigma{(Z_{2})})$
  2. $Z_{2} = Z_{1}*W_{1}$
    • $\frac{\partial Z_{2}}{\partial Z_{1}} = W_{1}$

Let us list down all the backprop equations.

  1. $Error_{Pred} = \frac{\partial Loss}{\partial \hat{Y}} = (1/m)*(\hat{Y} - Y_{g})$

  2. $Error_{Z_{2}} = \frac{\partial Loss}{\partial Z_{2}} = \frac{\partial Loss}{\partial \hat{Y}} * \frac{\partial \hat{Y}}{\partial Z_{2}} = Error_{Pred}*W_{2}*\sigma{(Z_{2})}*(1-\sigma{(Z_{2})}) = Error_{Pred}*W_{2}*differentiation(\sigma(Z_{2}))$

  3. $Error_{Z_{1}} = \frac{\partial Loss}{\partial Z_{1}} = (\frac{\partial Loss}{\partial \hat{Y}} * \frac{\partial \hat{Y}}{\partial Z_{2}}) * \frac{\partial Z_{2}}{\partial Z_{1}} = (Error_{Z_{2}})*W_{1}$

By observing equations 2 and 3 we understand a very simple thing.

$Error_{Z_{2}}$ would have been $Error_{Pred}*W_{2}$ if there were no activation function. By adding an activation function, $Error_{Pred}*W_{2}$ is simply multiplied by the differentiation of the activation function, which is $\sigma{(Z_{2})}*(1-\sigma{(Z_{2})})$.

$Error_{Z_{1}}$ is unchanged, as there is no activation function applied after $Z_{1}$.

Concept 3 - Multiply the Differentiation of the Activation function

It is now apparent that the only change in backprop when adding an activation function is an extra multiplication by the differentiation of that activation function.
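
As a quick numerical check of Concept 3 (illustrative only, and separate from the graded sigmoid/diff_sigmoid exercises below), the sketch here compares the analytic derivative $\sigma(x)*(1-\sigma(x))$ used above against a central finite-difference estimate.

In [ ]:
# Illustrative sketch : finite-difference check of the sigmoid derivative formula.
x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6

sig = lambda v: 1.0/(1.0 + np.exp(-v))            # sigma(x) = 1/(1+e^{-x})
analytic = sig(x)*(1.0 - sig(x))                  # sigma(x)*(1 - sigma(x))
numeric  = (sig(x + eps) - sig(x - eps))/(2*eps)  # central difference

print(np.allclose(analytic, numeric, atol=1e-6))  # expected : True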

In [ ]:
def sigmoid(x:np.ndarray) -> np.ndarray:
    """ This function returns the output of sigmoid(x). Where sigmoid(x) = 1/(1+e^{-x}). Make sure that you
    first make a copy of "x" using np.copy(). Perform operations on this copy and return it.
    
    Max Marks : 1
    
    Arguments:
        x : Input to sigmoid function
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!

assert(sigmoid(np.array([0])) == np.array([0.5]))
In [ ]:
def diff_sigmoid(x:np.ndarray) -> np.ndarray:
    """ This function returns the differentiation of sigmoid(x). Where sigmoid(x) = 1/(1+e^{-x}). Make sure that you
    first make a copy of "x" using np.copy(). Perform operations on this copy and return it.
    
    Max Marks : 1
    
    Arguments:
        x : The input to sigmoid function
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!

assert(diff_sigmoid(np.array([0])) == 0.25)
In [ ]:
def get_Errors_with_Activation_Fn(y_hat:np.ndarray,
                                  y_ground:np.ndarray,
                                  All_weights:OrderedDict,
                                  All_Zs:OrderedDict,
                                  All_Errors:OrderedDict) -> None:
    """ Calculates the errors for the minimal network given above and stores these errors into "All_Errors".
    To incorporate the differentiation of the activation function, you can use the "diff_sigmoid" function.
    
    Max Marks : 2
    
    Arguments:
        
        y_hat : Prediction Array with size = [batch_size]
        y_ground : Ground truth Array with size = [batch_size]
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                      All_weights["W1"] contains weight W1
                      All_weights["W2"] contains weight W2
        All_Zs : Dictionary with all stored outputs. The keys and values are listed below
                 All_Zs["Z1"] contains output Z1
                 All_Zs["Z2"] contains output Z2
        All_Errors : Dictionary with all stored errors. The keys and values are listed below.
                     All_Errors["Pred"] should contain the Error of Pred
                     All_Errors["Z1"] should contain the Error of Z1
                     All_Errors["Z2"] should contain the Error of Z2
    """

    batch_size = len(y_ground)
    
    Error_Pred = (y_hat - y_ground)/batch_size
    # Please calculate Error_Z2 and Error_Z1
    # YOUR CODE HERE
    raise NotImplementedError()
        
    try:
        assert(Error_Pred.shape == (batch_size,))
        assert(Error_Z1.shape == (batch_size,))
        assert(Error_Z2.shape == (batch_size,))
    except :
        print("Error Pred : {}".format(Error_Pred.shape))
        print("Error Z1 : {}".format(Error_Z1.shape))
        print("Error Z2 : {}".format(Error_Z2.shape))
        raise ValueError("Something's wrong!")
        
    
    All_Errors["Pred"] = Error_Pred
    All_Errors["Z2"] = Error_Z2
    All_Errors["Z1"] = Error_Z1
    
In [ ]:
# Please run these hidden tests !!

Let us now add an activation function to each node in the complex network. From now on, the combination of $Z_{i}$ and $Act_{i}(Z_{i})$ will be fused into a single node called $N_{i}$. This modified network will be referred to as the Modified Network in further discussions.

The list of activation functions ($Act_{i}$) corresponding to each node ($N_{i}$) are mentioned below.

$N_{1}$ => $Act_{1}$ => $ReLU$

$N_{2}$ => $Act_{2}$ => $Sigmoid$

$N_{3}$ => $Act_{3}$ => $ReLU$

$N_{4}$ => $Act_{4}$ => $Sigmoid$

In [6]:
%%html
<img src="Images/Modified_Network.jpg" width=500 height=500>
In [ ]:
def ReLU(x:np.ndarray) -> np.ndarray:
    """ Returns the output of Rectified Linear Unit Function. ReLU(x) is max(0,x). Make sure that you
    first make a copy of "x" using np.copy(). Perform operations on this copy and return it.
    
    Max Marks : 1
    
    Arguments:
        x : Input to ReLU function
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!

assert( np.array_equal(ReLU(np.array([-0.2,0.5])),np.array([0,0.5])) )
In [ ]:
def diff_ReLU(x:np.ndarray) -> np.ndarray:
    """ Returns the differentiation of ReLU(x). Note, make sure that the differentiation of ReLU(x) at x=0 is 0. Make sure that you
    first make a copy of "x" using np.copy(). Perform operations on this copy and return it.
    
    Max Marks : 1
    
    Arguments:
        x : Input to the ReLU function
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# Please run these hidden tests !!

assert( np.array_equal(diff_ReLU(np.array([0.9,-0.5,0.])), np.array([1.,0.,0.])) )
In [ ]:
def get_Errors_of_Modified_Network(y_ground:np.ndarray,
                                   All_weights:OrderedDict,
                                   All_Zs:OrderedDict,
                                   All_Errors:OrderedDict) -> None:
    """ Calculates the errors of the modified network and stores these errors in "All_Errors". To incorporate the
    differentiation of sigmoid and ReLU, you can use "diff_sigmoid" and "diff_ReLU".
    
    Max Marks : 4
    
    Arguments :
        y_ground : Ground truth Array with size = [batch_size]
        
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                          All_weights["W1"] contains weight W1
                          All_weights["W2"] contains weight W2
                          All_weights["W3"] contains weight W3
                          All_weights["W4"] contains weight W4
                          All_weights["W5"] contains weight W5
                          All_weights["W6"] contains weight W6
                          All_weights["W7"] contains weight W7
        
        All_Zs : Dictionary with all stored outputs. The keys and values are listed below
                    All_Zs["Pred"] contains output Pred
                    All_Zs["Z1"] contains output Z1
                    All_Zs["Z2"] contains output Z2
                    All_Zs["Z3"] contains output Z3
                    All_Zs["Z4"] contains output Z4

        All_Errors : Dictionary with all stored errors. The keys and values are listed below.
                         All_Errors["Pred"] should contain the Error of Pred
                         All_Errors["Z1"] should contain the Error of Z1
                         All_Errors["Z2"] should contain the Error of Z2
                         All_Errors["Z3"] should contain the Error of Z3
                         All_Errors["Z4"] should contain the Error of Z4
    """

    batch_size = len(y_ground)    
    Error_Pred = (All_Zs["Pred"] - y_ground)/batch_size
    # Please calculate Error_Z1, Error_Z2, Error_Z3, Error_Z4
    # YOUR CODE HERE
    raise NotImplementedError()
    
    try:
        assert(Error_Pred.shape == (batch_size,))
        assert(Error_Z1.shape == (batch_size,))
        assert(Error_Z2.shape == (batch_size,))
        assert(Error_Z3.shape == (batch_size,))
        assert(Error_Z4.shape == (batch_size,))
    except :
        print("Error Pred : {}".format(Error_Pred.shape))
        print("Error Z1 : {}".format(Error_Z1.shape))
        print("Error Z2 : {}".format(Error_Z2.shape))
        print("Error Z3 : {}".format(Error_Z3.shape))
        print("Error Z4 : {}".format(Error_Z4.shape))
        raise ValueError("Something's wrong!")
    
    
    All_Errors["Pred"] = Error_Pred
    All_Errors["Z1"] = Error_Z1
    All_Errors["Z2"] = Error_Z2
    All_Errors["Z3"] = Error_Z3
    All_Errors["Z4"] = Error_Z4
    
In [ ]:
# Please run these hidden tests !!

Updating the weight

Whenever we train neural networks, we perform two steps. The first step is to perform backpropagation and obtain the gradients. The second step is to use these gradients to change the weights. This second step is also known as Optimization.

If you recall, the update equation mentioned at the beginning was $W_{i} = W_{i} - \alpha*\frac{\partial Loss}{\partial W_{i}}$. But up until now, we have only been computing gradients for nodes, i.e. $\frac{\partial Loss}{\partial Z_{i}}$, and not for weights, i.e. $\frac{\partial Loss}{\partial W_{i}}$.

The procedure to calculate $\frac{\partial Loss}{\partial W_{i}}$ is very straightforward. Consider that the input to the weight $W_{i}$ is $Input_{i}$ and that the other end of the weight is connected to $Output_{i}$.

Then $\frac{\partial Loss}{\partial W_{i}} = \frac{\partial Loss}{\partial Output_{i}}*Input_{i}$.

Let us take two examples to better understand this.

  1. Consider $W_{7}$ in the modified network, the input is $X_{2}$ and the other end is connected to $Z_{4}$.
    • Hence $\frac{\partial Loss}{\partial W_{7}} = \frac{\partial Loss}{\partial Z_{4}}*X_{2} = Error_{Z_{4}}*X_{2} $
  2. Consider $W_{5}$ in the modified network, the input is $N_{4}$ and the other end is connected to $Z_{2}$. If you don't understand why the input is in $N$ terms and output is in $Z$ terms, break each node $N$ into $Z$ and $Act(Z)$ and then see the connections.
    • Hence $\frac{\partial Loss}{\partial W_{5}} = \frac{\partial Loss}{\partial Z_{2}}*N_{4} = Error_{Z_{2}}*N_{4}$
In [ ]:
def apply_Gradient_Change_to_Modified_Network(alpha:float,
                                               All_weights:OrderedDict,
                                               All_Errors:OrderedDict,
                                               All_Nodes:OrderedDict,
                                               All_Xs:OrderedDict) -> None:
    """ Apply Gradient Change to weights in the modified network. And store the new weights back in the 
    'All_weights' dictionary. 
    
    Max Marks : 3
    
    Arguments:
        
        alpha : Learning rate for gradient descent
        
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                          All_weights["W1"] contains weight W1
                          All_weights["W2"] contains weight W2
                          All_weights["W3"] contains weight W3
                          All_weights["W4"] contains weight W4
                          All_weights["W5"] contains weight W5
                          All_weights["W6"] contains weight W6
                          All_weights["W7"] contains weight W7
                          
        All_Errors : Dictionary with all stored errors. The keys and values are listed below.
                     All_Errors["Pred"] contains the Error of Pred
                     All_Errors["Z1"] contains the Error of Z1
                     All_Errors["Z2"] contains the Error of Z2
                     All_Errors["Z3"] contains the Error of Z3
                     All_Errors["Z4"] contains the Error of Z4
                     
        All_Nodes : Dictionary with all stored outputs. The keys and values are listed below
                    All_Nodes["Pred"] contains output Pred
                    All_Nodes["N1"] contains output N1
                    All_Nodes["N2"] contains output N2
                    All_Nodes["N3"] contains output N3
                    All_Nodes["N4"] contains output N4
        
        All_Xs : Dictionary with all stored inputs. The keys and values are listed below
                    All_Xs["X1"] contains input X1
                    All_Xs["X2"] contains input X2
    
    """
    batch_size = len(All_Xs["X1"])
    
    # Calculate the gradients
    W1_gradient = All_Errors["Pred"]*All_Nodes["N1"]
    # Please calculate W2_gradient, W3_gradient, .... ,W7_gradient
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Making sure everything is correct
    try :
        assert(W1_gradient.shape == (batch_size,))
        assert(W2_gradient.shape == (batch_size,))
        assert(W3_gradient.shape == (batch_size,))
        assert(W4_gradient.shape == (batch_size,))
        assert(W5_gradient.shape == (batch_size,))
        assert(W6_gradient.shape == (batch_size,))
        assert(W7_gradient.shape == (batch_size,))
    except :
        print("W1 Gradient : {}".format(W1_gradient.shape))
        print("W2 Gradient : {}".format(W2_gradient.shape))
        print("W3 Gradient : {}".format(W3_gradient.shape))
        print("W4 Gradient : {}".format(W4_gradient.shape))
        print("W5 Gradient : {}".format(W5_gradient.shape))
        print("W6 Gradient : {}".format(W6_gradient.shape))
        print("W7 Gradient : {}".format(W7_gradient.shape))
        raise ValueError("Something's wrong!")
    
    All_weights["W1"] -= alpha*np.sum(W1_gradient)
    # Please update all the weights W2,W3 .... W7
    # YOUR CODE HERE
    raise NotImplementedError()
    
    
In [ ]:
# Please run these hidden tests !!

Putting it all together

We now know how to calculate the backprop equations and how to use these gradients to change the weights. Let us assemble all these components together to complete the backprop and optimization of the modified network.

In [ ]:
def init_Weights():
    # Initialize the Weights of the network
    All_weights = OrderedDict()
    All_weights["W1"] = np.abs(np.random.randn(1)).item()
    All_weights["W2"] = np.abs(np.random.randn(1)).item()
    All_weights["W3"] = np.abs(np.random.randn(1)).item()
    All_weights["W4"] = np.abs(np.random.randn(1)).item()
    All_weights["W5"] = np.abs(np.random.randn(1)).item()
    All_weights["W6"] = np.abs(np.random.randn(1)).item()
    All_weights["W7"] = np.abs(np.random.randn(1)).item()
    
    return All_weights
In [ ]:
def init_Errors():
    # Initialize the Errors of the network
    All_Errors = OrderedDict()
    All_Errors["Pred"] = None
    All_Errors["Z1"] = None
    All_Errors["Z2"] = None
    All_Errors["Z3"] = None
    All_Errors["Z4"] = None
    
    return All_Errors
In [ ]:
def init_Nodes():
    # Initialize the Nodes of the network
    All_Nodes = OrderedDict()
    All_Nodes["Pred"] = None
    All_Nodes["N1"] = None
    All_Nodes["N2"] = None
    All_Nodes["N3"] = None
    All_Nodes["N4"] = None
    
    return All_Nodes 
In [ ]:
def init_Zs():
    # Initialize the Z's of the network
    All_Zs = OrderedDict()
    All_Zs["Pred"] = None
    All_Zs["Z1"] = None
    All_Zs["Z2"] = None
    All_Zs["Z3"] = None
    All_Zs["Z4"] = None
    
    return All_Zs
In [ ]:
def shuffle_data(x1:np.ndarray,
                 x2:np.ndarray,
                 y:np.ndarray):
    """ Shuffles data after every epoch """
    num_points = len(x1)
    g = np.arange(num_points)
    np.random.shuffle(g)
    
    x1 = x1[g]
    x2 = x2[g]
    y = y[g]
    
    return x1,x2,y
In [ ]:
def forward_prop(x1:np.ndarray,
                x2:np.ndarray,
                All_weights:OrderedDict,
                All_Nodes:OrderedDict,
                All_Zs:OrderedDict) -> None :
    """ Perform forward prop of modified network and store all the node values in the All_Nodes and the 
    All_Zs dictionaries. Use the "sigmoid" and "ReLU" functions you implemented earlier.
    
    Max Marks : 2
    
    Arguments :
        
        x1 : The x1 data points of the mini-batch
        
        x2 : The x2 data points of the mini-batch
        
        All_weights : Dictionary with all stored weights. The keys and values are listed below
                          All_weights["W1"] contains weight W1
                          All_weights["W2"] contains weight W2
                          All_weights["W3"] contains weight W3
                          All_weights["W4"] contains weight W4
                          All_weights["W5"] contains weight W5
                          All_weights["W6"] contains weight W6
                          All_weights["W7"] contains weight W7
        
        All_Nodes : Dictionary with all stored outputs. The keys and values are listed below
                    All_Nodes["Pred"] contains output Pred
                    All_Nodes["N1"] contains output N1
                    All_Nodes["N2"] contains output N2
                    All_Nodes["N3"] contains output N3
                    All_Nodes["N4"] contains output N4
                    
        All_Zs : Dictionary with all stored outputs. The keys and values are listed below
                    All_Zs["Pred"] contains output Pred
                    All_Zs["Z1"] contains output Z1
                    All_Zs["Z2"] contains output Z2
                    All_Zs["Z3"] contains output Z3
                    All_Zs["Z4"] contains output Z4
    """
    
    All_Zs["Z3"] = All_weights["W6"]*x1
    All_Nodes["N3"] = ReLU(All_Zs["Z3"])
    
    # Please calculate Z4, N4, Z1, N1, Z2, N2, and Pred (Z3 and N3 are already computed above)
    # YOUR CODE HERE
    raise NotImplementedError()
    
    assert(np.array_equal(All_Zs["Pred"],All_Nodes["Pred"]))
    
    
In [ ]:
# Please run these hidden tests !!
In [ ]:
def main_function(epoch_num:int,batch_size:int, data_points:int, alpha:float,
                  x1_data:np.ndarray, x2_data:np.ndarray, y_data:np.ndarray,
                  All_Weights:OrderedDict, All_Nodes:OrderedDict, All_Zs:OrderedDict,
                  All_Errors:OrderedDict) -> None:
    """ The main function that will train your neural network for 1 epoch.
    
    Max Marks : 1
    
    Arguments:
        epoch_num : Count of which epoch is running
        batch_size : The batch size of each mini-batch input
        data_points : The total number of training data points
        alpha : The learning rate of training
        x1_data : All the x1 data points
        x2_data : All the x2 data points
        y_data : All the groundtruths
        All_Weights : A dictionary containing all the weights of the network
        All_Nodes : A dictionary containing all the outputs after activations of the network
                    Ex.  (N1, N2, N3 and N4) 
        All_Zs : A dictionary containing all the outputs before activations of the network
                 Ex. (Z1, Z2, Z3, Z4)
        All_Errors : A dictionary containing all the errors of the network
                     Ex. Error_Pred, Error_Z1, Error_Z2 and so on...
    """
      
    num_batches = int(np.ceil(data_points/batch_size))
    total_loss = 0

    # For each mini-batch
    for k in range(0,num_batches):

        # Input data
        x1_input = np.copy(np.squeeze(x1_data[k*batch_size:min((k+1)*batch_size,data_points)]))
        x2_input = np.copy(np.squeeze(x2_data[k*batch_size:min((k+1)*batch_size,data_points)]))
        y_input = np.copy(np.squeeze(y_data[k*batch_size:min((k+1)*batch_size,data_points)]))

        # Perform forward propagation - Call your appropriate function
        # YOUR CODE HERE
        raise NotImplementedError()

        # Perform back propagation to calculate the partial derivatives - Call your appropriate function
        # YOUR CODE HERE
        raise NotImplementedError()

        # Perform optimization and change weights - Call your appropriate function
        All_Xs = OrderedDict()
        All_Xs["X1"] = x1_input
        All_Xs["X2"] = x2_input
        # YOUR CODE HERE
        raise NotImplementedError()

        # Update running loss
        total_loss += np.sum((y_input-All_Nodes["Pred"])**2)

    print("Student : Epoch {} : Loss : {:.2f}".format(epoch_num,total_loss/data_points))
    print("\tW1 : {:.2f}, W2 : {:.2f}, W3 : {:.2f}, W4 : {:.2f}, W5 : {:.2f}, W6 : {:.2f}, W7 : {:.2f}\n".format(All_Weights["W1"],All_Weights["W2"],All_Weights["W3"],All_Weights["W4"],All_Weights["W5"],All_Weights["W6"],All_Weights["W7"]))    


def check_more() :
    num_epochs = 1000
    batch_size = 10
    data_points = 10000
    alpha = 0.001

    x1 = np.random.randn(data_points,1)
    x2 = np.random.randn(data_points,1)
    y = 3*ReLU(x1) + 4*x2
    All_Weights = init_Weights()
    All_Nodes = init_Nodes()
    All_Zs = init_Zs()
    All_Errors = init_Errors()


    print("Initial W1 : {:.2f}, W2 : {:.2f}, W3 : {:.2f}, W4 : {:.2f}, W5 : {:.2f}, W6 : {:.2f}, W7 : {:.2f}".format(All_Weights["W1"],All_Weights["W2"],All_Weights["W3"],All_Weights["W4"],All_Weights["W5"],All_Weights["W6"],All_Weights["W7"]))
    for k in range(num_epochs) :
        x1,x2,y = shuffle_data(x1,x2,y)
        main_function(k,batch_size,data_points,alpha,x1,x2,y,All_Weights,All_Nodes,All_Zs,All_Errors)

check_more()
In [ ]:
# Please run these hidden tests !!