Artificial Neural Networks

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz



Before your section meetings this week:

Part of your assignment this week: fill out midquarter self-evaluation.

The add code request form for capstones opens later this week. Think now about whether you’re interested and check the schedule.

Neural Networks

Discuss with your table (3min)

What do you know (or have you heard) about neural networks?

  • what are they

  • what are they used for

  • any other info?

Graphical model diagram

Consider an arbitrary statistical model with one response \(Y\) and three predictors \(x_1, x_2, x_3\).

A simple diagram of the model would look like this:

[Diagram: predictors \(x_1, x_2, x_3\) each connected directly to the response \(Y\)]

Graph layers

A model that maps predictors directly to the response has just two “layers”:

  • an input layer \(X\)

  • an output layer \(Y\) (or more accurately \(\mathbb{E}Y\))

Neural networks add layers between the input and output.

‘Vanilla’ neural network

[Diagram: three input units, four hidden units, one output unit; every input connects to every hidden unit, and every hidden unit connects to the output]
  • one input layer

  • one hidden layer

  • one output layer

  • one parameter per edge

More formally

Let \(Y\in\mathbb{R}^n\) and \(X \in \mathbb{R}^{n\times p}\) represent some data. The vanilla neural network is:

\[\begin{aligned} \color{#eed5b7}{\mathbb{E}Y} &= \sigma_z(\color{#7ac5cd}{Z}\color{#8b3e2f}{\beta}) \qquad &\text{output layer}\\ \color{#7ac5cd}{Z} &= \left[\sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_1}) \;\cdots\; \sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_M})\right] \qquad &\text{hidden layer} \\ \color{#66cdaa}{X} &= \left[x_1 \;\cdots\; x_p\right] \qquad&\text{input layer} \end{aligned}\]
  • \(\sigma_x, \sigma_z\) are (known) activation functions
  • \(\color{#8b3e2f}{\beta}, \color{#8b3e2f}{\alpha}\) are weights (model parameters)
    • \(M(p + 1)\) of them as written: \(Mp\) hidden-layer weights plus \(M\) output-layer weights
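As a minimal numpy sketch of the forward pass above (assuming sigmoid activations for both \(\sigma_x\) and \(\sigma_z\); the dimensions \(n = 5\), \(p = 3\), \(M = 4\) are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def vnn_forward(X, alpha, beta):
    """Forward pass of the vanilla neural network.

    X     : (n, p) input matrix
    alpha : (p, M) hidden-layer weights (columns are alpha_1, ..., alpha_M)
    beta  : (M,)   output-layer weights
    """
    Z = sigmoid(X @ alpha)    # hidden layer, shape (n, M)
    return sigmoid(Z @ beta)  # E[Y], shape (n,)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # n = 5 observations, p = 3 predictors
alpha = rng.normal(size=(3, 4))  # M = 4 hidden units
beta = rng.normal(size=4)
yhat = vnn_forward(X, alpha, beta)
```

Note that the weight count checks out: `alpha` has \(Mp = 12\) entries and `beta` has \(M = 4\), for \(M(p+1) = 16\) parameters.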

Training a network

Notice that the output is simply a long composition:

\[ Y = f(X) \quad\text{where}\quad f \equiv \sigma_z \circ h_\beta\circ \sigma_x \circ h_\alpha \]

  • each function is either known or linear

  • compute parameters by minimizing a loss function

  • minimization by gradient descent

Gradient descent

Denoting the parameter vector by \(\theta = \left(\alpha^T \; \beta^T\right)\), initialize \(\theta^{(0)}\) and repeat:

\[ \theta^{(r + 1)} \longleftarrow \theta^{(r)} - c_r \nabla L^{(r)} \]

  • \(L^{(r)}\) is a loss function evaluated at the \(r\)th iteration

    • of the form \(L^{(r)} = \frac{1}{n}\sum_i L_i (\theta^{(r)}, Y)\)
  • \(c_r\) is the ‘learning rate’; can be fixed or chosen adaptively

  • each cycle through all the parameters is one ‘epoch’
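The update rule can be sketched on a problem where the answer is known, here least-squares regression with a fixed learning rate (the data and rate are illustrative choices):

```python
import numpy as np

# minimize L(theta) = (1/n) * sum_i (y_i - x_i^T theta)^2 by gradient descent
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true

theta = np.zeros(2)  # theta^(0)
c = 0.1              # fixed learning rate c_r
for r in range(500):
    grad = -2.0 / len(y) * X.T @ (y - X @ theta)  # gradient of L at theta^(r)
    theta = theta - c * grad                      # descent step
```

After 500 iterations `theta` has converged to the true coefficients; with noisy data it would converge to the least-squares solution instead.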

Updates for the VNN

Individual parameter updates at the \(r\)th iteration are given by:

\[ \beta_{m}^{(r + 1)} \longleftarrow \beta_{m}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \beta_{m}}\Big\rvert_{\beta_{m} = \beta_{m}^{(r)}}}_{\text{gradient at current iteration}} \\ \alpha_{mp}^{(r + 1)} \longleftarrow \alpha_{mp}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha_{mp}}\Big\rvert_{\alpha_{mp} = \alpha_{mp}^{(r)}}}_{\text{gradient at current iteration}} \]

Chain rule

The gradient is easy to compute. Denoting \(t_{i} = z_{i}^T\beta\):

\[ \begin{aligned} \frac{\partial L_i}{\partial \alpha_{mp}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial z_{im}} \frac{\partial z_{im}}{\partial \alpha_{mp}}}_{s_{im}x_{ip}} \\ \frac{\partial L_i}{\partial \beta_{m}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial \beta_{m}}}_{z_{im}} \end{aligned} \]

Explicitly computing gradients for each update gives the backpropagation algorithm of Rumelhart, Hinton, and Williams (1986).


Initialize parameters and repeat:

  1. Forward pass: compute \(f(X), Z\)

  2. Backward pass: compute \(\delta_i, s_{im}\) by ‘back-propagating’ current estimates

  3. Update the weights
    \[ \hat{\beta}_{m} \longleftarrow \hat{\beta}_{m} - c_r \frac{1}{n}\sum_i \delta_{i}z_{im} \\ \hat{\alpha}_{mp} \longleftarrow \hat{\alpha}_{mp} - c_r \frac{1}{n}\sum_i s_{im}x_{ip} \]

Gradient estimation

Explicitly computing the gradient sums over all observations \(i = 1, \dots, n\):

\[ g = \nabla \frac{1}{n} \sum_i L_i \]

It’s much faster to estimate the gradient based on a “batch” of \(m\) observations (subsample) \(J \subset \{1, \dots, n\}\):

\[ \hat{g} = \nabla \frac{1}{m}\sum_{i \in J} L_i \]
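A batch gradient estimate can be sketched by subsampling the loss before differentiating; here a squared-error gradient with illustrative sizes \(n = 1000\), \(m = 32\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 32  # full sample size and batch size

X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 2.0])
theta = np.zeros(2)

def grad(Xb, yb, theta):
    # gradient of mean squared error on (a batch of) the data
    return -2.0 / len(yb) * Xb.T @ (yb - Xb @ theta)

J = rng.choice(n, size=m, replace=False)  # batch indices J
g_hat = grad(X[J], y[J], theta)           # batch estimate: m observations
g = grad(X, y, theta)                     # full gradient: n observations
```

Each update then touches only \(m\) rows of the data rather than all \(n\), at the cost of noise in `g_hat`.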

Modern optimization methods

Modern methods for training neural networks update parameters using gradient estimates and adaptive learning rates (see Goodfellow, Bengio, and Courville 2016).

  • stochastic gradient descent (SGD): Bottou (1998) replaces \(g\) by \(\hat{g}\)
  • AdaGrad: Duchi, Hazan, and Singer (2011) use SGD with adaptive learning rates
  • Adam: Kingma and Ba (2014) apply bias corrections to \(\hat{g}\) based on moment estimates
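The Adam update can be sketched directly from Kingma and Ba (2014): exponential moving averages of the gradient and its square, with bias corrections, scale each coordinate's step (the quadratic objective and hyperparameter values below are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, r, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update at iteration r (1-indexed)."""
    m = b1 * m + (1 - b1) * g       # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2  # second-moment estimate
    m_hat = m / (1 - b1 ** r)       # bias corrections
    v_hat = v / (1 - b2 ** r)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize L(theta) = theta^2 from theta = 5
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for r in range(1, 5001):
    g = 2.0 * theta                 # exact gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, r, lr=0.05)
```

Note that the per-coordinate ratio \(\hat{m}/\sqrt{\hat{v}}\) makes the step size roughly `lr` regardless of the gradient's magnitude, which is what "adaptive learning rate" means here.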

Increasing width

[Diagram: the same network with a fifth hidden unit added, increasing the width of the hidden layer]

One more hidden unit.

Increasing depth

[Diagram: a second hidden layer of four units inserted between the first hidden layer and the output, increasing the depth of the network]

One more hidden layer.

Sequential networks

Networks of arbitrary width and depth in which the connectivity is uni-directional are known as “sequential” or “feedforward” networks/models.

\[ \begin{aligned} \mathbb{E}Y &= \sigma_1(Z_1\beta_1) &\text{output layer}\\ Z_k &= \sigma_k(Z_{k + 1} \beta_k) &\text{hidden layers } k = 2, \dots, D - 1\\ Z_D &\equiv X &\text{input layer} \end{aligned} \]

  • chain rule calculations get longer but are otherwise the same
  • “universal approximation” properties
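A sequential network of any width and depth is just a loop over the layer maps; a minimal sketch (the layer sizes and the choice of sigmoid everywhere are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def feedforward(X, weights, activations):
    """Forward pass through a sequential network.

    weights     : list of weight matrices, input side first
    activations : list of activation functions, one per layer
    """
    Z = X
    for W, sigma in zip(weights, activations):
        Z = sigma(Z @ W)  # each layer transforms the previous one
    return Z

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))  # n = 10, p = 3
weights = [rng.normal(size=(3, 5)),  # hidden layer of width 5
           rng.normal(size=(5, 4)),  # hidden layer of width 4
           rng.normal(size=(4, 1))]  # output layer
activations = [sigmoid, sigmoid, sigmoid]
out = feedforward(X, weights, activations)
```

Adding width means widening a matrix in `weights`; adding depth means appending another matrix and activation to the lists.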

Approximation properties


  • \(\mathbb{E}Y = f(X)\) gives the ‘true’ relationship

  • \(\tilde{f}(X)\) represents the output layer of a feedforward neural network with one hidden layer of width \(w\)

Hornik, Stinchcombe, and White (1989) showed that, under some regularity conditions, for any \(\epsilon > 0\) there exists a width \(w\) and parameters such that:

\[ \sup_x \|f(x) - \tilde{f}(x)\| < \epsilon \]

Approximation properties

Similar results exist for deep networks with bounded width (Lu et al. 2017).

  • These results do tell us that in most problems there exist both deep and shallow networks that approximate the true input-output relationship arbitrarily well

  • They don’t tell us how to find them.

Performance considerations

Several factors can affect actual performance in practice:

  1. architecture (network structure)
  2. activation function(s)
  3. loss function
  4. optimization method
  5. parameter initialization and training epochs
  6. data quality (don’t forget this one!)


Activation functions \(\sigma(\cdot)\) determine whether a given unit ‘fires’.

[Diagram: a unit’s incoming values are summed (\(\Sigma\)) and passed through the activation \(\sigma(\cdot)\) to produce the unit’s output]

For example:

  • if \(Z_{k - 1}\beta_{kj} = -28.2\) and \(\sigma_k(x) = \frac{1}{1 + e^{-x}}\),

  • then \(z_{kj} = \sigma_k(Z_{k - 1}\beta_{kj}) \approx 0\): the unit does not fire.

Common activation functions

The most common activation functions are:

  • (identity) \(\sigma(x) = x\)

  • (sigmoid) \(\sigma(x) = \frac{1}{1 + \exp\{-x\}}\)

  • (hyperbolic tangent) \(\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

  • (rectified linear unit) \(\sigma(x) = \max (0, x)\)
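The four activations are one-liners in numpy; the saturation example from the previous slide falls out directly:

```python
import numpy as np

def identity(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# a large negative input drives the sigmoid output to (numerically) zero,
# so the unit does not fire
z = sigmoid(-28.2)
```

In practice `np.tanh` would be used directly; the explicit form is shown to match the slide.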

Loss functions

The most common loss function for classification is

\[ L(Y, f(X)) = -\frac{1}{n}\sum_i \left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right] \qquad\text{(cross-entropy)} \]

where \(p_i = f(x_i)\) is the predicted probability that \(y_i = 1\).

The most common loss function for regression is:

\[ L(Y, f(X)) = \frac{1}{n}\sum_i (y_i - f(x_i))^2 \qquad\text{(mean squared error)} \]
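Both losses translate directly to numpy; the toy labels and predictions below are illustrative:

```python
import numpy as np

def cross_entropy(y, p):
    # y in {0, 1}; p = f(x_i) are predicted probabilities in (0, 1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse(y, f):
    # squared-error loss for regression
    return np.mean((y - f) ** 2)

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.6])
ce = cross_entropy(y, p)                            # classification loss
err = mse(np.array([1.0, 2.0]), np.array([1.5, 2.0]))  # regression loss
```

In production code the cross-entropy is usually computed from pre-activation values (a "logit" formulation) to avoid `log(0)` when predictions saturate.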


Bottou, Léon. 1998. “Online Learning and Stochastic Approximations.” On-Line Learning in Neural Networks 17 (9): 142.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12 (7).
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66.
Kingma, Diederik P, and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv Preprint arXiv:1412.6980.
Lu, Zhou, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. 2017. “The Expressive Power of Neural Networks: A View from the Width.” Advances in Neural Information Processing Systems 30.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.