PSTAT197A/CMPSC190DD Fall 2022
UCSB
Before your section meetings this week:
Part of your assignment this week: fill out the midquarter self-evaluation.
The add code request form for capstones opens later this week. Think now about whether you’re interested and check the schedule.
What do you know (or have you heard) about neural networks?
what are they
what are they used for
any other info?
Consider an arbitrary statistical model with one response \(Y\) and three predictors \(x_1, x_2, x_3\).
A simple diagram of the model would look like this:
A model that maps predictors directly to the response has just two “layers”:
an input layer \(X\)
an output layer \(Y\) (or more accurately \(\mathbb{E}Y\))
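In code, this direct map is just a generalized linear model. A minimal NumPy sketch (the logistic link, coefficient values, and dimensions below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # input layer: x_1, x_2, x_3
beta = np.array([0.5, -1.0, 2.0])     # one coefficient per predictor (hypothetical values)

# output layer: E[Y] modeled directly as a function of X, here through a logistic link
expected_y = 1 / (1 + np.exp(-(X @ beta)))
```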
Neural networks add layers between the input and output.
one input layer
one hidden layer
one output layer
one parameter per edge
Let \(Y\in\mathbb{R}^n\) and \(X \in \mathbb{R}^{n\times p}\) represent some data. The vanilla neural network is:
\[\begin{aligned} \color{#eed5b7}{\mathbb{E}Y} &= \sigma_z(\color{#7ac5cd}{Z}\color{#8b3e2f}{\beta}) \qquad &\text{output layer}\\ \color{#7ac5cd}{Z} &= \left[\sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_1}) \;\cdots\; \sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_M})\right] \qquad &\text{hidden layer} \\ \color{#66cdaa}{X} &= \left[x_1 \;\cdots\; x_p\right] \qquad&\text{input layer} \end{aligned}\]

Notice that the output is simply a long composition:
\[ Y = f(X) \quad\text{where}\quad f \equiv \sigma_z \circ h_\beta\circ \sigma_x \circ h_\alpha \]
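As a concrete sketch, the composition can be written directly in NumPy. The dimensions, random parameter values, and the use of a sigmoid for both \(\sigma_x\) and \(\sigma_z\) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 100, 3, 4                   # observations, predictors, hidden units (arbitrary)
X = rng.normal(size=(n, p))           # input layer

alpha = rng.normal(size=(p, M))       # columns are alpha_1, ..., alpha_M (input -> hidden)
beta = rng.normal(size=(M, 1))        # hidden -> output weights

sigmoid = lambda t: 1 / (1 + np.exp(-t))

# the long composition sigma_z o h_beta o sigma_x o h_alpha
Z = sigmoid(X @ alpha)                # hidden layer, n x M
y_hat = sigmoid(Z @ beta)             # output layer, estimate of E[Y]
```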
each function is either known or linear
compute parameters by minimizing a loss function
minimization by gradient descent
Denoting the parameter vector by \(\theta = \left(\alpha^T \; \beta^T\right)\), initialize \(\theta^{(0)}\) and repeat:
\[ \theta^{(r + 1)} \longleftarrow \theta^{(r)} - c_r \nabla L^{(r)} \]
\(L^{(r)}\) is a loss function evaluated at the \(r\)th iteration
\(c_r\) is the ‘learning rate’; can be fixed or chosen adaptively
each full pass through the training data is one ‘epoch’
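For intuition, here is a minimal gradient descent loop on a toy least-squares problem; the simulated data, fixed learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)   # toy data

theta = np.zeros(3)                         # theta^(0)
c = 0.1                                     # fixed learning rate c_r

for r in range(200):                        # each pass over the full data set is one epoch
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of the mean squared error loss
    theta = theta - c * grad                # theta^(r+1) <- theta^(r) - c_r * gradient
```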
Individual parameter updates at the \(r\)th iteration are given by:
\[ \beta_{m}^{(r + 1)} \longleftarrow \beta_{m}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \beta_{m}}\Big\rvert_{\beta_{m} = \beta_{m}^{(r)}}}_{\text{gradient at current iteration}} \\ \alpha_{mp}^{(r + 1)} \longleftarrow \alpha_{mp}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha_{mp}}\Big\rvert_{\alpha_{mp} = \alpha_{mp}^{(r)}}}_{\text{gradient at current iteration}} \]
The gradient is easy to compute. Denoting \(t_{i} = z_{i}^T\beta\):
\[ \begin{aligned} \frac{\partial L_i}{\partial \alpha_{mp}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial z_{im}} \frac{\partial z_{im}}{\partial \alpha_{mp}}}_{s_{im}x_{ip}} \\ \frac{\partial L_i}{\partial \beta_{m}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial \beta_{m}}}_{z_{im}} \end{aligned} \]
Explicitly computing gradients for each update gives the backpropagation algorithm of Rumelhart, Hinton, and Williams (1986).
Initialize parameters and repeat:
Forward pass: compute \(f(X), Z\)
Backward pass: compute \(\delta_i, s_{im}\) by ‘back-propagating’ current estimates
Update the weights
\[
\begin{aligned}
\hat{\beta}_{m} &\longleftarrow
\hat{\beta}_{m} - c_r \frac{1}{n}\sum_i \delta_{i}z_{im} \\
\hat{\alpha}_{mp} &\longleftarrow
\hat{\alpha}_{mp} - c_r \frac{1}{n}\sum_i \delta_i s_{im}x_{ip}
\end{aligned}
\]
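A from-scratch sketch of one full training loop for the single-hidden-layer model, assuming sigmoid activations and a squared-error loss (both are illustrative choices; the notes leave \(\sigma\) and \(L\) generic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M = 100, 3, 4
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=(n, 1)).astype(float)    # hypothetical binary response

sigmoid = lambda t: 1 / (1 + np.exp(-t))
alpha = 0.1 * rng.normal(size=(p, M))                # input -> hidden weights
beta = 0.1 * rng.normal(size=(M, 1))                 # hidden -> output weights
c = 0.5                                              # learning rate

for epoch in range(100):
    # forward pass: compute Z and f(X)
    Z = sigmoid(X @ alpha)                           # n x M
    f = sigmoid(Z @ beta)                            # n x 1

    # backward pass with L_i = (y_i - f_i)^2:
    # delta_i = dL_i/df_i * df_i/dt_i
    delta = -2 * (y - f) * f * (1 - f)               # n x 1
    # entries delta_i * s_im, where s_im = beta_m * sigma'(x_i^T alpha_m)
    S = (delta @ beta.T) * Z * (1 - Z)               # n x M

    # update the weights with the averaged gradients
    beta = beta - c * (Z.T @ delta) / n              # (1/n) sum_i delta_i z_im
    alpha = alpha - c * (X.T @ S) / n                # (1/n) sum_i delta_i s_im x_ip
```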
Computing the gradient exactly requires summing over all observations \(i = 1, \dots, n\):
\[ g = \nabla \frac{1}{n} \sum_i L_i \]
It’s much faster to estimate the gradient based on a “batch” of \(m\) observations (subsample) \(J \subset \{1, \dots, n\}\):
\[ \hat{g} = \nabla \frac{1}{m}\sum_{i \in J} L_i \]
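A sketch of stochastic (minibatch) gradient descent for the same toy least-squares problem as above; the batch size and number of steps are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 32                                  # n observations, batch size m
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

theta = np.zeros(3)
c = 0.1

for step in range(500):
    J = rng.choice(n, size=m, replace=False)     # random batch J, a subset of {1, ..., n}
    Xb, yb = X[J], y[J]
    g_hat = Xb.T @ (Xb @ theta - yb) / m         # gradient estimated from the batch only
    theta = theta - c * g_hat
```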
Modern methods for training neural networks update parameters using gradient estimates and adaptive learning rates.
One more hidden unit.
One more hidden layer.
Networks of arbitrary width and depth in which the connectivity is uni-directional are known as “sequential” or “feedforward” networks/models.
\[ \begin{aligned} \mathbb{E}Y &= \sigma_1(Z_1\beta_1) &\text{output layer}\\ Z_k &= \sigma_{k + 1}(Z_{k + 1} \beta_{k + 1}) &\text{hidden layers } k = 1, \dots, D - 1\\ Z_D &\equiv X &\text{input layer} \end{aligned} \]
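In practice, feedforward models like this are usually specified with a deep learning library rather than coded by hand. A minimal sketch assuming TensorFlow/Keras is installed; the layer widths, activations, and the Adam optimizer are illustrative choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

# sequential (feedforward) model: p = 3 inputs, two hidden layers, one output unit
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(3,)),    # hidden layer
    layers.Dense(4, activation='relu'),                      # hidden layer
    layers.Dense(1, activation='sigmoid'),                   # output layer, estimates E[Y]
])

# adaptive-learning-rate optimizer and a loss appropriate for a binary response
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```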
Suppose:
\(\mathbb{E}Y = f(X)\) gives the ‘true’ relationship
\(\tilde{f}(X)\) represents the output layer of a feedforward neural network with one hidden layer of width \(w\)
Hornik, Stinchcombe, and White (1989) showed that, under some regularity conditions, for any \(\epsilon > 0\) there exists a width \(w\) and parameters such that:
\[ \sup_x \|f(x) - \tilde{f}(x)\| < \epsilon \]
Similar results exist for deep networks with bounded width.
These results do tell us that in most problems there exist both deep and shallow networks that approximate the true input-output relationship arbitrarily well.
They don’t tell us how to find them.
Several factors can affect actual performance in practice:
Activation functions \(\sigma(\cdot)\) determine whether a given unit ‘fires’.
For example:
if \(Z_{k - 1}\beta_{kj} = -28.2\) and \(\sigma_k(x) = \frac{1}{1 + e^{-x}}\),
then \(z_{kj} = \sigma_k(Z_{k - 1}\beta_{kj}) \approx 0\).
The most common activation functions are:
(identity) \(\sigma(x) = x\)
(sigmoid) \(\sigma(x) = \frac{1}{1 + \exp\{-x\}}\)
(hyperbolic tangent) \(\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
(rectified linear unit) \(\sigma(x) = \max (0, x)\)
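Each of these is a one-liner in NumPy; evaluating the sigmoid at the value from the earlier example shows how a strongly negative input keeps a unit from firing:

```python
import numpy as np

identity = lambda x: x
sigmoid = lambda x: 1 / (1 + np.exp(-x))
tanh = np.tanh                                 # (e^x - e^-x) / (e^x + e^-x)
relu = lambda x: np.maximum(0.0, x)

# a strongly negative input saturates the sigmoid, so the unit barely 'fires'
sigmoid(-28.2)                                 # about 5.7e-13, effectively zero
```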
The most common loss function for classification is:
\[ L(Y, f(X)) = -\frac{1}{n}\sum_i \left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right] \qquad\text{(cross-entropy)} \]
where \(p_i = f(x_i)\) is the predicted probability that \(y_i = 1\).
The most common loss function for regression is:
\[ L(Y, f(X)) = \frac{1}{n}\sum_i (y_i - f(x_i))^2 \qquad\text{(mean squared error)} \]
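Both losses are straightforward to compute directly; a NumPy sketch with hypothetical values:

```python
import numpy as np

def cross_entropy(y, p):
    """Binary cross-entropy: y in {0, 1}, p = predicted probability f(x_i)."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_squared_error(y, f):
    """Squared-error loss for regression."""
    return np.mean((y - f) ** 2)

# toy check with hypothetical predictions
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
cross_entropy(y, p)          # about 0.228
mean_squared_error(y, p)     # about 0.047
```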