PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

*Before your section meetings this week:*

- install Python
- complete pre-lab activity (to be posted with lab)

*Part of your assignment this week:* fill out midquarter self-evaluation.

An add code request form for capstones will be posted later this week. Think now about whether you’re interested and check the schedule.

What do you know (or have you heard) about neural networks?

- what are they?
- what are they used for?
- any other info?

Consider an arbitrary statistical model with one response \(Y\) and three predictors \(x_1, x_2, x_3\).

A simple diagram of the model would look like this:

A model that maps predictors directly to the response has just two **“layers”**:

- an input layer \(X\)
- an output layer \(Y\) (or more accurately \(\mathbb{E}Y\))

Neural networks add layers between the input and output.

- one input layer
- one hidden layer
- one output layer
- one parameter per edge

Let \(Y\in\mathbb{R}^n\) and \(X \in \mathbb{R}^{n\times p}\) represent some data. The **vanilla neural network** with \(M\) hidden units is:

\[ \mathbb{E}Y = \sigma_z(Z\beta) \quad\text{where}\quad Z = \sigma_x(X\alpha) \]

- \(\sigma_x, \sigma_z\) are (known) *activation* functions
- \(\color{#8b3e2f}{\beta}, \color{#8b3e2f}{\alpha}\) are *weights* (model parameters)
- \(p(M + 1)\) of them as written

Notice that the output is simply a long composition:

\[ Y = f(X) \quad\text{where}\quad f \equiv \sigma_z \circ h_\beta\circ \sigma_x \circ h_\alpha \]

- each function is either known or linear
- compute parameters by minimizing a loss function
- minimization by gradient descent
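As a concrete illustration, the composition can be evaluated inside-out with ordinary matrix operations. A minimal numpy sketch, assuming sigmoid activations for both \(\sigma_x\) and \(\sigma_z\) and arbitrary dimensions \(n = 5\), \(p = 3\), \(M = 4\) (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 5, 3, 4                      # observations, predictors, hidden units

X = rng.normal(size=(n, p))            # input layer
alpha = rng.normal(size=(p, M))        # first-layer weights (p x M)
beta = rng.normal(size=M)              # second-layer weights (M,)

sigmoid = lambda t: 1 / (1 + np.exp(-t))

# f = sigma_z o h_beta o sigma_x o h_alpha, evaluated inside-out
Z = sigmoid(X @ alpha)                 # sigma_x(h_alpha(X)), shape (n, M)
y_hat = sigmoid(Z @ beta)              # sigma_z(h_beta(Z)), shape (n,)

print(y_hat.shape)                     # (5,)
```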

Denoting the parameter vector by \(\theta = \left(\alpha^T \; \beta^T\right)\), initialize \(\theta^{(0)}\) and repeat:

\[ \theta^{(r + 1)} \longleftarrow \theta^{(r)} - c_r \nabla L^{(r)} \]

- \(L^{(r)}\) is the loss function evaluated at the \(r\)th iteration, of the form \(L^{(r)} = \frac{1}{n}\sum_i L_i (\theta^{(r)}, Y)\)
- \(c_r\) is the ‘learning rate’; it can be fixed or chosen adaptively
- each full pass through the training data is one ‘epoch’
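A minimal sketch of the descent loop on a toy one-parameter loss (the quadratic loss, starting value, and learning rate here are illustrative choices, not from the slides):

```python
# toy loss: L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
grad = lambda theta: 2 * (theta - 3.0)

theta = 0.0           # initialize theta^(0)
c = 0.1               # fixed learning rate c_r
for r in range(100):  # 100 iterations
    theta = theta - c * grad(theta)   # descend: step against the gradient

print(round(theta, 4))  # converges to the minimizer 3.0
```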

Individual parameter updates at the \(r\)th iteration are given by:

\[ \beta_{m}^{(r + 1)} \longleftarrow \beta_{m}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \beta_{m}}\Big\rvert_{\beta_{m} = \beta_{m}^{(r)}}}_{\text{gradient at current iteration}} \\ \alpha_{mp}^{(r + 1)} \longleftarrow \alpha_{mp}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha_{mp}}\Big\rvert_{\alpha_{mp} = \alpha_{mp}^{(r)}}}_{\text{gradient at current iteration}} \]

The gradient is easy to compute. Denoting \(t_{i} = z_{i}^T\beta\):

\[ \begin{aligned} \frac{\partial L_i}{\partial \alpha_{mp}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial z_{im}} \frac{\partial z_{im}}{\partial \alpha_{mp}}}_{s_{im}x_{ip}} \\ \frac{\partial L_i}{\partial \beta_{m}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial \beta_{m}}}_{z_{im}} \end{aligned} \]

Explicitly computing gradients for each update gives the *backpropagation* algorithm of Rumelhart, Hinton, and Williams (1986).

Initialize parameters and repeat:

**Forward pass**: compute \(f(X), Z\)

**Backward pass**: compute \(\delta_i, s_{mi}\) by ‘back-propagating’ current estimates

Update the weights:

\[ \hat{\beta}_{km} \longleftarrow \hat{\beta}_{km} - c_r \frac{1}{n}\sum_i \delta_{ki}z_{mi} \\ \hat{\alpha}_{mp} \longleftarrow \hat{\alpha}_{mp} - c_r \frac{1}{n}\sum_i s_{mi}x_{ip} \]
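The forward/backward passes can be verified numerically. The sketch below, assuming a sigmoid hidden layer, an identity output activation, mean squared error loss, and arbitrary dimensions (all illustrative choices), computes the chain-rule gradients and checks one entry against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 20, 3, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
alpha = rng.normal(size=(p, M))
beta = rng.normal(size=M)

sigmoid = lambda t: 1 / (1 + np.exp(-t))

def loss(alpha, beta):
    Z = sigmoid(X @ alpha)
    f = Z @ beta                      # identity output activation
    return np.mean((y - f) ** 2)

# forward pass
Z = sigmoid(X @ alpha)
f = Z @ beta

# backward pass: analytic gradients from the chain rule
delta = -2 * (y - f) / n                    # dL_i/df for MSE
g_beta = Z.T @ delta                        # dL/dbeta, shape (M,)
s = (delta[:, None] * beta) * Z * (1 - Z)   # back-propagated to hidden layer
g_alpha = X.T @ s                           # dL/dalpha, shape (p, M)

# finite-difference check on one coordinate of alpha
eps = 1e-6
a2 = alpha.copy(); a2[0, 0] += eps
fd = (loss(a2, beta) - loss(alpha, beta)) / eps
print(abs(fd - g_alpha[0, 0]))              # tiny: the two gradients agree
```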

Computing the gradient exactly requires summing over all observations \(i = 1, \dots, n\):

\[ g = \nabla \frac{1}{n} \sum_i L_i \]

It’s much faster to *estimate* the gradient based on a “batch” of \(m\) observations (subsample) \(J \subset \{1, \dots, n\}\):

\[ \hat{g} = \nabla \frac{1}{m}\sum_{i \in J} L_i \]
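A sketch of the batch estimate for a simple squared-error loss on a linear model (the data-generating setup, coefficients, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 1000, 3, 64                 # n observations, batch size m
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
theta = np.zeros(p)

def grad(idx):
    # gradient of (1/|idx|) * sum_{i in idx} (y_i - x_i^T theta)^2
    r = X[idx] @ theta - y[idx]
    return 2 * X[idx].T @ r / len(idx)

g_full = grad(np.arange(n))            # exact gradient: all n observations
J = rng.choice(n, size=m, replace=False)
g_hat = grad(J)                        # minibatch estimate from m observations

# the estimate is noisy but points in roughly the same direction
print(np.linalg.norm(g_hat - g_full), np.linalg.norm(g_full))
```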

Modern methods for training neural networks update parameters using gradient estimates and adaptive learning rates (Bottou 1998; Duchi, Hazan, and Singer 2011; Kingma and Ba 2014).

Adding one more hidden unit makes the network wider; adding one more hidden layer makes it deeper.

Networks of arbitrary width and depth in which the connectivity is uni-directional are known as “sequential” or “feedforward” networks/models.

\[ \begin{aligned} \mathbb{E}Y &= \sigma_1(Z_1\beta_1) &\text{output layer}\\ Z_k &= \sigma_k(Z_{k + 1} \beta_k) &\text{hidden layers } k = 2, \dots, D - 1\\ Z_D &\equiv X &\text{input layer} \end{aligned} \]

- chain rule calculations get longer but are otherwise the same
- “universal approximation” properties
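The recursion amounts to a loop that repeatedly applies “multiply by weights, then activate.” A sketch assuming sigmoid activations at every layer and arbitrary layer widths (both illustrative choices); for simplicity the loop runs from the input forward rather than using the descending index convention above:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1 / (1 + np.exp(-t))

# layer widths: input p = 3, hidden layers of width 4 and 2, scalar output
widths = [3, 4, 2, 1]
betas = [rng.normal(size=(widths[k], widths[k + 1]))
         for k in range(len(widths) - 1)]

def forward(X, betas):
    Z = X                              # input layer
    for B in betas:                    # each step: Z <- sigma(Z B)
        Z = sigmoid(Z @ B)
    return Z                           # output layer

X = rng.normal(size=(5, 3))
print(forward(X, betas).shape)         # (5, 1)
```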

Suppose:

\(\mathbb{E}Y = f(X)\) gives the ‘true’ relationship

\(\tilde{f}(X)\) represents the output layer of a feedforward neural network with one hidden layer of width \(w\)

Hornik, Stinchcombe, and White (1989) showed that, under some regularity conditions, for any \(\epsilon > 0\) there exists a width \(w\) and parameters such that:

\[ \sup_x \|f(x) - \tilde{f}(x)\| < \epsilon \]
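This is an existence result, not a recipe. As a toy illustration of the flavor of the claim (not the theorem itself, which concerns sigmoid-type activations), a hidden layer of width \(w = 2\) with ReLU units represents \(f(x) = |x|\) exactly, since \(|x| = \max(0, x) + \max(0, -x)\):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

# hand-picked weights: hidden weights [1, -1], output weights [1, 1]
def f_tilde(x):
    z = relu(np.outer(x, [1.0, -1.0]))   # hidden layer, width w = 2
    return z @ np.array([1.0, 1.0])      # identity output layer

x = np.linspace(-5, 5, 101)
print(np.max(np.abs(f_tilde(x) - np.abs(x))))  # 0.0: exact representation
```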

Similar results exist for deep networks with bounded width (Lu et al. 2017).

These results *do* tell us that in most problems there exist both deep and shallow networks that approximate the true input-output relationship arbitrarily well.

They *don’t* tell us how to find them.

Several factors can affect actual performance in practice:

- *architecture* (network structure)
- activation function(s)
- loss function
- optimization method
- parameter initialization and training epochs
- data quality (don’t forget this one!)

Activation functions \(\sigma(\cdot)\) determine whether a given unit ‘fires’.

For example:

if \(Z_{k - 1}\beta_{kj} = -28.2\) and \(\sigma_k(x) = \frac{1}{1 + e^{-x}}\),

then \(z_{kj} = \sigma_k(Z_{k - 1}\beta_{kj}) \approx 0\).
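Checking the arithmetic directly:

```python
import math

sigmoid = lambda x: 1 / (1 + math.exp(-x))

# a large negative pre-activation saturates the sigmoid near zero,
# so the unit effectively does not 'fire'
print(sigmoid(-28.2))   # roughly 6e-13, i.e. z_kj is approximately 0
```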

The most common activation functions are:

- (identity) \(\sigma(x) = x\)
- (sigmoid) \(\sigma(x) = \frac{1}{1 + \exp\{-x\}}\)
- (hyperbolic tangent) \(\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
- (rectified linear unit) \(\sigma(x) = \max (0, x)\)
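All four are one-liners; the evaluation points below are arbitrary, just to show how each function squashes (or doesn’t squash) its input:

```python
import math

identity = lambda x: x
sigmoid  = lambda x: 1 / (1 + math.exp(-x))
tanh     = lambda x: (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
relu     = lambda x: max(0.0, x)

for name, s in [("identity", identity), ("sigmoid", sigmoid),
                ("tanh", tanh), ("ReLU", relu)]:
    print(f"{name}: sigma(-1) = {s(-1):.3f}, sigma(1) = {s(1):.3f}")
```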

The most common loss function for classification is:

\[ L(Y, f(X)) = -\frac{1}{n}\sum_i \left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right] \qquad\text{(cross-entropy)} \]

The most common loss function for regression is:

\[ L(Y, f(X)) = \frac{1}{n}\sum_i (y_i - f(x_i))^2 \qquad\text{(mean squared error)} \]
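Both losses translate directly to code; the labels and predictions below are made-up values for illustration only:

```python
import math

def cross_entropy(y, p):
    # binary cross-entropy: y_i in {0, 1}, p_i = predicted P(y_i = 1)
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n

def mse(y, f):
    # mean squared error for regression
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

y_class = [1, 0, 1];  p = [0.9, 0.2, 0.8]
y_reg   = [1.0, 2.0]; f = [1.5, 1.5]
print(cross_entropy(y_class, p))   # small when predictions match labels
print(mse(y_reg, f))               # 0.25
```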

Bottou, Léon. 1998. “Online Learning and Stochastic Approximations.” *On-Line Learning in Neural Networks* 17 (9): 142.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” *Journal of Machine Learning Research* 12 (7).

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. *Deep Learning*. MIT Press.

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” *Neural Networks* 2 (5): 359–66.

Kingma, Diederik P, and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” *arXiv Preprint arXiv:1412.6980*.

Lu, Zhou, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. 2017. “The Expressive Power of Neural Networks: A View from the Width.” *Advances in Neural Information Processing Systems* 30.

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” *Nature* 323 (6088): 533–36.