PSTAT197A/CMPSC190DD Fall 2022
UCSB
Before your section meetings this week:
Part of your assignment this week: fill out the midquarter self-evaluation.
The add code request form for capstones opens later this week. Think now about whether you’re interested and check the schedule.
What do you know (or have you heard) about neural networks?
what are they
what are they used for
any other info?
Consider an arbitrary statistical model with one response \(Y\) and three predictors \(x_1, x_2, x_3\).
A simple diagram of the model would look like this:
A model that maps predictors directly to the response has just two “layers”:
an input layer \(X\)
an output layer \(Y\) (or more accurately \(\mathbb{E}Y\))
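In code, this direct map is just a generalized linear model. A minimal NumPy sketch (the logistic link, coefficient values, and dimensions below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # input layer: x_1, x_2, x_3
beta = np.array([0.5, -1.0, 2.0])     # one coefficient per predictor (hypothetical values)

# output layer: E[Y] modeled directly as a function of X, here through a logistic link
expected_y = 1 / (1 + np.exp(-(X @ beta)))
```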
Neural networks add layers between the input and output.
one input layer
one hidden layer
one output layer
one parameter per edge
Let \(Y\in\mathbb{R}^n\) and \(X \in \mathbb{R}^{n\times p}\) represent some data. The vanilla neural network is:
\[\begin{aligned} \color{#eed5b7}{\mathbb{E}Y} &= \sigma_z(\color{#7ac5cd}{Z}\color{#8b3e2f}{\beta}) \qquad &\text{output layer}\\ \color{#7ac5cd}{Z} &= \left[\sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_1}) \;\cdots\; \sigma_x(\color{#66cdaa}{X}\color{#8b3e2f}{\alpha_M})\right] \qquad &\text{hidden layer} \\ \color{#66cdaa}{X} &= \left[x_1 \;\cdots\; x_p\right] \qquad&\text{input layer} \end{aligned}\]

Notice that the output is simply a long composition:
\[ Y = f(X) \quad\text{where}\quad f \equiv \sigma_z \circ h_\beta\circ \sigma_x \circ h_\alpha \]
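As a concrete sketch, the composition can be written directly in NumPy. The dimensions, random parameter values, and the use of a sigmoid for both \(\sigma_x\) and \(\sigma_z\) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 100, 3, 4                   # observations, predictors, hidden units (arbitrary)
X = rng.normal(size=(n, p))           # input layer

alpha = rng.normal(size=(p, M))       # columns are alpha_1, ..., alpha_M (input -> hidden)
beta = rng.normal(size=(M, 1))        # hidden -> output weights

sigmoid = lambda t: 1 / (1 + np.exp(-t))

# the long composition sigma_z o h_beta o sigma_x o h_alpha
Z = sigmoid(X @ alpha)                # hidden layer, n x M
y_hat = sigmoid(Z @ beta)             # output layer, estimate of E[Y]
```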
each function is either known or linear
compute parameters by minimizing a loss function
minimization by gradient descent
Denoting the parameter vector by \(\theta = \left(\alpha^T \; \beta^T\right)\), initialize \(\theta^{(0)}\) and repeat:
\[ \theta^{(r + 1)} \longleftarrow \theta^{(r)} - c_r \nabla L^{(r)} \]
\(L^{(r)}\) is a loss function evaluated at the \(r\)th iteration
\(c_r\) is the ‘learning rate’; can be fixed or chosen adaptively
each full pass through the training data is one ‘epoch’
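For intuition, here is a minimal gradient descent loop on a toy least-squares problem; the simulated data, fixed learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)   # toy data

theta = np.zeros(3)                         # theta^(0)
c = 0.1                                     # fixed learning rate c_r

for r in range(200):                        # each pass over the full data set is one epoch
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of the mean squared error loss
    theta = theta - c * grad                # theta^(r+1) <- theta^(r) - c_r * gradient
```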
Individual parameter updates at the \(r\)th iteration are given by:
\[ \beta_{m}^{(r + 1)} \longleftarrow \beta_{m}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \beta_{m}}\Big\rvert_{\beta_{m} = \beta_{m}^{(r)}}}_{\text{gradient at current iteration}} \\ \alpha_{mp}^{(r + 1)} \longleftarrow \alpha_{mp}^{(r)} - c_r \underbrace{\frac{1}{n}\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha_{mp}}\Big\rvert_{\alpha_{mp} = \alpha_{mp}^{(r)}}}_{\text{gradient at current iteration}} \]
The gradient is easy to compute. Denoting \(t_{i} = z_{i}^T\beta\):
\[ \begin{aligned} \frac{\partial L_i}{\partial \alpha_{mp}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial z_{im}} \frac{\partial z_{im}}{\partial \alpha_{mp}}}_{s_{im}x_{ip}} \\ \frac{\partial L_i}{\partial \beta_{m}} &= \underbrace{\frac{\partial L_i}{\partial f} \frac{\partial f}{\partial t_i}}_{\delta_i} \underbrace{\frac{\partial t_i}{\partial \beta_{m}}}_{z_{im}} \end{aligned} \]
Explicitly computing gradients for each update gives the backpropagation algorithm of Rumelhart, Hinton, and Williams (1986).
Initialize parameters and repeat:
Forward pass: compute \(f(X), Z\)
Backward pass: compute \(\delta_i, s_{im}\) by ‘back-propagating’ current estimates
Update the weights
\[
\begin{aligned}
\hat{\beta}_{m} &\longleftarrow
\hat{\beta}_{m} - c_r \frac{1}{n}\sum_i \delta_{i}z_{im} \\
\hat{\alpha}_{mp} &\longleftarrow
\hat{\alpha}_{mp} - c_r \frac{1}{n}\sum_i \delta_i s_{im}x_{ip}
\end{aligned}
\]
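A from-scratch sketch of one full training loop for the single-hidden-layer model, assuming sigmoid activations and a squared-error loss (both are illustrative choices; the notes leave \(\sigma\) and \(L\) generic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M = 100, 3, 4
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=(n, 1)).astype(float)    # hypothetical binary response

sigmoid = lambda t: 1 / (1 + np.exp(-t))
alpha = 0.1 * rng.normal(size=(p, M))                # input -> hidden weights
beta = 0.1 * rng.normal(size=(M, 1))                 # hidden -> output weights
c = 0.5                                              # learning rate

for epoch in range(100):
    # forward pass: compute Z and f(X)
    Z = sigmoid(X @ alpha)                           # n x M
    f = sigmoid(Z @ beta)                            # n x 1

    # backward pass with L_i = (y_i - f_i)^2:
    # delta_i = dL_i/df_i * df_i/dt_i
    delta = -2 * (y - f) * f * (1 - f)               # n x 1
    # entries delta_i * s_im, where s_im = beta_m * sigma'(x_i^T alpha_m)
    S = (delta @ beta.T) * Z * (1 - Z)               # n x M

    # update the weights with the averaged gradients
    beta = beta - c * (Z.T @ delta) / n              # (1/n) sum_i delta_i z_im
    alpha = alpha - c * (X.T @ S) / n                # (1/n) sum_i delta_i s_im x_ip
```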
Computing the gradient exactly requires summing over all observations \(i = 1, \dots, n\):
\[ g = \nabla \frac{1}{n} \sum_i L_i \]
It’s much faster to estimate the gradient based on a “batch” of \(m\) observations (subsample) \(J \subset \{1, \dots, n\}\):
\[ \hat{g} = \nabla \frac{1}{m}\sum_{i \in J} L_i \]
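A sketch of stochastic (minibatch) gradient descent for the same toy least-squares problem as above; the batch size and number of steps are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 32                                  # n observations, batch size m
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

theta = np.zeros(3)
c = 0.1

for step in range(500):
    J = rng.choice(n, size=m, replace=False)     # random batch J, a subset of {1, ..., n}
    Xb, yb = X[J], y[J]
    g_hat = Xb.T @ (Xb @ theta - yb) / m         # gradient estimated from the batch only
    theta = theta - c * g_hat
```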
Modern methods for training neural networks update parameters using gradient estimates and adaptive learning rates.
One more hidden unit.
One more hidden layer.
Networks of arbitrary width and depth in which the connectivity is uni-directional are known as “sequential” or “feedforward” networks/models.
\[ \begin{aligned} \mathbb{E}Y &= \sigma_1(Z_1\beta_1) &\text{output layer}\\ Z_k &= \sigma_{k + 1}(Z_{k + 1} \beta_{k + 1}) &\text{hidden layers } k = 1, \dots, D - 1\\ Z_D &\equiv X &\text{input layer} \end{aligned} \]
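In practice, feedforward models like this are usually specified with a deep learning library rather than coded by hand. A minimal sketch assuming TensorFlow/Keras is installed; the layer widths, activations, and the Adam optimizer are illustrative choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

# sequential (feedforward) model: p = 3 inputs, two hidden layers, one output unit
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(3,)),    # hidden layer
    layers.Dense(4, activation='relu'),                      # hidden layer
    layers.Dense(1, activation='sigmoid'),                   # output layer, estimates E[Y]
])

# adaptive-learning-rate optimizer and a loss appropriate for a binary response
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```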
Suppose:
\(\mathbb{E}Y = f(X)\) gives the ‘true’ relationship
\(\tilde{f}(X)\) represents the output layer of a feedforward neural network with one hidden layer of width \(w\)
Hornik, Stinchcombe, and White (1989) showed that, under some regularity conditions, for any \(\epsilon > 0\) there exists a width \(w\) and parameters such that:
\[ \sup_x \|f(x) - \tilde{f}(x)\| < \epsilon \]
Similar results exist for deep networks with bounded width.
These results do tell us that in most problems there exist both deep and shallow networks that approximate the true input-output relationship arbitrarily well.
They don’t tell us how to find them.
Several factors can affect actual performance in practice:
Activation functions \(\sigma(\cdot)\) determine whether a given unit ‘fires’.
For example:
if \(Z_{k - 1}\beta_{kj} = -28.2\) and \(\sigma_k(x) = \frac{1}{1 + e^{-x}}\),
then \(z_{kj} = \sigma_k(Z_{k - 1}\beta_{kj}) \approx 0\).
The most common activation functions are:
(identity) \(\sigma(x) = x\)
(sigmoid) \(\sigma(x) = \frac{1}{1 + \exp\{-x\}}\)
(hyperbolic tangent) \(\sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
(rectified linear unit) \(\sigma(x) = \max (0, x)\)
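Each of these is a one-liner in NumPy; evaluating the sigmoid at the value from the earlier example shows how a strongly negative input keeps a unit from firing:

```python
import numpy as np

identity = lambda x: x
sigmoid = lambda x: 1 / (1 + np.exp(-x))
tanh = np.tanh                                 # (e^x - e^-x) / (e^x + e^-x)
relu = lambda x: np.maximum(0.0, x)

# a strongly negative input saturates the sigmoid, so the unit barely 'fires'
sigmoid(-28.2)                                 # about 5.7e-13, effectively zero
```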
The most common loss function for classification is:
\[ L(Y, f(X)) = -\frac{1}{n}\sum_i \left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right] \qquad\text{(cross-entropy)} \]
where \(p_i = f(x_i)\) is the predicted probability that \(y_i = 1\).
The most common loss function for regression is:
\[ L(Y, f(X)) = \frac{1}{n}\sum_i (y_i - f(x_i))^2 \qquad\text{(mean squared error)} \]
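Both losses are straightforward to compute directly; a NumPy sketch with hypothetical values:

```python
import numpy as np

def cross_entropy(y, p):
    """Binary cross-entropy: y in {0, 1}, p = predicted probability f(x_i)."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_squared_error(y, f):
    """Squared-error loss for regression."""
    return np.mean((y - f) ** 2)

# toy check with hypothetical predictions
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
cross_entropy(y, p)          # about 0.228
mean_squared_error(y, p)     # about 0.047
```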