PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

Pooled data from all sites with at least a year of continuous observation between 2017 and 2020, at 0.2 m depth.

Modeled the seasonal trend based on elevation and day of the year using curve-fitting techniques.

From last time: using 4 Fourier bases and elevation.

Adding elevation x seasonality interaction terms.
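As a reminder, here is a minimal sketch of what that model looks like in code, assuming a pooled data frame `soil` with columns `date`, `elev`, and `temp` (all names here are stand-ins) and a 365-day period:

```
library(dplyr)

# two harmonics of the annual cycle: 4 Fourier basis columns
soil <- soil %>%
  mutate(doy  = as.numeric(format(date, "%j")),
         sin1 = sin(2 * pi * doy / 365), cos1 = cos(2 * pi * doy / 365),
         sin2 = sin(4 * pi * doy / 365), cos2 = cos(4 * pi * doy / 365))

# elevation x seasonality interactions via the * operator in the formula
fit_pooled <- lm(temp ~ elev * (sin1 + cos1 + sin2 + cos2), data = soil)
```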

Seasonal forecasts ignore recent data.

This leads to greater error than a forecast that uses it.

Idea: the present temperature at time \(t\) contains useful information about the expected temperature at time \(t + 1\).

Our model for site \(i\) at time \(t\) is:

\[ \underbrace{Y_{i, t}}_{\text{temperature}} = \underbrace{\mu(i, t)}_{\text{seasonal mean}} + \underbrace{\epsilon_{i, t}}_{\text{random deviation}} \]

The **conditional mean forecast** \(\mathbb{E}(Y_{i, t + 1}| Y_{i, t})\) should be better than the seasonal forecast \(\mathbb{E}Y_{i, t + 1} = \mu(i, t + 1)\)…

… because it incorporates information about the recent past.

The conditional mean of the next temperature given the present one is:

\[ \begin{aligned} \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) &= \mathbb{E}[\underbrace{\mu(i, t + 1)}_{\text{nonrandom}}|Y_{i, t} = y_{i, t}] + \mathbb{E}(\epsilon_{i, t + 1}|Y_{i, t} = y_{i, t}) \\ &= \mu(i, t + 1) + \mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = \underbrace{y_{i, t} - \mu(i, t)}_{\text{residual}}) \\ &= \underbrace{\mu(i, t + 1)}_{\text{seasonal forecast}} + \underbrace{\mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = e_{i, t})}_{\text{forecasted deviation}} \end{aligned} \]

To forecast the deviation from the seasonal mean, we should model the residuals.

\[ e_{i, t} = Y_{i, t} - \hat{\mu}(i, t) \qquad\text{(residual)} \]

Residuals are strongly correlated with their immediately previous value.

This is called **autocorrelation** (think: self-correlation).

(1) Lag the residuals: \(\texttt{resid} = e_{i, t}\) and \(\texttt{resid.lag} = e_{i, t - 1}\)

```
# A tibble: 6 × 5
# Groups: site [1]
site date temp resid resid.lag
<chr> <date> <dbl> <dbl> <dbl>
1 B21K-1 2017-08-15 3.72 -6.18 -6.14
2 B21K-1 2017-08-16 3.31 -6.52 -6.18
3 B21K-1 2017-08-17 4.95 -4.81 -6.52
4 B21K-1 2017-08-18 5.46 -4.24 -4.81
5 B21K-1 2017-08-19 5.80 -3.82 -4.24
6 B21K-1 2017-08-20 5.68 -3.87 -3.82
```
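A minimal sketch of this lagging step with dplyr, assuming a data frame `dat` that already has the fitted seasonal mean joined in as `pred.mean` (the names mirror the output above):

```
library(dplyr)

# residuals from the fitted seasonal mean, lagged within each site
dat <- dat %>%
  group_by(site) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(resid     = temp - pred.mean,
         resid.lag = lag(resid)) %>%
  ungroup()
```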

(2) Compute the seasonal forecast: \(\texttt{pred.mean} = \hat{\mu}(i, t)\)

```
# A tibble: 6 × 4
date elev temp pred.mean
<date> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83
2 2019-03-13 396 -2.86 -6.76
3 2019-03-14 396 -2.92 -6.69
4 2019-03-15 396 -3.16 -6.62
5 2019-03-16 396 -3.34 -6.55
6 2019-03-17 396 -3.28 -6.47
```

(3) Forecast the deviation from its lagged value: \(\texttt{pred.resid} = \hat{e}_{i, t} = \mathbb{E}(e_{i, t}|e_{i, t - 1})\)

```
# A tibble: 6 × 6
date elev temp pred.mean resid pred.resid
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83 3.98 NA
2 2019-03-13 396 -2.86 -6.76 3.90 3.90
3 2019-03-14 396 -2.92 -6.69 3.77 3.83
4 2019-03-15 396 -3.16 -6.62 3.46 3.69
5 2019-03-16 396 -3.34 -6.55 3.20 3.39
6 2019-03-17 396 -3.28 -6.47 3.18 3.14
```
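The forecasted deviation comes from a simple linear regression of \(\texttt{resid}\) on \(\texttt{resid.lag}\); a sketch with the same assumed names (the first row gets `NA` because its lag is missing):

```
# simple linear regression of the residual on its lagged value
fit_resid <- lm(resid ~ resid.lag, data = dat)

# fitted conditional means E(e_t | e_{t-1}); rows with a missing lag stay NA
dat <- dat %>%
  mutate(pred.resid = predict(fit_resid, newdata = dat))
```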

(4) Combine the two: \(\texttt{pred} = \texttt{pred.mean} + \texttt{pred.resid} = \hat{\mu}(i, t) + \hat{e}_{i, t}\)

```
# A tibble: 6 × 7
date elev temp pred.mean resid pred.resid pred
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83 3.98 NA NA
2 2019-03-13 396 -2.86 -6.76 3.90 3.90 -2.86
3 2019-03-14 396 -2.92 -6.69 3.77 3.83 -2.86
4 2019-03-15 396 -3.16 -6.62 3.46 3.69 -2.93
5 2019-03-16 396 -3.34 -6.55 3.20 3.39 -3.15
6 2019-03-17 396 -3.28 -6.47 3.18 3.14 -3.33
```
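Combining the two pieces is then a one-liner (same assumed names):

```
# one-step forecast: seasonal mean plus forecasted deviation
dat <- dat %>%
  mutate(pred = pred.mean + pred.resid)
```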

On the site we’ve been examining, the one-step forecasting error is:

.metric | .estimator | .estimate |
---|---|---|
ccc | standard | 0.9981049 |
msd | standard | 0.0536997 |
rmse | standard | 0.3494716 |

- `ccc` is the *concordance correlation coefficient*
- `msd` is the *mean signed deviation*
- `rmse` is the *root mean squared error*
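The `.metric`/`.estimator`/`.estimate` layout above matches the output of the yardstick package; a sketch of computing these metrics that way, with the assumed `dat` from before:

```
library(yardstick)

# concordance correlation, mean signed deviation, root mean squared error
forecast_metrics <- metric_set(ccc, msd, rmse)

dat %>% forecast_metrics(truth = temp, estimate = pred)
```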

Means and standard deviations across all 29 sites of the error metrics for one-step forecasts, computed over the entire observation window:

.metric | average | sd | min | max | n |
---|---|---|---|---|---|
ccc | 0.9949698 | 0.0032392 | 0.9869467 | 0.9993684 | 29 |
msd | -0.0062058 | 0.0510766 | 0.0005584 | 0.1186212 | 29 |
rmse | 0.6460647 | 0.3593531 | 0.1801040 | 1.4875005 | 29 |

The one-step forecasts are the predicted conditional means at the next time step given the present:

\[ \hat{Y}_{i, t + 1} = \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = \color{blue}{y_{i, t}}) \]

Conditional expectation gives optimal prediction under squared error loss (assuming the model is correct).

According to our model, with \(\hat{\alpha}_0\) and \(\hat{\alpha}_1\) the estimated intercept and slope of the residual regression:

\[ \hat{Y}_{i, t + 1} = \hat{\mu}(i, t + 1) + \hat{\alpha}_0 + \hat{\alpha}_1 \left(\color{blue}{y_{i, t}} - \hat{\mu}(i, t)\right) \]

Multistep forecasts must be computed recursively:

\[ \begin{aligned} \color{maroon}{\hat{Y}_{i, t + 1}} &= \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) \\ \color{teal}{\hat{Y}_{i, t + 2}} &= \mathbb{E}(Y_{i, t + 2}|Y_{i, t + 1} = \color{maroon}{\hat{Y}_{i, t + 1}}) \\ \hat{Y}_{i, t + 3} &= \mathbb{E}(Y_{i, t + 3}|Y_{i, t + 2} = \color{teal}{\hat{Y}_{i, t + 2}}) \\ &\vdots \end{aligned} \]
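A sketch of that recursion for the deviations, feeding each forecast back in as if it were observed; `a0`, `a1`, `e_t`, and `mu_hat` are assumed to come from the fitted model:

```
# recursive multistep forecast of the deviations
h      <- 14           # forecast horizon (days)
e_hat  <- numeric(h)
e_prev <- e_t          # last observed residual e_{i, t}
for (k in seq_len(h)) {
  e_hat[k] <- a0 + a1 * e_prev  # forecasted deviation at t + k
  e_prev   <- e_hat[k]          # plug the forecast in as the next lag
}

# add back the seasonal means mu_hat(i, t + 1), ..., mu_hat(i, t + h)
y_hat <- mu_hat + e_hat
```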

What do you think will happen the farther out we forecast?

An alternative to pooling data together to estimate the seasonal trend and residual autocorrelation is to do so *individually* for every site.

- upshot: more flexibility of approach; can use time series techniques
- downside: many models \(\longrightarrow\) more total uncertainty

For now we’ll leave the seasonality alone and revise the residual autocorrelation approach.

An *autoregressive* model of order \(D\) is

\[ X_t = \nu + \alpha_1 X_{t - 1} + \cdots + \alpha_D X_{t - D} + \epsilon_t \]

- the ‘innovations’ \(\epsilon_t\) are \(iid\) with mean zero
- the process mean is linear in the \(D\) lags
- \(\mathbb{E}X_t\) and \(\text{var}X_t\) are constant in time (‘weak stationarity’)

About the AR parameters:

- constraints on the \(\alpha_j\)’s are needed for a well-defined process in infinite time
- estimates \(\hat{\alpha}_j\) are found by either (see the sketch below):
  - (moment estimator) solving a recursive system of equations known as the *Yule-Walker* equations
  - (MLE) maximizing the likelihood assuming \(\epsilon_t \sim N(0, \sigma^2)\)
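Both estimators are available in base R through `stats::ar`; a quick illustration on a simulated AR(2) series with arbitrary (stationary) coefficients:

```
# simulate a stationary AR(2) process
set.seed(1)
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)

# Yule-Walker (moment) estimates vs. Gaussian maximum likelihood
fit_yw  <- ar(x, order.max = 2, aic = FALSE, method = "yule-walker")
fit_mle <- ar(x, order.max = 2, aic = FALSE, method = "mle")
fit_yw$ar
fit_mle$ar
```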

So let’s revise:

\[ \begin{aligned} Y_{i, t} &= f_i (t) + \epsilon_{i, t} \quad\text{(nonlinear regression)} \\ \epsilon_{i, t} &= \sum_{d = 1}^D \alpha_{i,d}\epsilon_{i, t - d} + \xi_{i, t} \quad\text{(AR(D) errors)} \end{aligned} \]

*Note*:

- the seasonal mean \(f_i(t)\) is site-dependent (hence the subscript)
- the AR process is site-dependent (hence the \(\alpha_{i, d}\) subscript)
- no elevation term is included, since elevation is constant within a site

```
# fit the seasonal regression with AR(2) errors by maximum likelihood
fit_ar2 <- arima(y_train,
                 order = c(2, 0, 0),   # AR order 2, no differencing, no MA terms
                 xreg = x_train,       # Fourier basis regressors
                 include.mean = TRUE,
                 method = 'ML')

# tidy coefficient table
tidy(fit_ar2) %>% knitr::kable()
```

term | estimate | std.error |
---|---|---|
ar1 | 1.5389488 | 0.0310331 |
ar2 | -0.5872016 | 0.0310424 |
intercept | 1.7120661 | 0.2661139 |
sin1 | -57.4478541 | 4.9296869 |
cos1 | -97.6093126 | 5.1741646 |
sin2 | 9.5264551 | 4.9135875 |
cos2 | 15.0945042 | 5.0136827 |
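Forecasting from this fit requires the future values of the regressors; a sketch with `stats::predict`, where `x_test` (the Fourier terms over the forecast window) is an assumption:

```
# multistep forecasts: supply the regressors for the forecast horizon
fc <- predict(fit_ar2, n.ahead = 14, newxreg = x_test)
fc$pred  # point forecasts; fc$se gives standard errors
```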

Comparing the site-specific AR fit with the pooled model from before:

term | estimate.ar | estimate.pooled |
---|---|---|
ar1 | 1.5389488 | NA |
ar2 | -0.5872016 | NA |
intercept | 1.7120661 | 1.7428057 |
sin1 | -57.4478541 | -60.4797059 |
cos1 | -97.6093126 | -84.9711125 |
sin2 | 9.5264551 | 7.0360375 |
cos2 | 15.0945042 | 15.0271089 |
elev | NA | -0.0038422 |
elevsin1 | NA | 0.0094721 |
elevsin2 | NA | 0.0063869 |
elevcos1 | NA | -0.0627032 |
elevcos2 | NA | -0.0032467 |
residlag | NA | 0.9804500 |

Spatial prediction…

- based on observations
- based on forecasts

## Comments

This approach pooled data across sites to estimate the model quantities:

- the seasonal mean (using a Fourier basis approximation)
- the residual autocorrelation at one lag (using simple linear regression)

It works pretty well for one-step forecasts, but not very well for longer-term forecasts.