PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

Pooled data from all sites with at least a year of continuous observation between 2017 and 2020, at 0.2 m depth.

Modeled the seasonal trend based on elevation and day of the year using curve-fitting techniques.

From last time: using 4 Fourier bases and elevation.

Adding elevation x seasonality interaction terms.
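As a reminder, here is a minimal sketch of what that model looks like in code, assuming a pooled data frame `soil` with columns `date`, `elev`, and `temp` (all names here are stand-ins) and a 365-day period:

```
library(dplyr)

# two harmonics of the annual cycle: 4 Fourier basis columns
soil <- soil %>%
  mutate(doy  = as.numeric(format(date, "%j")),
         sin1 = sin(2 * pi * doy / 365), cos1 = cos(2 * pi * doy / 365),
         sin2 = sin(4 * pi * doy / 365), cos2 = cos(4 * pi * doy / 365))

# elevation x seasonality interactions via the * operator in the formula
fit_pooled <- lm(temp ~ elev * (sin1 + cos1 + sin2 + cos2), data = soil)
```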

Seasonal forecasts ignore recent data.

This leads to greater error than a forecast that uses it.

Idea: the present temperature at time \(t\) contains useful information about the expected temperature at time \(t + 1\).

Our model for site \(i\) at time \(t\) is:

\[ \underbrace{Y_{i, t}}_{\text{temperature}} = \underbrace{\mu(i, t)}_{\text{seasonal mean}} + \underbrace{\epsilon_{i, t}}_{\text{random deviation}} \]

The **conditional mean forecast** \(\mathbb{E}(Y_{i, t + 1}| Y_{i, t})\) should be better than the seasonal forecast \(\mathbb{E}Y_{i, t + 1} = \mu(i, t + 1)\)…

… because it incorporates information about the recent past.

The conditional mean of the next temperature given the present one is:

\[ \begin{aligned} \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) &= \mathbb{E}[\underbrace{\mu(i, t + 1)}_{\text{nonrandom}}|Y_{i, t} = y_{i, t}] + \mathbb{E}(\epsilon_{i, t + 1}|Y_{i, t} = y_{i, t}) \\ &= \mu(i, t + 1) + \mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = \underbrace{y_{i, t} - \mu(i, t)}_{\text{residual}}) \\ &= \underbrace{\mu(i, t + 1)}_{\text{seasonal forecast}} + \underbrace{\mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = e_{i, t})}_{\text{forecasted deviation}} \end{aligned} \]

To forecast the deviation from the seasonal mean, we should model the residuals.

\[ e_{i, t} = Y_{i, t} - \hat{\mu}(i, t) \qquad\text{(residual)} \]

Residuals are strongly correlated with their immediately previous value.

This is called **autocorrelation** (think: self-correlation).

(1) Lag the residuals: \(\texttt{resid} = e_{i, t}\) and \(\texttt{resid.lag} = e_{i, t - 1}\)

```
# A tibble: 6 × 5
# Groups: site [1]
site date temp resid resid.lag
<chr> <date> <dbl> <dbl> <dbl>
1 B21K-1 2017-08-15 3.72 -6.18 -6.14
2 B21K-1 2017-08-16 3.31 -6.52 -6.18
3 B21K-1 2017-08-17 4.95 -4.81 -6.52
4 B21K-1 2017-08-18 5.46 -4.24 -4.81
5 B21K-1 2017-08-19 5.80 -3.82 -4.24
6 B21K-1 2017-08-20 5.68 -3.87 -3.82
```
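A minimal sketch of this lagging step with dplyr, assuming a data frame `dat` that already has the fitted seasonal mean joined in as `pred.mean` (the names mirror the output above):

```
library(dplyr)

# residuals from the fitted seasonal mean, lagged within each site
dat <- dat %>%
  group_by(site) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(resid     = temp - pred.mean,
         resid.lag = lag(resid)) %>%
  ungroup()
```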

(2) Compute the seasonal forecast: \(\texttt{pred.mean} = \hat{\mu}(i, t)\)

```
# A tibble: 6 × 4
date elev temp pred.mean
<date> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83
2 2019-03-13 396 -2.86 -6.76
3 2019-03-14 396 -2.92 -6.69
4 2019-03-15 396 -3.16 -6.62
5 2019-03-16 396 -3.34 -6.55
6 2019-03-17 396 -3.28 -6.47
```

(3) Forecast the deviation from its lagged value: \(\texttt{pred.resid} = \hat{e}_{i, t} = \mathbb{E}(e_{i, t}|e_{i, t - 1})\)

```
# A tibble: 6 × 6
date elev temp pred.mean resid pred.resid
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83 3.98 NA
2 2019-03-13 396 -2.86 -6.76 3.90 3.90
3 2019-03-14 396 -2.92 -6.69 3.77 3.83
4 2019-03-15 396 -3.16 -6.62 3.46 3.69
5 2019-03-16 396 -3.34 -6.55 3.20 3.39
6 2019-03-17 396 -3.28 -6.47 3.18 3.14
```
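The forecasted deviation comes from a simple linear regression of \(\texttt{resid}\) on \(\texttt{resid.lag}\); a sketch with the same assumed names (the first row gets `NA` because its lag is missing):

```
# simple linear regression of the residual on its lagged value
fit_resid <- lm(resid ~ resid.lag, data = dat)

# fitted conditional means E(e_t | e_{t-1}); rows with a missing lag stay NA
dat <- dat %>%
  mutate(pred.resid = predict(fit_resid, newdata = dat))
```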

(4) Combine the two: \(\texttt{pred} = \texttt{pred.mean} + \texttt{pred.resid} = \hat{\mu}(i, t) + \hat{e}_{i, t}\)

```
# A tibble: 6 × 7
date elev temp pred.mean resid pred.resid pred
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-12 396 -2.85 -6.83 3.98 NA NA
2 2019-03-13 396 -2.86 -6.76 3.90 3.90 -2.86
3 2019-03-14 396 -2.92 -6.69 3.77 3.83 -2.86
4 2019-03-15 396 -3.16 -6.62 3.46 3.69 -2.93
5 2019-03-16 396 -3.34 -6.55 3.20 3.39 -3.15
6 2019-03-17 396 -3.28 -6.47 3.18 3.14 -3.33
```
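Combining the two pieces is then a one-liner (same assumed names):

```
# one-step forecast: seasonal mean plus forecasted deviation
dat <- dat %>%
  mutate(pred = pred.mean + pred.resid)
```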

On the site we’ve been examining, the one-step forecasting error is:

.metric | .estimator | .estimate |
---|---|---|
ccc | standard | 0.9981049 |
msd | standard | 0.0536997 |
rmse | standard | 0.3494716 |

- `ccc` is the *concordance correlation coefficient*
- `msd` is the *mean signed deviation*
- `rmse` is the *root mean squared error*
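The `.metric`/`.estimator`/`.estimate` layout above matches the output of the yardstick package; a sketch of computing these metrics that way, with the assumed `dat` from before:

```
library(yardstick)

# concordance correlation, mean signed deviation, root mean squared error
forecast_metrics <- metric_set(ccc, msd, rmse)

dat %>% forecast_metrics(truth = temp, estimate = pred)
```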

Means and standard deviations across all 29 sites of the error metrics for one-step forecasts, computed over the entire observation window:

.metric | average | sd | min | max | n |
---|---|---|---|---|---|
ccc | 0.9949698 | 0.0032392 | 0.9869467 | 0.9993684 | 29 |
msd | -0.0062058 | 0.0510766 | 0.0005584 | 0.1186212 | 29 |
rmse | 0.6460647 | 0.3593531 | 0.1801040 | 1.4875005 | 29 |

The one-step forecasts are the predicted conditional means at the next time step given the present:

\[ \hat{Y}_{i, t + 1} = \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = \color{blue}{y_{i, t}}) \]

Conditional expectation gives optimal prediction under squared error loss (assuming the model is correct).

According to our model, with \(\hat{\alpha}_0\) and \(\hat{\alpha}_1\) the estimated intercept and slope of the residual regression:

\[ \hat{Y}_{i, t + 1} = \hat{\mu}(i, t + 1) + \hat{\alpha}_0 + \hat{\alpha}_1 \left(\color{blue}{y_{i, t}} - \hat{\mu}(i, t)\right) \]

Multistep forecasts must be computed recursively:

\[ \begin{aligned} \color{maroon}{\hat{Y}_{i, t + 1}} &= \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) \\ \color{teal}{\hat{Y}_{i, t + 2}} &= \mathbb{E}(Y_{i, t + 2}|Y_{i, t + 1} = \color{maroon}{\hat{Y}_{i, t + 1}}) \\ \hat{Y}_{i, t + 3} &= \mathbb{E}(Y_{i, t + 3}|Y_{i, t + 2} = \color{teal}{\hat{Y}_{i, t + 2}}) \\ &\vdots \end{aligned} \]
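A sketch of that recursion for the deviations, feeding each forecast back in as if it were observed; `a0`, `a1`, `e_t`, and `mu_hat` are assumed to come from the fitted model:

```
# recursive multistep forecast of the deviations
h      <- 14           # forecast horizon (days)
e_hat  <- numeric(h)
e_prev <- e_t          # last observed residual e_{i, t}
for (k in seq_len(h)) {
  e_hat[k] <- a0 + a1 * e_prev  # forecasted deviation at t + k
  e_prev   <- e_hat[k]          # plug the forecast in as the next lag
}

# add back the seasonal means mu_hat(i, t + 1), ..., mu_hat(i, t + h)
y_hat <- mu_hat + e_hat
```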

What do you think will happen the farther out we forecast?

An alternative to pooling data together to estimate the seasonal trend and residual autocorrelation is to do so *individually* for every site.

- upshot: more flexibility of approach; can use time series techniques
- downside: many models \(\longrightarrow\) more total uncertainty

For now we’ll leave the seasonality alone and revise the residual autocorrelation approach.

An *autoregressive* model of order \(D\) is

\[ X_t = \nu + \alpha_1 X_{t - 1} + \cdots + \alpha_D X_{t - D} + \epsilon_t \]

- the ‘innovations’ \(\epsilon_t\) are \(iid\) with mean zero
- the process mean is linear in the \(D\) lags
- \(\mathbb{E}X_t\) and \(\text{var}X_t\) are constant in time (‘weak stationarity’)

About the AR parameters:

- constraints on the \(\alpha_j\)’s are needed for a well-defined process in infinite time
- estimates \(\hat{\alpha}_j\) are found by either (see the sketch below):
  - (moment estimator) solving a recursive system of equations known as the *Yule-Walker* equations
  - (MLE) maximizing the likelihood assuming \(\epsilon_t \sim N(0, \sigma^2)\)
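Both estimators are available in base R through `stats::ar`; a quick illustration on a simulated AR(2) series with arbitrary (stationary) coefficients:

```
# simulate a stationary AR(2) process
set.seed(1)
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)

# Yule-Walker (moment) estimates vs. Gaussian maximum likelihood
fit_yw  <- ar(x, order.max = 2, aic = FALSE, method = "yule-walker")
fit_mle <- ar(x, order.max = 2, aic = FALSE, method = "mle")
fit_yw$ar
fit_mle$ar
```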

So let’s revise:

\[ \begin{aligned} Y_{i, t} &= f_i (t) + \epsilon_{i, t} \quad\text{(nonlinear regression)} \\ \epsilon_{i, t} &= \sum_{d = 1}^D \alpha_{i,d}\epsilon_{i, t - d} + \xi_{i, t} \quad\text{(AR(D) errors)} \end{aligned} \]

*Note*:

- the seasonal mean \(f_i(t)\) is site-dependent (hence the subscript)
- the AR process is site-dependent (hence the \(\alpha_{i, d}\) subscript)
- no elevation term is included, since elevation is constant within a site

```
# fit the seasonal regression with AR(2) errors by maximum likelihood
fit_ar2 <- arima(y_train,
                 order = c(2, 0, 0),   # AR order 2, no differencing, no MA terms
                 xreg = x_train,       # Fourier basis regressors
                 include.mean = TRUE,
                 method = 'ML')

# tidy coefficient table
tidy(fit_ar2) %>% knitr::kable()
```

term | estimate | std.error |
---|---|---|
ar1 | 1.5389488 | 0.0310331 |
ar2 | -0.5872016 | 0.0310424 |
intercept | 1.7120661 | 0.2661139 |
sin1 | -57.4478541 | 4.9296869 |
cos1 | -97.6093126 | 5.1741646 |
sin2 | 9.5264551 | 4.9135875 |
cos2 | 15.0945042 | 5.0136827 |
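Forecasting from this fit requires the future values of the regressors; a sketch with `stats::predict`, where `x_test` (the Fourier terms over the forecast window) is an assumption:

```
# multistep forecasts: supply the regressors for the forecast horizon
fc <- predict(fit_ar2, n.ahead = 14, newxreg = x_test)
fc$pred  # point forecasts; fc$se gives standard errors
```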

Comparing the site-specific AR fit with the pooled model from before:

term | estimate.ar | estimate.pooled |
---|---|---|
ar1 | 1.5389488 | NA |
ar2 | -0.5872016 | NA |
intercept | 1.7120661 | 1.7428057 |
sin1 | -57.4478541 | -60.4797059 |
cos1 | -97.6093126 | -84.9711125 |
sin2 | 9.5264551 | 7.0360375 |
cos2 | 15.0945042 | 15.0271089 |
elev | NA | -0.0038422 |
elevsin1 | NA | 0.0094721 |
elevsin2 | NA | 0.0063869 |
elevcos1 | NA | -0.0627032 |
elevcos2 | NA | -0.0032467 |
residlag | NA | 0.9804500 |

Spatial prediction…

- based on observations
- based on forecasts

## Comments

This approach pooled data across sites to estimate the model quantities:

- the seasonal mean (using a Fourier basis approximation)
- the residual autocorrelation at one lag (using simple linear regression)

It works pretty well for one-step forecasts, but not very well for longer-term forecasts.