Building a forecasting model

PSTAT197A/CMPSC190DD Fall 2022

Trevor Ruiz

UCSB

From last time

Pooled data from all sites with at least a year of continuous observation between 2017 and 2020 at 0.2m depth
Modeled seasonal trend based on elevation and day of the year using curve fitting techniques

Seasonal trend model

From last time: using 4 Fourier bases and elevation.

\(Y_{i, t} = \beta_0 + \beta_1 \text{elev}_i + \sum_{j = 1}^4 \gamma_j \phi_j(t) + \epsilon_{i, t}\)

A small adjustment

Adding elevation x seasonality interaction terms.

\(Y_{i, t} = \beta_0 + \beta_1 \text{elev}_i + \sum_{j = 1}^4 \left(\gamma_j \phi_j(t) + \color{maroon}{\delta_j \text{elev}\times\phi_j(t)}\right) + \epsilon_{i, t}\)

Seasonal forecast

Seasonal forecasts ignore recent data.

10-day forecast based on seasonal mean only.

Forecasting error

This leads to greater error.

Seasonal forecast vs. observed temperature.

A better forecast

Idea: the present temperature at time \(t\) contains useful information about the expected temperature at time \(t + 1\).

Our model for site \(i\) at time \(t\) is:

\[ \underbrace{Y_{i, t}}_{\text{temperature}} = \underbrace{\mu(i, t)}_{\text{seasonal mean}} + \underbrace{\epsilon_{i, t}}_{\text{random deviation}} \]

The conditional mean forecast \(\mathbb{E}(Y_{i, t + 1}| Y_{i, t})\) should be better than the seasonal forecast \(\mathbb{E}Y_{i, t + 1} = \mu(i, t + 1)\)…

… because it incorporates information about the recent past.

Conditional mean forecast

The conditional mean of the next temp given the present is:

\[ \begin{aligned} \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) &= \mathbb{E}[\underbrace{\mu(i, t + 1)}_{\text{nonrandom}}|Y_{i, t} = y_{i, t}] + \mathbb{E}(\epsilon_{i, t + 1}|Y_{i, t} = y_{i, t}) \\\\ &= \mu(i, t + 1) + \mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = \underbrace{y_{i, t} - \mu(i, t)}_{\text{residual}}) \\\\ &= \underbrace{\mu(i, t + 1)}_{\text{seasonal forecast}} + \underbrace{\mathbb{E}(\epsilon_{i, t + 1}|\epsilon_{i, t} = e_{i, t})}_{\text{forecasted deviation}} \end{aligned} \]

Modeling the residuals

To forecast the deviation from the seasonal mean, we should model the residuals.

\[ e_{i, t} = Y_{i, t} - \hat{\mu}(i, t) \qquad\text{(residual)} \]

Residual autocorrelation

Residuals are strongly correlated with their immediately previous value.

This is called autocorrelation (think: self-correlation).

An intuitive approach: SLR

(1) Lag the residuals: \(\texttt{resid} = e_{i, t}\) and \(\texttt{resid.lag} = e_{i, t - 1}\)

# A tibble: 6 × 5
# Groups:   site [1]
  site   date        temp resid resid.lag
  <chr>  <date>     <dbl> <dbl>     <dbl>
1 B21K-1 2017-08-15  3.72 -6.18     -6.14
2 B21K-1 2017-08-16  3.31 -6.52     -6.18
3 B21K-1 2017-08-17  4.95 -4.81     -6.52
4 B21K-1 2017-08-18  5.46 -4.24     -4.81
5 B21K-1 2017-08-19  5.80 -3.82     -4.24
6 B21K-1 2017-08-20  5.68 -3.87     -3.82

(2) Fit SLR at one time lag: \(e_{i, t} = \beta_0 + \beta_1 e_{i, t - 1} + \xi_{i, t}\).

fit_resid <- lm(resid ~ resid.lag - 1, data = resid_df)

\(\texttt{pred.mean} = \hat{\mu}(i, t)\)

# A tibble: 6 × 4
  date        elev  temp pred.mean
  <date>     <dbl> <dbl>     <dbl>
1 2019-03-12   396 -2.85     -6.83
2 2019-03-13   396 -2.86     -6.76
3 2019-03-14   396 -2.92     -6.69
4 2019-03-15   396 -3.16     -6.62
5 2019-03-16   396 -3.34     -6.55
6 2019-03-17   396 -3.28     -6.47

\(\texttt{pred.resid} = \hat{e}_{i, t} = \mathbb{E}(e_{i, t}|e_{i, t - 1})\)

# A tibble: 6 × 6
  date        elev  temp pred.mean resid pred.resid
  <date>     <dbl> <dbl>     <dbl> <dbl>      <dbl>
1 2019-03-12   396 -2.85     -6.83  3.98      NA   
2 2019-03-13   396 -2.86     -6.76  3.90       3.90
3 2019-03-14   396 -2.92     -6.69  3.77       3.83
4 2019-03-15   396 -3.16     -6.62  3.46       3.69
5 2019-03-16   396 -3.34     -6.55  3.20       3.39
6 2019-03-17   396 -3.28     -6.47  3.18       3.14

\(\texttt{pred} = \texttt{pred.mean} + \texttt{pred.resid} = \hat{\mu}(i, t) + \hat{e}_{i, t}\)

# A tibble: 6 × 7
  date        elev  temp pred.mean resid pred.resid  pred
  <date>     <dbl> <dbl>     <dbl> <dbl>      <dbl> <dbl>
1 2019-03-12   396 -2.85     -6.83  3.98      NA    NA   
2 2019-03-13   396 -2.86     -6.76  3.90       3.90 -2.86
3 2019-03-14   396 -2.92     -6.69  3.77       3.83 -2.86
4 2019-03-15   396 -3.16     -6.62  3.46       3.69 -2.93
5 2019-03-16   396 -3.34     -6.55  3.20       3.39 -3.15
6 2019-03-17   396 -3.28     -6.47  3.18       3.14 -3.33

One-step forecasts

Seasonal forecast (blue curve), residual forecasts (red lines), one-step forecasts (dotted line), and observed temperatures (solid black line).

Forecasting error

One the site we’ve been examining, the one-step forecasting error is:

.metric	.estimator	.estimate
ccc	standard	0.9981049
msd	standard	0.0536997
rmse	standard	0.3494716

ccc is the concordance correlation coefficient
msd is the mean signed deviation
rmse is the root mean squared error

Average forecasting error across sites

Means and standard deviations across all 29 sites of error metrics for one-step forecasts computed across entire observation window:

.metric	average	sd	min	max	n
ccc	0.9949698	0.0032392	0.9869467	0.9993684	29
msd	-0.0062058	0.0510766	0.0005584	0.1186212	29
rmse	0.6460647	0.3593531	0.1801040	1.4875005	29

One-step forecasts (mathematically)

The one-step forecasts are the predicted conditional means at the next time step given the present :

\[ \hat{Y}_{i, t} = \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = \color{blue}{y_{i, t}}) \]

Conditional expectation gives optimal prediction under squared error loss (assuming the model is correct).

According to our model:

\[ \hat{Y}_{i, t + 1} = \hat{\mu}(i, t + 1) + \hat{\alpha}_0 + \hat{\alpha}_1 \left(\color{blue}{y_{i, t}} - \hat{\mu}(i, t)\right) \]

Multi-step forecasts

Multistep forecasts must be computed recursively:

\[ \begin{aligned} \color{maroon}{\hat{Y}_{i, t + 1}} &= \mathbb{E}(Y_{i, t + 1}|Y_{i, t} = y_{i, t}) \\ \color{teal}{\hat{Y}_{i, t + 2}} &= \mathbb{E}(Y_{i, t + 2}|Y_{i, t + 1} = \color{maroon}{\hat{Y}_{i, t + 1}}) \\ \hat{Y}_{i, t + 3} &= \mathbb{E}(Y_{i, t + 3}|Y_{i, t + 2} = \color{teal}{\hat{Y}_{i, t + 2}}) \\ &\vdots \end{aligned} \]

What do you think will happen the farther out we forecast??

Multistep forecasts on one site

Comments

This approach pooled data across sites to estimate model quantities.

seasonal mean (using Fourier basis approximation)
residual autocorrelation at one lag (using SLR)

Works pretty well for one-step forecasts; not very well for longer-term forecasts.

Time series models

Site-specific approach

An alternative to pooling data together to estimate the seasonal trend and residual autocorrelation is to do so individually for every site.

upshot: more flexibility on approaches; can use time series techniques
downside: many models \(\longrightarrow\) more total uncertainty

For now we’ll leave the seasonality alone and revise the residual autocorrelation approach.

Autoregression

An autoregressive model of order \(D\) is

\[ X_t = \nu + \alpha_1 X_{t - 1} + \cdots + \alpha_D X_{t - D} + \epsilon_t \]

‘innovations’ \(\epsilon_t\) are \(iid\) with mean zero
process mean linear in \(D\) lags
\(\mathbb{E}X_t\) and \(\text{var}X_t\) are constant in time (‘weak stationarity’)

Technical asides

About the AR parameters:

constraints on \(\alpha_j\)’s needed for a well-defined process in infinite time
estimates \(\hat{\alpha}_j\) found by:
- (moment estimator) solving a recursive system of equations (known as the Yule-Walker equations)
- (mle) maximum likelihood assuming \(\epsilon_{t} \sim N(0, \sigma^2)\)

Site-specific model

So let’s revise:

\[ \begin{aligned} Y_{i, t} &= f_i (t) + \epsilon_{i, t} \quad\text{(nonlinear regression)} \\ \epsilon_{i, t} &= \sum_{d = 1}^D \alpha_{i,d}\epsilon_{i, t - d} + \xi_{i, t} \quad\text{(AR(D) errors)} \end{aligned} \]

Note:

seasonal mean \(f_i(t)\) is site-dependent (hence subscript)
AR process is site-dependent (hence \(\alpha_{i, d}\) subscript)
no elevation included, since it is constant for each site

Fit comparison

AR fit
Comparision of estimates

fit_ar2 <- arima(y_train, 
      order = c(2, 0, 0), 
      xreg = x_train, 
      include.mean = T, 
      method = 'ML')

tidy(fit_ar2) %>% knitr::kable()

term	estimate	std.error
ar1	1.5389488	0.0310331
ar2	-0.5872016	0.0310424
intercept	1.7120661	0.2661139
sin1	-57.4478541	4.9296869
cos1	-97.6093126	5.1741646
sin2	9.5264551	4.9135875
cos2	15.0945042	5.0136827

term	estimate.ar	estimate.pooled
ar1	1.5389488	NA
ar2	-0.5872016	NA
intercept	1.7120661	1.7428057
sin1	-57.4478541	-60.4797059
cos1	-97.6093126	-84.9711125
sin2	9.5264551	7.0360375
cos2	15.0945042	15.0271089
elev	NA	-0.0038422
elevsin1	NA	0.0094721
elevsin2	NA	0.0063869
elevcos1	NA	-0.0627032
elevcos2	NA	-0.0032467
residlag	NA	0.9804500

Forecast comparison

Next time

Spatial prediction…

based on observations
based on forecasts