[Course] Modelling and Forecasting Financial Markets

Notes based on ECON5022 at the University of Glasgow.

General information

The course offers an introduction to modelling and forecasting financial time series. The first part of the course will be mainly devoted to analysing univariate models for the conditional mean and the conditional variance (ARMA and GARCH models). These models will be used to produce forecasts. Additional topics, e.g. multiple time series analysis and nonlinear models, may be discussed if time allows. The second part of the course will discuss forecast evaluation, aimed at monitoring and improving forecast performance. The course will be complemented by practical sessions using statistical or econometric software.

Aims

The main aims of the course are to introduce the basic models widely used to analyse and forecast financial time series, and to evaluate the forecasts produced using these models.

ILOs

By the end of this course, students will be able to:

  1. Select and fit the appropriate model to analyse financial time series.
  2. Derive the main properties of the models used to analyse and forecast financial time series.
  3. Produce optimal forecasts for a given information set and forecast horizon.
  4. Evaluate critically the forecasts.
  5. Model and predict financial time series using statistical/econometric software.
  6. Work collaboratively in a group to produce a combined output, by liaising with other class members, allocating tasks and co-ordinating.

Unit 1: Time Series and Their Features

Reading

  • Brooks, Chapter 2, Section 2.7 (Essential)
  • Brooks, Chapter 6, Sections 6.1-6.2 (Essential)

Students are expected to be familiar with the topics covered in the following sections:

  • Brooks, Chapter 1, Sections 1.5-1.6.
  • Brooks, Chapter 2, Section 2.1-2.5.

Activities

  • Brooks, Chapter 1, Self-Study Questions 6, 10, 23.
  • Brooks, Chapter 2, Self-Study Questions 1-6, 9.

Definition:

Most financial studies involve returns, rather than prices, of assets:

  • The return of an asset is a scale-free summary of an investment opportunity.
  • Return series are easier to handle than prices because of their more attractive statistical properties.

Time Series Data

  1. The term time series is used to mean both the data $\{x_t\}$ and the process $\{X_t\}$ of which it is a realization.
  2. $\{x_t\}$ and $\{X_t\}$ are shorthand for $\{x_t , t \in \mathbb{T}_0\}$ and $\{X_t , t \in \mathbb{T}_0\}$ when it is not necessary to specify the index set $\mathbb{T}_0$, a discrete set.

Mean function

$\mu(t)\equiv\mathbb{E}(X_t)$

Covariance function

$\gamma(s,t)\equiv\mathbb{Cov}(X_s,X_t)=\mathbb{E}[(X_s-\mu(s))(X_t-\mu(t))]$

Strict Stationarity

The process is said to be strictly stationary if the joint distributions of $(X_{t_1}, X_{t_2}, \dots, X_{t_n})'$ and $(X_{t_1+h}, X_{t_2+h}, \dots, X_{t_n+h})'$ are the same for all positive integers $n$ and for all $t_1, t_2, \dots , t_n$ and $h$.

Weak Stationarity

  1. $\mathbb{E}|X_t|^2<\infty$ for all $t$,
  2. $\mu(t)=\mu$ for all $t \in \mathbb{Z}$,
  3. $\gamma(t+h,t)=\gamma(h)$ for all $t\in\mathbb{Z}$, $h\in\mathbb{N}$.

Autocovariance function (ACVF)

For a weakly stationary process, $\gamma(h)\equiv\mathbb{Cov}(X_{t+h},X_t)$.

Autocorrelation function (ACF)

$\rho(h)\equiv\gamma(h)/\gamma(0)$

A weakly stationary time series $\{X_t\}$ is not serially autocorrelated if $\rho(h) = 0$ for all $h > 0$.

For a given sample $\{x_t\}$, the sample analogues are $\hat\gamma(h)=T^{-1}\sum_{t=1}^{T-h}(x_{t+h}-\bar x)(x_t-\bar x)$ and $\hat\rho(h)=\hat\gamma(h)/\hat\gamma(0)$.

If $\{X_t\}$ is a sequence of independent and identically distributed (iid) random variables satisfying $\mathbb{E}|X_t|^2<\infty$, then $\hat\rho(h)$ is approximately $N(0,1/T)$ for large $T$.

The interval $[-\frac{1.96}{\sqrt{T}},\frac{1.96}{\sqrt{T}}]$ is the 95% non-rejection region for the null $\rho(h) = 0$. We can use the confidence interval to test individual ACFs.

Box and Pierce (1970) proposed the statistic

$Q^*(m)=T\sum_{h=1}^m\hat\rho^2(h)$

as a test statistic for $H_0 : \rho(1) = \dots = \rho(m) = 0$ versus $H_1 : \rho(k) \ne 0$ for some $k \in \{1, \dots, m\}$. The null is rejected if $Q^*(m)$ lies in the upper tail of $\chi^2(m)$.

Ljung and Box (1978) modified the statistic to improve the small-sample properties of the test:

$Q(m)=T(T+2)\sum_{h=1}^m\frac{\hat\rho^2(h)}{T-h}$
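As a hedged illustration, the sketch below computes sample autocorrelations with the $\pm 1.96/\sqrt{T}$ band and the Ljung-Box statistic using statsmodels; the series `x` is simulated placeholder data, not course data.

```python
# A minimal sketch of individual and joint ACF tests (illustrative data).
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
x = rng.standard_normal(500)          # placeholder for a return series
T = len(x)

rho = acf(x, nlags=10)                # sample autocorrelations rho(0)..rho(10)
band = 1.96 / np.sqrt(T)              # 95% band under the iid null
print([(h, round(r, 3), abs(r) > band) for h, r in enumerate(rho[1:], 1)])

# Ljung-Box Q(m) for m = 10: a small p-value rejects H0: rho(1)=...=rho(m)=0
print(acorr_ljungbox(x, lags=[10], return_df=True))
```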

Self-Study Questions:

CHP 2, Q5: Which is a more useful measure of central tendency for stock returns, the arithmetic mean or the geometric mean? Explain your answer.

The geometric return is always less than or equal to the arithmetic return, and so the geometric return is a downward-biased predictor of future performance.

If the objective is to summarise historical performance, the geometric mean is more appropriate, but if we want to forecast future returns, the arithmetic mean is the one to use.

CHP 2, Q10: Real return = simple return − inflation.

Unit 2: Linear Models for Stationary Time Series

Reading

  • Brooks, Chapter 6, Sections 6.1-6.8 (Essential)
  • Diebold, Chapter 6, Sections 6.1-6.6 (Recommended)
  • Diebold, Chapter 7, Sections 7.1-7.2 (Recommended)

Activities

  • Brooks, Chapter 6, Self-Study Questions 1-9

Definition:

White noise

  • mean 0 and variance $\sigma^2$, $\epsilon_t\sim WN(0,\sigma^2)$
  • $\gamma(h)=\begin{cases}\sigma^2 & \text{if } h= 0\\ 0 & \text{if } h\ne 0 \end{cases}$
  • If the $\epsilon_t$ are iid, $\{\epsilon_t\}$ is called a strong white noise.

Moving Average Models (MA)

  • $X_t = \epsilon_t+\theta_1\epsilon_{t-1}+\dots+\theta_q\epsilon_{t-q}$, $\epsilon_t\sim WN(0,\sigma^2)$

  • $\gamma(h)=\begin{cases}\sigma^2(\theta_h+\sum_{j=1}^{q-h}\theta_{h+j}\theta_j)&0\leq h\leq q\\0&h>q\end{cases}$

  • The order q of an MA(q) can be identified from the ACF, which cuts off after lag q.

MA(1)

  • $X_t=\mu+\epsilon_t+\theta\epsilon_{t-1}$
  • For $Y_{t-k}=X_{t-k}-\mu=\epsilon_{t-k}+\theta\epsilon_{t-k-1}$

By backward substitution

  • If $k\to\infty$ and $|\theta|<1$, $Y_t=-\sum_{j=1}^\infty(-\theta)^jY_{t-j}+\epsilon_t$

Autoregressive Models (AR)

  • $X_t = \phi_0+\phi_1X_{t-1}+\dots+\phi_pX_{t-p}+\epsilon_t$, $\epsilon_t\sim WN(0,\sigma^2)$
  • The order p of an AR(p) can be identified from the PACF, which cuts off after lag p.

AR(1)

Summation of a geometric sequence: $S_n=a_1+\dots+a_n=a_1\frac{1-q^n}{1-q}$

we obtain

  • If $k\to\infty$ and $|\phi_1|<1$, $X_t=\mu+\sum_{j=0}^\infty\phi_1^j\epsilon_{t-j}$

Yule-Walker

For AR(p) we have

Multiplying both sides by $Y_{t-h}$ and taking expectations, with $\mathbb{E}(Y_t)=0$, yields the Yule-Walker equations.

Linear projection of $X$ onto $\{1, z_1, \dots, z_n\}$:

$\mathscr{P}[X|1,z_1,\dots,z_n]=a_0+a_1z_1+\dots+a_nz_n$

where $a_0, \dots, a_n$ satisfy the orthogonality conditions

$\mathbb{E}\big[(X-\mathscr{P}[X|1,z_1,\dots,z_n])z_j\big]=0,\quad j=1,\dots,n$

This is the analogue of the OLS condition $\sum_{i=1}^n\hat{u}_ix_i=0$; here it is the regression of $X$ on the $z_j$.

Partial Autocorrelation Function (PACF)

  • $\alpha(0)\equiv 1$
  • $\alpha(1)\equiv\rho(1)=\mathbb{Corr}(X_2,X_1)$
  • $\alpha(2)=\mathbb{Corr}(e_2,e_1)=\frac{\rho(2)-\rho^2(1)}{1-\rho^2(1)}$
  • $\alpha(h)\equiv\mathbb{Corr}(e_{h+1},e_1), h>1$
    • $e_{h+1}=X_{h+1}-\mathscr{P}[X_{h+1}|1, X_2,…, X_h]$
    • $e_{1}=X_{1}-\mathscr{P}[X_1|1, X_2,…, X_h]$
  • $\alpha(h)=\phi_{hh}$
    • $\mathscr{P}[X_{h+1}|1, X_1,…, X_h]=\phi_{h0}+\phi_{h1}X_h+…+\phi_{hh}X_1$
    • $(\phi_{h1},\dots,\phi_{hh})'=\Gamma_h^{-1}\gamma_h$

If ACF and PACF are both declining geometrically $\to$ ARMA(1,1)

Box and Jenkins (1976) suggest the following approach:

  • Identification
  • Estimation
  • Diagnostic Checking

Information criteria

Let $\{e_t\}$ be the residuals from a fitted ARMA(p,q) model.

Let $k=p+q+1$. The information criteria are often written as

  • AIC $= \ln(\hat{\sigma}^2)+\frac{2k}{T}$
  • SBIC $= \ln(\hat{\sigma}^2)+\frac{k}{T}\ln(T)$
  • HQIC $= \ln(\hat{\sigma}^2)+\frac{2k}{T}\ln(\ln(T))$

where $\hat{\sigma}^2=T^{-1}\sum_{t=1}^Te_t^2$

Select p,q to minimize the information criteria.
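A minimal sketch of IC-based order selection over a small (p, q) grid, assuming a stationary series `y` (simulated here); in statsmodels' ARIMA results, `aic`, `bic` and `hqic` correspond to the AIC, SBIC and HQIC above.

```python
# Grid search over ARMA(p, q) orders, selecting by SBIC (illustrative data).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = rng.standard_normal(300)              # placeholder stationary series

best = None
for p in range(3):
    for q in range(3):
        res = ARIMA(y, order=(p, 0, q)).fit()
        if best is None or res.bic < best[0]:
            best = (res.bic, p, q)
print("selected (p, q) by SBIC:", best[1:], "BIC =", round(best[0], 2))
```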

lag operator $L$

  • $L^kX_t=X_{t-k}$

  • The AR(p) process can be rewritten as

    • $\Phi(L)X_t=\phi_0+\epsilon_t$, where $\Phi(L)=1-\phi_1L-\phi_2L^2-\dots-\phi_pL^p$
  • The solutions of the equation $\Phi(z)=0$ are called the roots of the polynomial.

Stationarity of an AR(p) process

If $\Phi(z)\ne 0$ for $|z|\leq 1$, then $\{X_t\}$ is stationary and causal, and admits the causal representation

$X_t=\mu+\sum_{j=0}^\infty\psi_j\epsilon_{t-j}$

where

  • $\mu = \frac{\phi_0}{\Phi(1)}$
  • $\Psi(z)=1+\psi_1z+\psi_2z^2+…=\Phi^{-1}(z)=1/\Phi(z)$
  • For AR(1), $\psi_j=\phi^j$ by Taylor Expansion
  • For AR(2), $\psi_0=1,\psi_1=\phi_1,\psi_j=\phi_1\psi_{j-1}+\phi_2\psi_{j-2}$

Stationarity alone only requires $\Phi(z)\ne 0$ for $|z|= 1$, so we can find a stationary (noncausal) solution of an AR(1) with $|\phi|>1$.

The equation $X_t=\phi X_{t-1}+\epsilon_t$ can be rewritten as $X_t=\phi^{-1}X_{t+1}-\phi^{-1}\epsilon_{t+1}$;

iterating forward, we get $X_t=-\sum_{j=1}^\infty\phi^{-j}\epsilon_{t+j}$.

ARMA models

An ARMA(p,q) process $\{Y_t\}$ satisfies $\Phi(L)Y_t=\Theta(L)\epsilon_t$, with $\Phi(z)=1-\phi_1z-\dots-\phi_pz^p$, $\Theta(z)=1+\theta_1z+\dots+\theta_qz^q$ and $\epsilon_t\sim WN(0,\sigma^2)$.

Autocorrelation Function

Causal Processes

A causal process admits the MA($\infty$) representation $Y_t=\sum_{j=0}^\infty\psi_j\epsilon_{t-j}$.

Invertible Processes

An invertible process admits the AR($\infty$) representation $\epsilon_t=\sum_{j=0}^\infty\pi_jY_{t-j}$.

Assume $\Phi(z)$ and $\Theta(z)$ have no common roots. Then:

  1. $\{Y_t\}$ is stationary and causal if and only if

    $\Phi(z)\ne 0$ for all $|z|\leq1$

    and the coefficients are determined by the relationship

    $\Psi(z)=\sum_{j=0}^\infty\psi_jz^j=\frac{\Theta(z)}{\Phi(z)}$

  2. $\{Y_t\}$ is invertible if and only if

    $\Theta(z)\ne 0$ for all $|z|\leq1$

    and the coefficients are determined by the relationship

    $\Pi(z)=\sum_{j=0}^\infty\pi_jz^j=\frac{\Phi(z)}{\Theta(z)}$

  3. The ARMA(p,q) model is often preferred for parsimony reasons:

    $\frac{\Theta(z)}{\Phi(z)}=1+\sum_{j=1}^\infty\psi_j^{(p,q)}z^j$

Roots of a quadratic polynomial $az^2+bz+c=0$: $z=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$

On the representation of ARMA processes

We have $\Psi(z)=\frac{\Theta(z)}{\Phi(z)}$

Rewrite as $\Psi(z)\Phi(z)=\Theta(z)$ with $\phi_0=\theta_0=1$, i.e.

Evaluating at $z=0$, we find $\psi_0=1$

Taking the first derivative,

Evaluating at $z=0$, we find $\psi_1-\phi_1\psi_0=\theta_1$, so $\psi_1=\phi_1+\theta_1$

Taking the derivative again,

Evaluating at $z=0$, we find $2\psi_2-2\phi_2-2\phi_1\psi_1=2\theta_2$, so $\psi_2=\theta_2+\phi_2+\phi_1^2+\phi_1\theta_1$

Repeating the steps above, we find the general recursion $\psi_j=\theta_j+\sum_{k=1}^{\min(j,p)}\phi_k\psi_{j-k}$ (with $\theta_j=0$ for $j>q$).

Lab:

ACF/PACF confidence interval: $\pm 1.96/\sqrt{T}$

Persistence: how long it takes for a shock to die out (long when $\theta \to 1$).

Self-Study Questions:

CHP 6, Q8: Comment on the following statement: ‘Given that the objective of any econometric modelling exercise is to find the model that most closely ‘fits’ the data, then adding more lags to an ARMA model will almost invariably lead to a better fit. Therefore a large model is best because it will fit the data more closely.’

In most financial series, there is a substantial amount of ‘noise’. This can be interpreted as a number of random events that are unlikely to be repeated in any forecastable way. We want to fit a model to the data which will be able to ‘generalise’. In other words, we want a model which fits to features of the data which will be replicated in future; we do not want to fit to sample-specific noise.

This is why we need the concept of ‘parsimony’ – fitting the smallest possible model to the data. Otherwise we may get a great fit to the data in the sample, but any use of the model for forecasts could yield terrible results.

Another important point is that the larger the number of estimated parameters (i.e., the more variables we have), then the smaller will be the number of degrees of freedom, and this will imply that coefficient standard errors will be larger than they would otherwise have been. This could lead to a loss of power in hypothesis tests, and variables that would otherwise have been significant are now insignificant.

CHP 6, Q9: You obtain the following sample autocorrelations and partial autocorrelations for a sample of 100 observations from actual data:

| Lag  | 1     | 2     | 3     | 4      | 5      | 6     |
|------|-------|-------|-------|--------|--------|-------|
| ACF  | 0.420 | 0.104 | 0.032 | −0.206 | −0.138 | 0.042 |
| PACF | 0.632 | 0.381 | 0.268 | 0.199  | 0.205  | 0.101 |

This looks like an MA(1); the significant lag-4 ACF is a typical wrinkle that one might expect with real data and should probably be ignored.

Use the Ljung–Box Q* test to determine whether the first three autocorrelation coefficients taken together are jointly significantly different from zero.
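A quick numerical check of the Ljung-Box statistic for this question, using $T=100$ and the first three sample autocorrelations from the table above:

```python
# Ljung-Box Q(3) = T(T+2) * sum_k rho(k)^2 / (T-k) for the values above.
rho = [0.420, 0.104, 0.032]
T = 100
Q = T * (T + 2) * sum(r**2 / (T - k) for k, r in enumerate(rho, 1))
print(round(Q, 2))  # ~19.41, above the chi2(3) 5% critical value of 7.81,
                    # so the first three autocorrelations are jointly significant
```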

Unit 3: Forecasting with ARMA models

Reading

  • Brooks, Chapter 6, Sections 6.10.1 - 6.10.8 (Essential)
  • Diebold, Chapter 6, Sections 6.7, 6.8, 7.3-7.5 (Essential)

Activities

  • Brooks, Chapter 6, Self-Study Questions 10(a)-(b); 11(a)-(c); 12(a)-(e).

Definition:

  • h : forecast horizon
  • Univariate information set : $\Omega_T \equiv\{X_t,t\leq T\}$
  • The definition of the forecast horizon implies that we are dealing with out-of-sample forecasts.

Forecast types

  1. Point Forecast: A single number.
  2. Interval Forecast: A range of values in which we expect the realized value of the series to fall with a given probability. The length of the interval depends on the uncertainty surrounding the point forecast.
  3. Density Forecast: The conditional probability distribution of the series at a given forecast horizon.

Loss Function

  • $\{X_t\}^T_{t=1}$ : Series
  • Point forecast : $f_{T,h}=\mathbb{E}(X_{T+h}|\Omega_T)$
  • Forecast error : $e_{T,h}\equiv X_{T+h}-f_{T,h}$
  • Mean Squared Error (MSE) : $\mathbb{E}[e_{T,h}^2]$
  • $\mathbb{E}[X_{T+h}-\mathbb{E}(X_{T+h}|\Omega_T )]^2\leq \mathbb{E}[(X_{T+h}-g(\Omega_T))^2]$ for any forecast $g(\Omega_T)$: the conditional mean is the MSE-optimal point forecast
  • Loss function : $l(e_{T,h})$

We will consider $Y_t=X_t-\mu, \mathbb{E}(Y_t)=0$

Forecasting based on lagged $\epsilon$’s

Consider a stationary process $\{Y_t\}$ with Wold representation

$Y_t=\sum_{j=0}^\infty\psi_j\epsilon_{t-j},\quad \epsilon_t\sim WN(0,\sigma^2)$

with $\psi_0=1$ and $\sum_{j=0}^\infty \psi_j^2<\infty$

Recall that the projection $f_{T,h}=\mathscr{P}[Y_{T+h}|\epsilon_j,j\leq T]$ takes the form $\sum_{j=0}^\infty\beta_j\epsilon_{T-j}$.

So $\psi_h=\beta_0,\psi_{h+1}=\beta_1,….$

The optimal linear forecast takes the form

$f_{T,h}=\sum_{j=0}^\infty\psi_{h+j}\epsilon_{T-j}$

The forecast error takes the form

$e_{T,h}=\sum_{j=0}^{h-1}\psi_j\epsilon_{T+h-j}$

with $\mathbb{E}(e_{T,h})=0$ and $\mathbb{Cov}(e_{T,h},\epsilon_j)=0$ for $j\leq T$.

We also note that $e_{T,h}$ follows an MA(h−1) process, and

$\mathbb{Var}(e_{T,h})=\sigma^2\sum_{j=0}^{h-1}\psi_j^2$

that is, the MSE (risk) approaches $\mathbb{Var}(Y_t)$ as $h \to \infty$.

$Y_t$ can be rewritten as $Y_t=\Psi(L)\epsilon_t$, with $\Psi(L)=\sum_{j=0}^\infty\psi_jL^j$.

Divide $\Psi(L)$ by $L^h$.

The annihilation operator $[\cdot]_+$ replaces the negative powers of $L$ with zero.

Then, the optimal linear forecast can be rewritten as

$f_{T,h}=\left[\frac{\Psi(L)}{L^h}\right]_+\epsilon_T$

Forecasting based on lagged $Y_t$’s

If $Y_t$ admits the AR($\infty$) representation $\Pi(L)Y_t=\epsilon_t$, with $\Pi(L)=\Psi^{-1}(L)$,

we can obtain the forecast of $Y_{T+h}$ as a function of $\Omega_T$:

$f_{T,h}=\left[\frac{\Psi(L)}{L^h}\right]_+\Pi(L)\,Y_T$

known as the Wiener-Kolmogorov prediction formula.

Forecasting an AR(1) process

1-step-ahead forecast: $f_{T,1}=\phi Y_T$

2-step-ahead forecast: $f_{T,2}=\phi f_{T,1}=\phi^2 Y_T$

h-step-ahead forecast: $f_{T,h}=\phi f_{T,h-1}=\phi^h Y_T$

We refer to this approach as the chain rule of forecasting

Noting that $\Psi(z)=\frac{1}{1-\phi z}=\sum_{j=0}^\infty\phi^jz^j$,

we have $\left[\frac{\Psi(L)}{L^h}\right]_+=\frac{\phi^h}{1-\phi L}$.

Hence, using the Wiener-Kolmogorov prediction formula, $f_{T,h}=\frac{\phi^h}{1-\phi L}(1-\phi L)Y_T=\phi^hY_T$.
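A minimal numerical sketch of the chain rule of forecasting for a demeaned AR(1); the values of $\phi$ and $Y_T$ below are illustrative, not from the course.

```python
# Chain rule: f_{T,h} = phi * f_{T,h-1}, which equals phi^h * Y_T.
phi, y_T = 0.8, 1.5                    # illustrative parameter and last value

f = y_T
for h in range(1, 6):
    f = phi * f                        # recursive (chain-rule) forecast
    print(h, round(f, 4), round(phi**h * y_T, 4))  # the two columns agree
```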

Forecasting an AR(2) process

By backward substitution we get

The Wiener-Kolmogorov prediction formula gives

The same result can be obtained using the chain rule

Forecasting an MA(1) process

Note that $Y_{T+1}=\epsilon_{T+1}+\theta\epsilon_T$.

Hence, $f_{T,1}=\theta\epsilon_T$ and $f_{T,h}=0$ for $h>1$,

and view $\theta\epsilon_T$ as the outcome of the infinite recursion $\epsilon_T=Y_T-\theta\epsilon_{T-1}$.

Interval and Density forecasts

Let $\{Y_t\}$ be a stationary and invertible Gaussian ARMA process. Then, by construction,

$e_{T,h}\sim N(0,\mathbb{Var}(e_{T,h}))$

The 95% h-step-ahead interval forecast is

$\left[f_{T,h}-1.96\sqrt{\mathbb{Var}(e_{T,h})},\ f_{T,h}+1.96\sqrt{\mathbb{Var}(e_{T,h})}\right]$

The h-step-ahead density forecast is $N(f_{T,h},\mathbb{Var}(e_{T,h}))$
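A hedged sketch of interval and density forecasts from a fitted Gaussian ARMA in statsmodels (simulated placeholder data); `get_forecast` returns $f_{T,h}$ and the standard error $\sqrt{\mathbb{Var}(e_{T,h})}$.

```python
# Point, interval and density forecasts from an ARMA(1,1) fit (illustrative).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = rng.standard_normal(300)               # placeholder series

res = ARIMA(y, order=(1, 0, 1)).fit()
fc = res.get_forecast(steps=5)
point = fc.predicted_mean                  # f_{T,h}, h = 1..5
se = fc.se_mean                            # sqrt(Var(e_{T,h}))
lower, upper = point - 1.96 * se, point + 1.96 * se  # 95% interval forecasts
# the h-step density forecast is N(point[h-1], se[h-1]**2)
print(np.column_stack([point, lower, upper]))
```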

Making the forecasts operational

  • For forecasting an AR(p) process, an optimal h−step-ahead forecast based on $\Omega_T$ makes use only of the p most recent values $\{Y_T , Y_{T −1}, \dots , Y_{T −p+1}\}$.
  • For an MA or ARMA process we would need to know $\Omega_T$
    to use the Wiener-Kolmogorov prediction formula.

For an MA(1) process, suppose $\theta$ is not known.

We replace the optimal linear forecast $f_{T,1}=\theta\epsilon_T$ with $\hat{f}_{T,1}=\hat\theta\epsilon_T$.

The forecast error will take the form $\hat e_{T,1}=\epsilon_{T+1}+(\theta-\hat\theta)\epsilon_T$.

$\mathbb{Var}(\hat{e}_{T,1})$ will also account for the variability of $\hat\theta$

MSE $=\frac{1}{N}\sum_t e_{t,h}^2$

MAE $=\frac{1}{N}\sum_t |e_{t,h}|$


Self-Study Questions:

CHP 6, Q12.a: Compared with financial data, macroeconomic explanatory variables have the disadvantage that they are typically available only on a quarterly, or at best monthly, basis, i.e. at low frequency.

CHP 6, Q12.c: If the PACF coefficients are jointly significantly different from zero but the PACF shows no clear cutoff, while the ACF does cut off sharply, the real data may not belong to the ARMA family; we should nevertheless still try to find the best available model.

Unit 4: ARIMA models for nonstationary time series

Reading

  • Brooks, Chapter 8, Sub-sections 8.1.1 - 8.1.4 (Essential)
  • Brooks, Chapter 8, Sub-sections 8.1.5-8.1.7, Section 8.2(Further)

Activities

  • Brooks, Chapter 8, Self-Study Questions 1-3.

Definition:

To deal with such nonstationarity we begin by characterizing a time series as

$X_t=\mu_t+U_t$

where

  • $U_t$: a zero-mean stationary process
  • $\mu_t=\sum_{j=0}^m\beta_jt^j$: such a trend is said to be deterministic.

The deterministic trend can either be estimated or be removed by transformation

If $m=1$ and $U_t=\epsilon_t$, then $X_t=\beta_0+\beta_1t+\epsilon_t$.

Differencing instead gives $\Delta X_t=\beta_1+\Delta\epsilon_t$; the MA process $\Delta\epsilon_t$ is stationary but not invertible.

Stochastic Trends

  • Increasing trends alternate with decreasing trends.

  • A process which is not stationary in mean could be modelled as an ARMA where $\Phi(z)=0$ for some $|z| \leq 1$.

  • These processes are characterized by stochastic trends. More later (Beveridge-Nelson Decomposition)

Random Walk

The representation

$X_t=X_0+\sum_{j=1}^t\epsilon_j$

  • $X_0\in\mathbb{R}$ denotes the initial value

The “memory” of the process is non-decreasing

  • For $t$ large compared to $h$, $\mathbb{Corr}(X_t , X_{t+h})\simeq 1$.

The 1-step-ahead forecast of the random walk is $f_{T,1}=X_T$.

  • Similarly, $f_{T,h}=X_T$ for every $h\geq1$.

The h−step-ahead forecast error is $e_{T,h}=\sum_{j=1}^h\epsilon_{T+j}$,

  • so that $\mathbb{Var}[e_{T,h}]=h\sigma^2$, which diverges to $\infty$ when $h\to\infty$.

Random Walk with Drift

If $X_t$ denotes the log price of a stock, the random walk hypothesis entails that the log returns $\Delta X_t=\epsilon_t$ are unpredictable.

If $\Delta X_t=\mu+\epsilon_t$, then $X_t=X_0+\mu t+\sum_{j=1}^t\epsilon_j$

  • $\{X_t\}$ is said to be random walk with drift

Unit root processes

The random walk can be represented as

$(1-L)X_t=\epsilon_t$

  • Because $\Delta=(1-L)$, the root of the characteristic equation $1-z=0$ is one.

Consider the more general example

The lag polynomial can be factorized as

The process $\{\Delta X_t\}$ satisfying

Integrated processes

A time series $\{X_t\}$ is said to be integrated of order $d$, written $X_t\sim I(d)$, if $\Delta^dX_t$ is $I(0)$.

  • $\{X_t\}\sim I(d)$ is sometimes said difference stationary

Stochastic processes integrated of order 0

A time series satisfying $X_t-\mathbb{E}(X_t)=\sum_{j=0}^\infty\psi_j\epsilon_{t-j}$ is said to be integrated of order zero if $\Psi(1)=\sum_{j=0}^\infty\psi_j\ne0$ (with $\sum_{j=0}^\infty|\psi_j|<\infty$).

Let $|\theta|<1$

  • If $X_t=X_{t-1}+\epsilon_t-\theta\epsilon_{t-1}$, then $\Delta X_t=(1-\theta L)\epsilon_t$ and $\Psi(1)=1-\theta\ne0$.

Hence, $\Delta X_t$ is an $I(0)$ process.

  • If $X_t=\epsilon_t-\theta\epsilon_{t-1}$, then $\Delta X_t=(1-L)(1-\theta L)\epsilon_t$ and $\Psi(1)=0$.

Hence, $\Delta X_t$ is not an $I(0)$ process.

ARIMA(p,d,q) process

A time series $\{X_t\}$ is called an ARIMA process of order (p,d,q), written $\{X_t\}\sim ARIMA(p,d,q)$, if $\Delta^dX_t$

is a stationary and invertible ARMA(p,q) process.

The Beveridge-Nelson Decomposition

Let $X_t$ be an ARIMA(p,1,q) and $\Psi(L)\equiv\Phi^{-1}(L)\Theta(L)$. Then,

$X_t=\mu t+\Psi(1)\sum_{j=1}^t\epsilon_j+\Psi^\star(L)\epsilon_t+k_0$

where

  • $\mu t$ is the deterministic (linear) trend
  • $\Psi(1)\sum_{j=1}^t\epsilon_j$ is the stochastic trend
  • $\Psi^\star(L)=\sum_{j=0}^\infty\psi_j^\star L^j$, $\psi_j^\star=-\sum_{k=j+1}^\infty\psi_k$
  • $k_0$ denotes the initial condition

Characteristics of I(0) processes

  • $\mathbb{Var}(X_t)$ is finite and does not depend on $t$.
  • The innovation $\epsilon_t$ has a temporary effect on $X_t$.
  • The expected length of time between crossing of $\mu$ is finite, so that $X_t$ fluctuates around its mean $\mu$.
  • The autocorrelation $\rho(h)$ decreases in magnitude for large enough $h$, so their sum is finite.

Characteristics of I(1) processes

  • $\mathbb{Var}(X_t)$ goes to infinity as $t$ goes to infinity.
  • The innovation $\epsilon_t$ has a permanent effect on $X_t$.
  • The process is not mean-reverting.
  • The autocorrelation $\rho(h)\to 1$ for all $h$ as $t\to\infty$.

Unit roots testing

Dickey Fuller Test

Main idea: test $\phi=1$ in the regression model

$X_t=\phi X_{t-1}+\epsilon_t$

against the one-sided alternative $\phi<1$.

If we rewrite the regression model as

$\Delta X_t=\lambda X_{t-1}+\epsilon_t,\quad \lambda=\phi-1$

and assume that $|\phi|\leq1$ ($|\phi|>1$ ruled out),

we can test the familiar hypothesis $H_0:\lambda=0$ against $H_1:\lambda<0$.

The standard t-test for statistical significance, based on $\hat\lambda/se(\hat\lambda)$, does not follow the usual distribution under the null; the Dickey-Fuller critical values must be used.

The Augmented Dickey Fuller (ADF) Test

The tests above are only valid if $\{\epsilon_t\}\sim WN(0,\sigma^2)$. An "overly parsimonious" specification will result in autocorrelated errors.

If $X_t$ satisfies $(1-\sum_{j=1}^p\phi_jL^j)X_t=\epsilon_t$, the following always holds:

$\Delta X_t=-\lambda X_{t-1}+\sum_{j=1}^{p-1}\phi_j^*\Delta X_{t-j}+\epsilon_t$

where $\lambda=1-\phi_1-\dots-\phi_p$ and $\phi_j^*=-\sum_{i=j+1}^p\phi_i$.

  • $p$ should be large enough to ensure that $\epsilon_t$ is white noise.
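A minimal sketch of the ADF test via statsmodels' `adfuller`, run on a simulated random walk so the unit-root null is true by construction; `autolag="AIC"` picks the lag order so that the residuals are close to white noise.

```python
# ADF unit-root test on an illustrative random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(500))    # a random walk: I(1) by construction

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(x, regression="c",
                                                     autolag="AIC")
print(stat, pvalue, crit)  # we expect to fail to reject the unit-root null
```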

Example

If we have

$X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+\phi_3X_{t-3}+\epsilon_t$

this series is I(1) if $(1-\phi_1-\phi_2-\phi_3)=0$.

We construct the ADF regression

$\Delta X_t=-\lambda X_{t-1}+\phi_1^*\Delta X_{t-1}+\phi_2^*\Delta X_{t-2}+\epsilon_t$

Problems with Unit Root tests

  • Reject the $I(1)$ null too often when it is true, if $\Delta Y_t$ is an ARMA(p,q) with a large and negative MA component.
  • Low power against $I(0)$ alternatives that are close to being $I(1)$ .
  • Fail to reject $I(1)$ when $Y_t$ is $I(0)$ around a trend function with a break.

Self-Study Questions:

CHP 8, Q1.b: Why is it in general important to test for non-stationarity in time series data before attempting to build an empirical model?

If two series are non-stationary, we may experience the problem of ‘spurious’ regression. This occurs when we regress one non-stationary variable on a completely unrelated non-stationary variable, but obtain a reasonably high value of $R^2$, apparently indicating that the model fits well.

Most importantly therefore, we are not able to perform any hypothesis tests in models which inappropriately use non-stationary data since the test statistics will no longer follow the distributions which we assumed they would (e.g., t or F), so any inferences we make are likely to be invalid.

Unit 5: Conditional Heteroskedasticity

Reading

Activities

  • Brooks, Chapter 9, Self-Study Questions, Question 1 (except (f)), Question 3(a) and Question 5

Definition:

The Jarque-Bera statistic is a method for testing whether a sample can be regarded as coming from a normal population.

Volatility models

Let $\mathbb{E}(r_t|\Omega_{t-1})\equiv \mu_t$ and define $\epsilon_t\equiv r_t-\mu_t$, with

$\epsilon_t=\sigma_t\eta_t$

where $\sigma_t$ is a deterministic function of $\Omega_{t-1}$, $\sigma_t>0$, $\{\eta_t\}$ is i.i.d.(0,1), $\sigma_t\in\Omega_{t-1}$, $\eta_t\perp \Omega_{t-1}$.

So with $\eta_t\sim N(0,1)$, $\epsilon_t|\Omega_{t-1}\sim N(0,\sigma_t^2)$.

The conditional distribution of $\epsilon_t$ thus changes over time: the larger $\sigma_t^2$, the more spread out the normal density.

Thickness of the tails is measured by the kurtosis

$\mathcal{K}_X=\frac{\mathbb{E}(X-\mu)^4}{[\mathbb{Var}(X)]^2}$

Kurtosis reflects the sharpness of the peak (and the heaviness of the tails).

If $\sigma_t^2=\sigma^2$, then $\mathcal{K}_\epsilon=\mathcal{K}_\eta$

The normal distribution has kurtosis 3; if the residuals display much higher kurtosis, this points to heteroskedasticity.

Fat tails can be modelled by a leptokurtic (peaked, heavy-tailed) distribution of $\{\eta_t\}$ and/or variability of $\{\sigma_t^2\}$.

The ARCH(p) models

$\sigma_t^2=\omega+\sum_{i=1}^p\alpha_i\epsilon_{t-i}^2,\quad \omega>0,\ \alpha_i\ge0$

  • $\mathbb{E}(\epsilon_t|\Omega_{t-1})=\sigma_t\mathbb{E}(\eta_t|\Omega_{t-1})=0$, implying that, for $s\ne t$
  • $\epsilon_t,\epsilon_s$ are uncorrelated, but NOT independent!
  • $\mathbb{Var}(\epsilon_t|\Omega_{t-1})=\sigma_t^2\mathbb{E}(\eta_t^2|\Omega_{t-1})=\sigma_t^2$

Volatility Clustering

  • Large past squared shocks $\epsilon_{t-i}^2(i>0)$imply a large conditional variance $\sigma_t^2$ for $\epsilon_t$
  • The magnitude of the noise is a function of its past value
  • If $\mathbb{Var}(\sigma_t^2)\ne 0$, then $\mathcal{K}_\epsilon>\mathcal{K}_\eta$ (excess kurtosis)

ARCH(1)

$\sigma_t^2=\omega+\alpha\epsilon_{t-1}^2$

with $\omega,\alpha>0$.

$v_t=\epsilon_t^2-\sigma_t^2=\sigma_t^2(\eta_t^2-1)$

For $\alpha<1$, $\mathbb{E}(\epsilon_t^2)=\frac{\omega}{1-\alpha}$.

If $\alpha<1$, then $\epsilon_t$ is weakly stationary.

The GARCH(p,q) models

Let

$\sigma_t^2=\omega+\sum_{i=1}^p\alpha_i\epsilon_{t-i}^2+\sum_{j=1}^q\beta_j\sigma_{t-j}^2$

with $\omega>0$, $\alpha_i\ge0$, $\beta_j\ge0$.

ARMA representation of GARCH

The original equation can be rewritten as

$\epsilon_t^2=\omega+\sum_{i=1}^{\max(p,q)}(\alpha_i+\beta_i)\epsilon_{t-i}^2+v_t-\sum_{j=1}^q\beta_jv_{t-j}$

  • using the convention $\alpha_i=0$ ($\beta_i=0$) if $i>p$ ($i>q$)

The $\sigma_t^2$ terms have been replaced using $v_t=\epsilon_t^2-\sigma_t^2$.

Bollerslev (1986), Theorem 1, shows that if

$\sum_{i=1}^p\alpha_i+\sum_{j=1}^q\beta_j<1$

then $\epsilon_t$ is weakly stationary, and $\{\epsilon_t^2\}\sim$ ARMA(max(p,q), q).

GARCH(1,1)

and the autocorrelation function of $\epsilon_t^2$ is

GARCH models are often estimated using the maximum likelihood estimator based on the Gaussian Likelihood (GMLE)

  • If $\eta_t\sim NID(0,1)$, then $\epsilon_t|\Omega_{t-1}\sim N(0,\sigma_t^2)$
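A hedged sketch of GMLE estimation of a GARCH(1,1) using the `arch` package; the returns `r` are simulated placeholders. The standardized residuals $\epsilon_t/\hat\sigma_t$ should then look roughly i.i.d., as required by the specification strategy below.

```python
# Gaussian (quasi-)MLE for a constant-mean GARCH(1,1) on illustrative returns.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(4)
r = rng.standard_normal(1000)             # placeholder return series

am = arch_model(r, mean="Constant", vol="GARCH", p=1, q=1, dist="normal")
res = am.fit(disp="off")
print(res.params)                         # mu, omega, alpha[1], beta[1]
std_resid = res.resid / res.conditional_volatility  # should behave ~ iid(0,1)
```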

Specification strategy for GARCH models

“Specific-to-general” approach

  1. Specify an adequate model for $\mu_t=\mathbb{E}(r_t|\Omega_{t-1})$.

  2. Test for the presence of conditional heteroskedasticity;

  3. Select p,q and estimate GARCH models (use IC);

  4. Evaluate the model by misspecification tests, e.g. $\epsilon_t/\hat\sigma_t$ should behave like an i.i.d. sequence.

  5. Estimate a more suitable GARCH model (if necessary);

Extensions of the GARCH models

Drawbacks of GARCH(p,q) Models:

  • Non-negativity constraints may be violated.

  • Symmetric response to past shocks.

The leverage effect

Large price falls tend to be followed by price movements of similar magnitude, i.e. by higher volatility.

  • Negative shocks appear to contribute more to stock market volatility than do positive shocks. (stylized fact )
  • A negative shock to the market value of equity increases the debt/equity ratio (other things the same), increasing leverage.
  • The asymmetric response of the volatility to past shocks is known as leverage effect.

The EGARCH (Exponential GARCH) Models

$\ln(\sigma_t^2)=\omega+\beta\ln(\sigma_{t-1}^2)+\theta\eta_{t-1}+\gamma\left(|\eta_{t-1}|-\mathbb E|\eta_{t-1}|\right)$

where, for $\eta_t\sim N(0,1)$,

$\mathbb E|\eta_t|=\sqrt{2/\pi}$

  • $\sigma_t^2$ will be positive (without imposing non-negativeness restrictions on the parameters)
  • The asymmetry property is taken into account through $\theta$.
  • The leverage effect will imply that $\theta<0$.
  • Innovations of large modulus should increase the volatility, entailing that we expect $\gamma>0$.

An Example

  • If $\eta_{t-1}<0$ (i.e. $\epsilon_{t-1}<0$), the variable $\ln(\sigma_t^2)$ will be larger than its mean $\omega$.

  • If $\eta_{t-1}>0$ (i.e. $\epsilon_{t-1}>0$), the variable $\ln(\sigma_t^2)$ will be smaller than its mean $\omega$.

Thus, we obtain the typical asymmetry property of financial time series.

The Threshold GARCH (TGARCH) model

The GJR-GARCH (Glosten, Jagannathan and Runkle) models

$\sigma_t^2=\omega+\sum_{i=1}^p\left(\alpha_i+\gamma_i\,\mathbb{1}\{\epsilon_{t-i}<0\}\right)\epsilon_{t-i}^2+\sum_{j=1}^q\beta_j\sigma_{t-j}^2$

  • The asymmetry is accounted for through the $\gamma_i$

  • $\omega>0,(\alpha_i+\gamma_i)\ge0,\beta_j\ge0$ are sufficient for $\sigma_t^2>0$

News Impact Curves (NIC)

  • The news impact curve plots $\sigma_t^2$ against $\epsilon_{t-1}$, setting $\sigma_{t-h}^2=\sigma^2$, for $h>0$

Forecasting

Consider the AR(1)-GARCH(1,1) model

$X_t=\phi X_{t-1}+\epsilon_t,\quad \epsilon_t=\sigma_t\eta_t,\quad \sigma_t^2=\omega+\alpha\epsilon_{t-1}^2+\beta\sigma_{t-1}^2$

with $\omega>0,\alpha,\beta\ge0,\alpha+\beta<1,|\phi|<1$.

Recall that $f_{T,h}=\phi^hX_T$.

The 1-step-ahead forecast of $\sigma_{T+1}^2$ is

$f_{T,1}^{\sigma^2}=\omega+\alpha\epsilon_T^2+\beta\sigma_T^2$

  • GARCH model are deterministic volatility models.

  • If the parameters are known, $\sigma_{T+1}^2$ is a deterministic function of $\Omega_T$ .

The 2-step-ahead forecast is

$f_{T,2}^{\sigma^2}=\omega+(\alpha+\beta)f_{T,1}^{\sigma^2}$

because $\mathbb{E}(\epsilon_{T+1}^2|\Omega_T)=\sigma_{T+1}^2=f_{T,1}^{\sigma^2}$.

In general, the h-step-ahead forecast is

$f_{T,h}^{\sigma^2}=\omega+(\alpha+\beta)f_{T,h-1}^{\sigma^2}$

By recursive substitutions it follows that

$f_{T,h}^{\sigma^2}=\sigma^2+(\alpha+\beta)^{h-1}\left(f_{T,1}^{\sigma^2}-\sigma^2\right)$

  • If $(\alpha+\beta)<1$, $\lim_{h\to\infty}f_{T,h}^{\sigma^2}=\sigma^2$, where $\sigma^2=\frac{\omega}{1-\alpha-\beta}$ (a numerical sketch follows below)
  • If $(\alpha+\beta)=1$ (IGARCH), $f_{T,h}^{\sigma^2}=f_{T,1}^{\sigma^2}+(h-1)\omega$, which grows without bound
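A minimal numerical sketch of the recursion $f_{T,h}^{\sigma^2}=\omega+(\alpha+\beta)f_{T,h-1}^{\sigma^2}$, with illustrative parameter values (not from the course):

```python
# h-step GARCH(1,1) variance forecasts converging to the long-run variance.
omega, alpha, beta = 0.1, 0.05, 0.90
sigma2_uncond = omega / (1 - alpha - beta)   # long-run variance, here 2.0
f = 1.2                                      # assumed 1-step forecast f_{T,1}
for h in range(2, 8):
    f = omega + (alpha + beta) * f           # recursion for f_{T,h}
    print(h, round(f, 4))                    # drifts toward sigma2_uncond
```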

Heteroskedasticity and interval forecasts

The $100 \times (1 - \alpha)\%$ interval forecast for $X_{T+h}$ takes the form $f_{T,h}\pm z_{1-\alpha/2}\sqrt{\mathbb{Var}(e_{T,h}|\Omega_T)}$.

It is usual to construct interval forecasts that are symmetric around the conditional mean of $X_{T+h}$

Self-Study Questions:

CHP 9, Q1.d: Describe two extensions to the original GARCH model. What additional characteristics of financial data might they be able to capture?

EGARCH, GJR or GARCH-M. The first two of these are designed to capture leverage effects: asymmetries in the response of volatility to positive or negative returns. The EGARCH model also has the added benefit that it is expressed in terms of the log of $h_t$, so that even if the parameters are negative, the conditional variance will always be positive. The GARCH-M model allows the lagged value of the conditional variance to affect the return.

Unit 6: Multivariate Time Series

Reading

  • Brooks, Chapter 1, Section 1.7 (Essential - Week 6)
  • Brooks, Chapter 7, Sections 7.10 - 7.16 (Essential - Week 7)
  • Diebold, Chapter 16 (Recommended)

Activities

  • Brooks, Chapter 1, Self-Study Questions 21, 22.

Definition:

We consider $n$ possibly related time series $X_{1t}, \dots , X_{nt}$, and define the $n$−vector $X_t\equiv[X_{1t},\dots,X_{nt}]'$.

Weak Stationarity

  • $\mathbb E(X_t)\equiv \mu < \infty$ for all $t$
  • $\mathbb E(X_t-\mu)(X_t-\mu)’\equiv\Gamma(0)<\infty$ for all $t$
  • $\mathbb E(X_t-\mu)(X_{t-h}-\mu)’\equiv\Gamma(h)$ for all $t$

The expectation of a vector

The variance-Covariance matrix

When $h=0$

When $h\ne0$

Diagonal matrix

Correlation Matrix

The multivariate white noise

Let $\epsilon_t\equiv[\epsilon_{1t},…,\epsilon_{nt}]’$

  • $\mathbb E(\epsilon_t)=0$
  • $\Gamma(h)=\mathbb E(\epsilon_t\epsilon_{t-h}')=\begin{cases}\Sigma & \text{for } h=0\\ 0 & \text{for } h\ne 0\end{cases}$

For individual shocks

  • $\mathbb E(\epsilon_{it})=0$
  • $\mathbb E(\epsilon_{it}^2)=\sigma_i^2$
  • $\mathbb E(\epsilon_{it}\epsilon_{i,t-h})=0$ for $h\ne0$

We do allow for contemporaneous correlation

  • $\mathbb E(\epsilon_{it}\epsilon_{jt})=\sigma_{ij}$
  • $\mathbb E(\epsilon_{it}\epsilon_{j,t-h})=0$ for $h\ne0$

Vector Autoregressive Processes (VAR)

Let $Y_t\equiv X_t-\mu$. A VAR(p) process satisfies the equation

$\Phi(L)Y_t=\epsilon_t,\quad \Phi(L)=I_n-\Phi_1L-\dots-\Phi_pL^p$

where $\epsilon_t$ is multivariate white noise and $I_n$ denotes the n−dimensional identity matrix.

Example

For $n = 2, p = 2$

Stationarity of a VAR

Consider the VAR(1) process

$Y_t=\Phi_1Y_{t-1}+\epsilon_t$

By recursively substituting,

$Y_t=\Phi_1^tY_0+\sum_{j=0}^{t-1}\Phi_1^j\epsilon_{t-j}$

When $\max_{j=1,\dots,n}|\lambda_j(\Phi_1)|<1$, $\Phi_1^t\to 0$ as $t\to\infty$.

The previous condition (all eigenvalues of $\Phi_1$ inside the unit circle)

is equivalent to

$\det(I_n-\Phi_1z)\ne0$ for all $|z|\leq1$

If the above conditions are satisfied,

$Y_t=\sum_{j=0}^\infty\Phi_1^j\epsilon_{t-j}$

A VAR(p) process is stationary if

$\det(I_n-\Phi_1z-\dots-\Phi_pz^p)\ne0$ for all $|z|\leq1$

Remarks

The VAR(p) processes belong to the class of VARMA(p,q) processes

$\Phi(L)Y_t=\Theta(L)\epsilon_t$

where $\Phi(L)$ and $\Theta(L)$ are matrix polynomials in the lag operator.

VARMA(p,q) processes are difficult to estimate when $q > 0$. VAR models are usually used in empirical applications.

Applying VAR models

Suppose that the appropriate order p for the VAR model has been found, that is, a VAR(p−1) is misspecified and a VAR(p+1) contains too many redundant parameters, and that the parameters have been estimated.

The resultant empirical model can be used for various purposes.

  1. Out-of-sample forecasting.
  2. Granger causality analysis.
  3. Structural analysis (impulse response function and error variance decomposition).

Forecasting

Consider the VAR(1) model. The h−step forecast at time T is given by

$f_{T,h}=\Phi_1^hY_T$

The forecast error covariance matrix equals

$\sum_{j=0}^{h-1}\Phi_1^j\Sigma(\Phi_1^j)'$

Example

Granger Causality

An important statistical notion of causality that’s intimately related to forecasting is based on two key principles

  • Cause should occur before effect.

  • A causal series should contain information useful for forecasting that is not available in the other series.

We can say that $X$ Granger-causes $Y$, "$X \Rightarrow Y$", if past values of $X$ contain information useful for forecasting $Y$ beyond that contained in past values of $Y$ alone (a smaller forecast MSE).

If $Z_t=[X_t,Y_t]'$ can be represented as a bivariate VAR,

then $X\not\Rightarrow Y$ if $\phi_{12}(L)\equiv\sum_{j=1}^p\phi_{12,j}L^j=0$.

In the equation for $Y_t$,

the hypothesis $X\not\Rightarrow Y$ is then equivalent to $H_0:\phi_{12,1}=\dots=\phi_{12,p}=0$.

If the VAR is stationary an F−test can be used.
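A hedged sketch of a Granger-causality F-test using statsmodels' VAR; the two-column array `z` is placeholder bivariate data standing in for $[X_t, Y_t]'$.

```python
# Granger-causality F-test in a stationary VAR (illustrative data).
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(5)
z = rng.standard_normal((400, 2))          # placeholder for [X_t, Y_t]

res = VAR(z).fit(maxlags=2)
# H0: variable 0 (X) does not Granger-cause variable 1 (Y)
test = res.test_causality(caused=1, causing=0, kind="f")
print(test.summary())
```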

Impulse Response Functions (IRF)

To understand the dynamic effects of the error process $\epsilon_t$ on $Y_{t+h}$, one can calculate the so-called impulse-response function. Consider the MA($\infty$) representation

$Y_t=\sum_{j=0}^\infty\Psi_j\epsilon_{t-j}$

with $\Sigma=diag[\sigma_1^2,\dots,\sigma_n^2]$.

Suppose there is an interest in the effect of shocks corresponding to the first variable.

One can then calculate

and

The n elements of the $V_k$ vector series, $k = 0, 1, 2, . . . , h$ are called the impulse-response functions

The i-th impulse-response function is the trajectory $\{v_{ik},k=0,1,…,h\}$

If we consider a stationary VAR(p) process,

the impulse-response function is defined as

$v(i,j,k)=[\Psi_k]_{ij}$

i.e., the effect on variable $i$ of a shock to variable $j$, $k$ periods earlier.

$v(i,j,k)$ represents the response of $Y_{it}$ to a unitary shock in $Y_{j,t-k}$ (produced by a unitary shock in $\epsilon_{j,t-k}$).

$\Sigma$ is not required to be diagonal, so the component of $\epsilon_t$ may be contemporaneously correlated.

If the correlations are high, there is no way of separating the response of $Y_{it}$ to a shock on $\epsilon_{j,t−k}$ from its response to other shocks that are correlated to $\epsilon_{j,t−k}$.

If we define the square invertible matrix $S$ such that

$S^{-1}\Sigma (S^{-1})'=I_n$

then, setting $\tilde\Psi_j\equiv\Psi_jS$ and $\xi_t=S^{-1}\epsilon_t$,

$Y_t=\sum_{j=0}^\infty\tilde\Psi_j\xi_{t-j}$

This turns the originally strongly correlated residuals into white noise with identity covariance.

We can define the structural impulse-response function as

  • Note that if $\epsilon_t=S\xi_t$, then $\Sigma=SS’$

$S$ is often defined as a lower triangular matrix (Cholesky decomposition).

The fact that $S$ is triangular implies that $\epsilon_{1t}$ is a function of the first structural shocks $\xi_{1t}, \epsilon_{2t}$ is a function of the structural shocks $\xi_{1t}$ and $\xi_{2t}$ and so on.
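A minimal numerical sketch: for a VAR(1), $\Psi_k=\Phi_1^k$, and the orthogonalized (structural) responses are $\tilde\Psi_k=\Psi_kS$ with $S$ from the Cholesky factorization $\Sigma=SS'$. The matrices below are illustrative, not the course example.

```python
# Orthogonalized impulse responses for an illustrative bivariate VAR(1).
import numpy as np

Phi = np.array([[0.5, 0.1],
                [0.2, 0.4]])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
S = np.linalg.cholesky(Sigma)          # lower triangular, Sigma = S @ S.T

Psi = np.eye(2)                        # Psi_0 = I
for k in range(4):
    print("k =", k, "\n", Psi @ S)     # structural IRF matrix at lag k
    Psi = Phi @ Psi                    # Psi_{k+1} = Phi^{k+1}
```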

Example

Let $Y_t=\Phi Y_{t-1}+\epsilon_t,\epsilon_t\sim(0,\Sigma)$, with

Hence,

and so on.

Note that $\Sigma=SS’$, with

which is equivalent to $\tilde\Psi_0$. Moreover

and so on.

In practice $\Sigma$ is estimated by

$\hat\Sigma=\frac{1}{T}\sum_{t=1}^T\hat\epsilon_t\hat\epsilon_t'$

and the Cholesky decomposition is computed on $\hat\Sigma$, that is, $\hat\Sigma=\hat S\hat S'$.

Forecast error variance decomposition

The orthogonalized error variance decomposition is defined as

Expressed as a percentage, one can thus examine the relative importance of the shocks to $Y_j$ in the error variance when forecasting $Y_i$ h-steps ahead.

Nonstationary VAR

Consider the n−variate VAR

The VAR process is stationary if $\det\Phi(z)\ne0$ for all $|z|\leq1$.

Recall that

where $\Phi_j^*=\sum_{k=j+1}^p\Phi_k$ and $\Pi\equiv\Phi(1)$

Beveridge-Nelson decomposition: take the I(1) case.

Augmented Dickey Fuller (ADF) Test

  • If $\Pi$ is a matrix of zeros, the model reduces to a VAR(p−1) in first differences.

  • If $\Pi\ne 0$ and $det(\Pi)\ne0$ the process is stationary.

  • If $\Pi\ne 0$ but $det(\Pi)=0$, that is Π has reduced rank, then $Y_t$ is a cointegrated process.

    The purpose of a cointegration test is to determine whether a linear combination of a group of nonstationary series has a stable equilibrium relationship.

  • If $rank(\Pi)=r<n$, then

    $\Pi=\alpha\beta'$

    where $\alpha$ and $\beta$ are $(n \times r)$ full column rank matrices.

    Hence, the VAR admits the representation

    $\Delta Y_t=\alpha\beta'Y_{t-1}+\sum_{j=1}^{p-1}\Phi_j^*\Delta Y_{t-j}+\epsilon_t$

    where $\beta'Y_t$ needs to be stationary!

Unit 7: Cointegration

Reading

  • Brooks, Chapter 8, Sections 8.3 - 8.11 (Essential)
  • Appendix C (Essential)

Activities

  • Brooks Chapter 8, Self-Study Questions 5,6,7

Definition:

Cointegration

An n-variate time series $\{X_t\}$ is called a cointegrated system of order ($d,b$), written $\{X_t\} \sim CI(d, b)$, if all the components are I($d$) and there exists an $n \times r$ ($r < n$) matrix $\beta \ne 0$ such that $\beta'X_t \sim I(d − b)$, with $d \ge b > 0$. The vectors $\beta_k, k = 1, \dots , r$ are called the cointegrating vectors.

Examples of possible Cointegrating Relationships in finance:

  • Spot and futures prices, spot and forward prices.
  • Bid and ask prices.
  • Ratio of relative prices and an exchange rate (law of one price)
  • Equity prices and dividends

Market forces arising from no arbitrage conditions should ensure an equilibrium relationship

The Engle-Granger Method (cointegration test)

First regress one variable on the others by OLS to estimate $\gamma$, then apply an ADF test to the residuals (the null is a unit root in the residuals, i.e. no cointegration).

  • If the null is not rejected, $\beta=[1,-\gamma']'$ is not a cointegrating vector.
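A hedged sketch of this two-step procedure via statsmodels' `coint`, which regresses one series on the other and runs the residual-based ADF test; the two simulated series below share a common stochastic trend by construction.

```python
# Engle-Granger cointegration test on illustrative cointegrated series.
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(10)
w = np.cumsum(rng.standard_normal(400))    # common stochastic trend
s = w + rng.standard_normal(400)           # e.g. log spot
f = w + rng.standard_normal(400)           # e.g. log futures

stat, pvalue, crit = coint(s, f)
print(stat, pvalue)   # small p-value: reject the null of no cointegration
```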

Stock prices and Dividends

The dividend-price ratio $\Lambda_t$ is usually measured as

$\Lambda_t=\frac{D_t}{S_t}$

where

  • $D_t$: the sum of dividends paid on the stock/index over the previous year

  • $S_t$: the current price of the stock/ current level of the index

Hence, in logs, $\lambda_t\equiv\ln\Lambda_t=d_t-s_t$ (lowercase letters denote logarithms).

Assume $\lambda_t=\lambda+u_t$, with $u_t\sim I(0)$

Let

which can be rewritten as

and we have

This model can be generalized to

  • known as an Error Correction Model (ECM)

  • In an ECM, the change in a variable depends on the deviations from some equilibrium relation

Let $x_t=(s_t,d_t)’$ and assume that

Then,

Consider 3 situations:

  1. $d_{t-1}-s_{t-1}-\lambda=0$ (equilibrium)

    where $\mu$ represent the growth rates of stock prices (and dividends) in the long-run.

  2. $d_{t-1}-s_{t-1}-\lambda>0$ (positive disequilibrium error)

    If $\alpha<0$, the model predicts that $d_t$ will grow more slowly than its long-run rate to restore the dividend-yield to its long-run mean.

  3. $d_{t-1}-s_{t-1}-\lambda<0$ (negative disequilibrium error)

    If $\alpha<0$, the model predicts that $d_t$ will grow faster than its long-run rate to restore the dividend-yield to its long-run mean.

VARs with Integrated Variables

The VAR(p)

$\Phi(L)X_t=\epsilon_t$

can be re-written as

$\Delta X_t=\Pi X_{t-1}+\sum_{j=1}^{p-1}\Phi_j^*\Delta X_{t-j}+\epsilon_t$

  • $\Phi_j^*=-\sum_{k=j+1}^p\Phi_k$ and $\Pi \equiv -\Phi(1)$

  • $\Pi$ is known as the long-run matrix

  • The representation is the multivariate counterpart of the ADF regression

  • If $det(\Pi) = 0$ then the $VAR(p)$ is said to have at least a unit root

  • If $det(\Pi) = 0$, then $rank(\Pi) = r < n$.

  • If $r = 0$, then $\Pi = 0$ (n unit roots), and $\Delta X_t$ is a $VAR(p − 1)$.

  • If $0< r < n$, then $\Pi=\alpha\beta'$,

  • where $\alpha$ and $\beta$ are $(n \times r)$ matrices.

  • $\beta$ is known as the matrix of cointegrating vectors

  • The system will contain only $(n − r)$ unit roots

  • and we obtain the Vector Error Correction Model (VECM)

    $\Delta X_t=\alpha\beta'X_{t-1}+\sum_{j=1}^{p-1}\Phi_j^*\Delta X_{t-j}+\epsilon_t$

Let $A$ be an $(r \times r)$ non-singular matrix. Then,

$\Pi=\alpha\beta'=(\alpha A^{-1})(A\beta')$

so $\alpha$ and $\beta$ are identified only up to such transformations. Let

$\beta'=[\beta_1'\ \ \beta_2']$

and assume that $\beta_1$ is invertible. To identify the matrices, the normalization $\beta'=[I_r\ \ \tilde\beta_2']$ is often imposed.

Testing for Cointegrating

Johansen (1988) proposed to estimate the model using the MLE, assuming Gaussianity of $\{\epsilon_t\}$.

MLE estimation is based upon a known value of the cointegrating rank r (usually unknown).

The maximum eigenvalue statistic $\lambda_{max}$ has been proposed to test the hypothesis

$H_0: rank(\Pi)=r_0 \quad\text{against}\quad H_1: rank(\Pi)=r_0+1$

The statistic $λ_{max}$ is constructed as follow:

  • A positive semi-definite matrix $\Xi$ having the same rank as $\Pi$ is defined. Let $\lambda_k\equiv\lambda_k(\Xi)$, $k = 1, \dots , n$, ordered so that $\lambda_1\ge\dots\ge\lambda_n\ge0$; by definition only the first $rank(\Pi)$ of them are nonzero.

  • Obtain a consistent estimate $\hat{\Xi}$. Then $\lambda_k(\hat{\Xi})$ are consistent estimates of $\lambda_k, k = 1,…, n$

  • Test the null hypothesis $H_0 : −T ln(1 − \lambda_{r_0+1}) = 0$, for $r_0 = 0, …, n − 1$ until we fail to reject

  • Let $r^∗_0$ the value of $r_0$ for which the null hypothesis cannot
    be rejected for the first time. Then, $\hat r = r^∗_0$

The trace statistic $\lambda_{trace}$ has been proposed to test the hypothesis

$H_0: rank(\Pi)\le r_0 \quad\text{against}\quad H_1: rank(\Pi)> r_0$

The statistic $\lambda_{trace}$ is constructed as

$\lambda_{trace}(r_0)=-T\sum_{k=r_0+1}^n\ln(1-\hat\lambda_k)$

  • Test the sequence of null hypotheses $H_0 : \lambda_{trace}(r_0) = 0$, from $r_0 = 0$ to $r_0 = n − 1$, until we fail to reject.

$r_0$ is increased sequentially, one step at a time.
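A hedged sketch of the Johansen procedure using statsmodels' `coint_johansen` on illustrative simulated data; `lr1` holds the trace statistics and `lr2` the maximum-eigenvalue statistics, each with 90/95/99% critical values.

```python
# Johansen trace and max-eigenvalue tests on two cointegrated I(1) series.
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(6)
w = np.cumsum(rng.standard_normal(400))             # common stochastic trend
x = np.column_stack([w + rng.standard_normal(400),
                     w + rng.standard_normal(400)]) # cointegrated pair

jres = coint_johansen(x, det_order=0, k_ar_diff=1)  # constant, 1 lagged diff
print("trace stats:  ", jres.lr1, "\n95% c.v.:     ", jres.cvt[:, 1])
print("max-eig stats:", jres.lr2, "\n95% c.v.:     ", jres.cvm[:, 1])
```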

Forecasting, Causality and IRFs

Causality Testing in VECMs

Assume $n = 2$ and $X_t = (Y_t , Z_t)'$. The VECM takes the form

where

The hypothesis that $Z$ does not Granger-cause $Y$ may be formalized as

Testing the hypothesis is not straightforward.

Alternative approaches (as the lag-augmented VAR) have been proposed.

Forecasting Integrated and Cointegrated systems

  • Forecasting can be discussed in the framework of the levels VAR representation.

  • A VECM may be rewritten in VAR form, or forecasting can be obtained directly from the VECM.

    If a variable enters the system in differenced form only, it is of course still possible to generate forecasts of the levels. See Brooks, Rew and Ritson (2001), Section 4.

Impulse Response Analysis

In principle, impulse response analysis in cointegrated systems can be conducted in the same way as for stationary systems.

Self-Study Questions:

CHP 8, Q6.a: Suppose that a researcher has a set of three variables and wishes to test for the existence of cointegrating relationships using the Johansen procedure. What is the implication of finding that the rank of the appropriate matrix takes on a value of 0, 1, 2 or 3?

If the rank of the matrix is zero, this implies that there is no cointegration, i.e. no common stochastic trends between the series. A rank of one or two would imply that there were, respectively, one or two linearly independent cointegrating vectors, i.e. combinations of the series that would be stationary. A finding that the rank is 3 would imply that the matrix is of full rank. The implication of a rank of 3 would be that the original series were stationary, which, provided that unit root tests had been conducted on each series, would have effectively been ruled out.

Unit 8: Forecasts Evaluation (Part I)

Reading

  • Brooks, Chapter 6, Section 6.10 (Essential)
  • Diebold, Chapter 10 (Essential, Subsections 10.2.2 and 10.2.4 excluded )

Activities

  • Brooks Chapter 6, Self-Study Questions 11(d), 12(f).
  • Diebold, Section 10.4, Exercises, Problems and Complements 1-3. The solutions can be found here (See Chapter 12, Exercises 1.2.6).

Definition:

Rolling vs Recursive Samples

  • Rolling: a fixed-length estimation window that moves forward (M1-M12; M2-M13; M3-M14; …)
  • Recursive: an expanding window with a fixed starting point (M1-M12; M1-M13; M1-M14; …)

Testing Properties of Optimal Forecasts

  1. Have a zero mean (unbiasedness).

  2. 1-step-ahead optimal forecast errors are white noise.

  3. h-step-ahead optimal forecast errors are at most MA(h − 1).

  4. The h-step-ahead optimal forecast error variance is non-decreasing in h.

Hypothesis 1

  • Regress $e_{t,h}$ on a constant and use the reported t−statistic:

  • Multi-step-ahead forecast errors will be serially correlated because the forecast periods overlap.

  • Fitting a MA(h − 1) is a good initial guess to model autocorrelation in the regression error.
  • Robust standard errors can be used (e.g. Newey-West estimator).
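A minimal sketch of the unbiasedness regression with Newey-West (HAC) standard errors to handle the MA(h − 1) overlap; the error series `e` and the truncation lag are illustrative.

```python
# Test H0: E(e_{t,h}) = 0 by regressing errors on a constant with HAC s.e.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
e = rng.standard_normal(200)              # placeholder h-step forecast errors
h = 4

res = sm.OLS(e, np.ones_like(e)).fit(cov_type="HAC",
                                     cov_kwds={"maxlags": h - 1})
print(res.params[0], res.tvalues[0])      # estimated bias and its t-statistic
```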

Hypothesis 2

1-step-ahead optimal forecast errors are white noise.

  • Regress $e_{t,1}$ on a constant and test the null hypothesis that the residuals are white noise.

Hypothesis 3

The h-step-ahead optimal forecast errors are at most MA(h − 1).

  • Examine the statistical significance of the sample ACF(k), $k > h − 1$, using Bartlett's standard errors.

  • Regress $e_{t,h}$ on a constant, allowing for MA(q) disturbances, with $q > h − 1$, and test.

Hypothesis 4

The h-step ahead optimal forecast error variance is non-decreasing in h.

Is {$e_{t,h}$} orthogonal to available information?

Mincer and Zarnowitz (1969) proposed to test partial optimality with respect to $f_{t+h,t}$ using the regression

$X_{t+h}=\alpha+\beta f_{t+h,t}+u_{t+h}$

or, equivalently,

$e_{t+h,t}=\alpha+(\beta-1)f_{t+h,t}+u_{t+h}$

If the forecast is optimal with respect to the information set used to construct it, then we'd expect $(\alpha,\beta)=(0,1)$.

Ramsey's (1969) test accounts for various sorts of nonlinearity, adding powers of the forecast to the regression.

If the forecast is optimal with respect to the information set used to construct it, then we'd expect the coefficients on these added terms to be zero.

In general $\Omega_t$ doesn’t include all information available at the time the forecast was made. We will deal with forecasts that will be at most partially optimal.

Relative standards for point forecasts

Accuracy ranking via expected loss

  • Forecast accuracy is measured with respect to a loss function, $\mathscr L(e_{t,h})$, and the forecast horizon h

Forecast error

Accuracy measures are defined on the forecast errors

$e_{t,h}=X_{t+h}-f_{t,h}$

or on the percent forecast error

$p_{t,h}=100\cdot\frac{e_{t,h}}{X_{t+h}}$

Mean error

$ME(h)=\frac{1}{N}\sum_t e_{t,h}$

  • The ME measures the bias.
  • If ME > 0, then on average we are "under-forecasting" ("over-forecasting" if ME < 0).
  • Other things the same, we prefer a forecast whose errors have small bias.

Error Variance

$EV(h)=\frac{1}{N}\sum_t (e_{t,h}-\bar e_h)^2$

  • EV measures the dispersion of the forecast errors.
  • Other things the same, we prefer a forecast whose errors have small variance.
  • Although ME and EV are components of accuracy, neither provides an overall accuracy measure.

Mean Square Error

The most common overall accuracy measure is the mean squared error

$MSE(h)=\mathbb{E}[e_{t,h}^2]$

In sample we write

$\widehat{MSE}(h)=\frac{1}{N}\sum_t e_{t,h}^2$

Mean Square Error Decomposition

$MSE(h)=ME(h)^2+EV(h)$

Bias-variance trade-off: a small bias increase is acceptable in exchange for a massive variance reduction.

Root Mean Square Error

Often the square root of the MSE is used to preserve units.

Suppose that the forecast errors are measured in dollars, then the MSE is measured in dollars squared.

Taking the square root brings the unit back to dollars.

Mean Absolute Error

A somewhat less popular, but nevertheless common, measure is the mean absolute error

$MAE(h)=\frac{1}{N}\sum_t |e_{t,h}|$

A quadratic loss function puts more weight on large errors than the absolute loss does.

Statistical comparison of forecast accuracy

  • Suppose that two different forecasts of the same object are available, say $f_{t,h}^{(1)}$ and $f_{t,h}^{(2)}$.

  • Suppose that $\hat{MSE}^{(1)}(h) < \hat{MSE}^{(2)}(h)$

  • One is tempted to conclude that $f_{t,h}^{(1)}$ "wins" under the quadratic loss function ($\mathscr L$). But is the difference in accuracy statistically significant?

  • The Diebold-Mariano test answers that question.

  • In hypothesis testing terms, we might want to test the null of equal predictive accuracy

    $H_0:\mathbb E[\mathscr L(e_{t,h}^{(1)})]=\mathbb E[\mathscr L(e_{t,h}^{(2)})]$, i.e. $\mathbb E(d_{12,t})=0$,

    against the alternative that one forecast is better, i.e.

    against $\mathbb E(d_{12,t})\ne 0$, where $d_{12,t}=\mathscr L(e_{t,h}^{(1)})-\mathscr L(e_{t,h}^{(2)})$

The Diebold-Mariano Test

  • Diebold and Mariano assume that $d_{12,t}$ is covariance stationary (its mean and autocovariances do not depend on $t$)

    and has short memory, i.e. $\sum_{k=-\infty}^\infty|\gamma_d(k)|<\infty$

  • Under $H_0:\mathbb E(d_{12,t})=0$,

    $DM_{12}=\frac{\bar d_{12}}{\hat\sigma_{\bar d_{12}}}\xrightarrow{d}N(0,1)$

    where $\bar d_{12} = (N − h + 1)^{−1}\sum^{N−h}_{t=0} d_{12,t}$ has an estimated standard deviation equal to $\hat\sigma_{\bar d_{12}}$
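A hedged sketch of the Diebold-Mariano test under quadratic loss, computing the loss differential and a HAC t-statistic via a regression on a constant; the forecast error series are placeholders.

```python
# DM test: d_t = e1_t^2 - e2_t^2, DM = dbar / se(dbar) with h-1 HAC lags.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(8)
e1 = rng.standard_normal(150)          # placeholder errors, forecast 1
e2 = rng.standard_normal(150)          # placeholder errors, forecast 2
h = 2

d = e1**2 - e2**2                      # loss differential (quadratic loss)
res = sm.OLS(d, np.ones_like(d)).fit(cov_type="HAC",
                                     cov_kwds={"maxlags": h - 1})
dm = res.tvalues[0]                    # asymptotically N(0,1) under H0
print(dm, 2 * (1 - stats.norm.cdf(abs(dm))))   # statistic and p-value
```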

Benchmark Comparisons: the predictive $R^2$

To assess the forecasts, one might compare a forecast to a “naive” competitor .

The Predictive $R^2$

$R^2_{pred}=1-\frac{\sum_t e_{t,1}^2}{\sum_t (Y_t-\bar Y)^2}$

considers $\bar Y$ as a benchmark for $f_{t,1}$.

  • The (estimate of) 1-step-ahead out-of-sample forecast error variance is compared to an estimate of unconditional variance.
  • The predictive $R^2$ should be close to 1 if the forecast is by far more accurate than $\bar Y$ .
  • The h-step-ahead version of the predictive $R^2$ is defined replacing $e_{t,1}$ with $e_{t,h}$.

Benchmark Comparison: Theils’s U-Statistic

The Theil U-statistic is similar to a predictive $R^2$, but the benchmark changes from $\bar Y$ to a "no change" forecast: $U$ compares the model's 1-step-ahead MSE with that of the no-change forecast.

Many economic variables may in fact be nearly random walks.

In this case the forecaster will have great difficulty beating the random walk (RW), for which $f_{t,1} = Y_{t}$.

This method cannot be applied to GARCH (volatility) forecasts, because of the $Y$ component.

Evaluating direction-of-change-forecasts

  • In terms of profitability of a trading strategy, a forecast can be assessed in terms of the ability to predict direction changes irrespective of their magnitude.
  • The accuracy of the forecast is measured in terms of % correct sign prediction.
  • We can also test the null hypothesis of no predictive power.

Introduce the indicator variables

Let $P_y = Pr(Y_{t+1} > 0)$ and $P_f = Pr(f_{t,1} > 0)$. Define

Denote by $\hat P$ the proportion of times that the sign of $Y_{t+1}$ is predicted correctly, i.e.

  • The assessment is based on a test for the null that $Z_t^y$ and $Z_t^f$ are independent (no predictive power).

  • Under the null, $N\hat P$ has a binomial distribution with mean $NP_*$, where

    $P_*=P_yP_f+(1-P_y)(1-P_f)$

    is the probability of a correct sign prediction by chance.

  • A nonparametric test of predictive performance of $f_{t,1}$​ can be based on

    Under independence, $P(xy)=P(x)P(y)$, so $\hat P-\hat P_*$ should be close to $0$.

    where

    and

Unit 8: Forecasts Evaluation (Part II)

Activities

  • Brooks Chapter 6, Self-Study Questions 11(d), 12(f).
  • Diebold, Section 10.4, Exercises, Problems and Complements 1-3. The solutions can be found here (See Chapter 12, Exercises 1.2.6).

Definition:

Evaluating Volatility Forecasts

We can compute the forecasts $f_{t,1}^{\sigma^2}$ but we don’t observe $\sigma_{t+1}^2$

If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim iid(0,1)$, a popular proxy is $\epsilon_{t+1}^2$, because $\mathbb E(\epsilon_{t+1}^2|\Omega_t)=\sigma_{t+1}^2$​.

$\sigma_{t+1}^2$ is measurable with respect to $\Omega_t$.

Assume we have generated the series of 1-step-ahead-point forecasts $\{f_{T+j,1}^{\sigma^2}\}_{j=1}^N$

To simplify the notation we write

For example, the MSE is computed as

$\widehat{MSE}=\frac{1}{N}\sum_{j=1}^N\left(\epsilon_{T+j}^2-f_{T+j-1,1}^{\sigma^2}\right)^2$

$\epsilon_t^2$ is an unbiased proxy of $\sigma_t^2$

Alternative volatility proxies have been proposed (see Section 9.18 Brooks).

If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim NID(0,1)$, then

$\Pr\left(\epsilon_t^2<\tfrac{1}{2}\sigma_t^2\right)=\Pr\left(\eta_t^2<\tfrac{1}{2}\right)\approx0.52$

  • i.e. $\epsilon_t^2<\frac{1}{2}\sigma_t^2$ more than fifty percent of the time.

So $\epsilon_t^2$ is a very noisy, poor proxy.

Even if one is willing to accept a proxy that is up to 50% different from $\sigma_t^2$, $\epsilon_t^2$ would fulfil this condition only 25% of the time

(i.e., only with 25% probability is $\epsilon_t^2$ within 50% of $\sigma_t^2$).

Transforming Volatility Forecasts into Probability Forecasts

Lopez (2001) proposed an alternative forecast evaluation framework.

If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim D(0,1)$, and $\sigma_t^2$ is a predictable function of $\Omega_{t-1}$, then

volatility forecasts can be readily transformed into probability forecasts.

Out-of-sample $P_{t|t-1}$ for $t = 1, . . . , N$ is the one-step-ahead probability forecast conditional on $\Omega_{t-1}$

Probability Forecasts

Let $Y_t=\mu_t+\epsilon_t$

Suppose for example that a Central Bank is interested in forecasting whether the exchange rate ($Y_t$) will remain within a target zone

In such a case, the event of interest is $\{L_t \leq Y_t \leq U_t\}$,

where $L_t$ and $U_t$ are fixed by the Central Bank (forecast user).

Assuming that D is continuous and $\mu_t$ and $\sigma_t$ are known

  • Our aim is to forecast $Pr (L_t \leq Y_t \leq U_t)$ at time $t − 1$.

Then, the one-step-ahead probability forecast $P_t$ is computed as

$P_{t|t-1}=D\left(\frac{U_t-\mu_t}{\sigma_t}\right)-D\left(\frac{L_t-\mu_t}{\sigma_t}\right)$

Remark

  • To avoid cumbersome notation, we will use $X_{t|t-1}$ to denote the one-step-ahead forecast of a variable $X_t$ , conditional on $\Omega_{t-1}$.

  • As an example of probability forecast evaluation, we will consider the Brier score.

  • The Brier score is a rough analogue of the MSE for probability forecasts.

Accuracy measures for probability forecasts are commonly called scores.

The most common is the Brier quadratic probability score

$QPS=\frac{1}{N}\sum_{t=1}^N 2\left(P_{t|t-1}-R_t\right)^2$

  • where $R_t$ takes value one if the event occurs and zero otherwise.

$QPS\in [0, 2]$ and smaller values indicate more accurate forecasts.

Evaluating Interval Forecasts

The Lopez approach to volatility forecast evaluation is based on time-varying probabilities assigned to fixed intervals.

Alternatively, one may fix the probabilities and vary the widths of the intervals, as in the traditional confidence intervals construction.

The objective is to construct a sequence of out-of-sample interval forecasts $\{[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\}_{t=1}^N$.

The sequence of ex-ante forecast intervals for time t, made at time $t − 1$, has coverage probability $(1 − \alpha)$.

We are going to consider the approach proposed by Christoffersen (1998).

Definition 1 (Indicator variable)

The indicator variable $I_t$ for a given interval forecast $[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]$ is defined as

$I_t=\begin{cases}1 & \text{if } Y_t\in[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\\ 0 & \text{otherwise}\end{cases}$

  • $I_t$ is a function of $Y_t$, so $I_t$ is a random variable

Definition 2 (Testing Criteria)

We say that the sequence of interval forecasts

$\{[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\}$

is efficient with respect to the information set $\Omega_{t-1}$ if

$\mathbb E(I_t|\Omega_{t-1})=1-\alpha$

If $\Omega_{t−1} = \{I_{t−1}, I_{t−2}, \dots , I_1\}$, it can be shown that efficiency

is equivalent to

$\{I_t\}\sim^{iid} Bern(1-\alpha)$

Definition 3 (Conditional coverage)

We say that a sequence of interval forecasts

has correct conditional coverage if it is efficient with respect to $\Omega_{t-1}=\{I_{t-1},\dots,I_1\}$, i.e. if $\{I_t\}\sim^{iid}Bern(1-\alpha)$.

Standard evaluation methods for interval forecasts compare the nominal coverage $(1-\alpha)$

to the true coverage $\mathbb E(I_t)$.

The interval forecast might be correct on average, but the conditional coverage might be characterized by clustered outliers.

LR test of unconditional coverage

Consider the indicator sequence $\{I_t\}^N_{t=1}$ constructed from a given interval forecast.

To test the unconditional coverage, the hypothesis

$H_0: \mathbb E(I_t)=p_\alpha \quad\text{against}\quad H_1: \mathbb E(I_t)\ne p_\alpha$

should be tested, given independence.

Let $n_j$ be the number of observations for which $I_t = j$, $j = 0, 1$. By construction $n_0 + n_1 = N$.
The likelihood under the null hypothesis is

$L(p_\alpha)=(1-p_\alpha)^{n_0}p_\alpha^{n_1}$

with $p_\alpha=1-\alpha$, and under the alternative

$L(\pi)=(1-\pi)^{n_0}\pi^{n_1}$

Testing for unconditional coverage can be formulated as a likelihood ratio test,

$LR_{uc}=-2\ln\frac{L(p_\alpha)}{L(\hat\pi)}\sim\chi^2(1)$

where $\hat\pi = n_1/(n_0 + n_1)$ is the MLE of $\pi$.

  • The test has no power against the alternative that zeros and ones come clustered in a time-dependent fashion.
  • Testing for correct unconditional coverage is insufficient when dynamics are present in the higher-order moments.

  • Interval should be narrow in tranquil times and wide in volatile times, so that occurrences of observations outside the interval forecast would be spread out over the sample and not come in clusters.

  • An interval that fails to account for higher-order dynamics may be correct on average, but in any given period it will have incorrect conditional coverage characterized by clustered outliers.

The two tests presented next address this issue.

LR test of independence

Recall that a sequence of interval forecasts has correct conditional coverage if $\{I_t\} \sim^{iid} Bern(p_\alpha)$.

The “independence part” will be tested against an explicit first-order Markov alternative.

Consider a binary first-order Markov chain $\{I_t\}$, with transition probability matrix

$\Pi_1=\begin{bmatrix}1-\pi_{01} & \pi_{01}\\ 1-\pi_{11} & \pi_{11}\end{bmatrix}$

where $\pi_{ij}=Pr(I_t=j|I_{t-1}=i)$

The likelihood function for this process is

$L(\Pi_1)=(1-\pi_{01})^{n_{00}}\pi_{01}^{n_{01}}(1-\pi_{11})^{n_{10}}\pi_{11}^{n_{11}}$

where $n_{ij}$ is the number of observations with value $i$ followed by $j$.

The MLE (Maximum Likelihood Estimate) of $\Pi_1$ is

$\hat\pi_{01}=\frac{n_{01}}{n_{00}+n_{01}},\quad \hat\pi_{11}=\frac{n_{11}}{n_{10}+n_{11}}$

Under the null of independence, $\Pi_1$ simplifies to $\pi_{01}=\pi_{11}=\pi$.

The likelihood function under the null becomes

$L(\pi)=(1-\pi)^{n_{00}+n_{10}}\pi^{n_{01}+n_{11}}$

The MLE of $\pi$ is

$\hat\pi=\frac{n_{01}+n_{11}}{n_{00}+n_{01}+n_{10}+n_{11}}$

The LR test of independence is

$LR_{ind}=-2\ln\frac{L(\hat\pi)}{L(\hat\Pi_1)}\sim\chi^2(1)$

Next we will test jointly for independence and correct probability parameter $\pi_{ind}=p_\alpha$.

Combining the two test we get a complete test for correct conditional coverage (cc).

The Joint Test of Coverage and Independence

The main idea is to test the null of unconditional coverage against the alternative of the independence test.

Christoffersen (1998) defines the test for correct conditional coverage as

$LR_{cc}=-2\ln\frac{L(p_\alpha)}{L(\hat\Pi_1)}\sim\chi^2(2)$

If the likelihood is computed conditional on the first observation (as we did), then

$LR_{cc}=LR_{uc}+LR_{ind}$
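A minimal sketch computing $LR_{uc}$, $LR_{ind}$ and $LR_{cc}$ from a 0/1 hit sequence, following the likelihoods above; the indicator sequence `I` is simulated, with $I_t=1$ when $Y_t$ falls inside the interval forecast.

```python
# Christoffersen's coverage and independence tests on an illustrative series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
I = (rng.random(500) < 0.95).astype(int)   # placeholder indicator sequence
p_alpha = 0.95                             # nominal coverage 1 - alpha

n1 = I.sum(); n0 = len(I) - n1
pi_hat = n1 / (n0 + n1)

# Unconditional coverage: H0: Pr(I_t = 1) = p_alpha
LR_uc = -2 * ((n0 * np.log(1 - p_alpha) + n1 * np.log(p_alpha))
              - (n0 * np.log(1 - pi_hat) + n1 * np.log(pi_hat)))

# Independence against a first-order Markov alternative
n = np.zeros((2, 2))
for a, b in zip(I[:-1], I[1:]):
    n[a, b] += 1                           # transition counts n_ij
pi01 = n[0, 1] / n[0].sum(); pi11 = n[1, 1] / n[1].sum()
logL1 = (n[0, 0] * np.log(1 - pi01) + n[0, 1] * np.log(pi01)
         + n[1, 0] * np.log(1 - pi11) + n[1, 1] * np.log(pi11))
pi2 = (n[0, 1] + n[1, 1]) / n.sum()
logL0 = ((n[0, 0] + n[1, 0]) * np.log(1 - pi2)
         + (n[0, 1] + n[1, 1]) * np.log(pi2))
LR_ind = -2 * (logL0 - logL1)

LR_cc = LR_uc + LR_ind                     # chi2(2) under correct coverage
print([round(v, 3) for v in (LR_uc, LR_ind, LR_cc)],
      stats.chi2.sf(LR_cc, df=2))
```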

Example

Christoffersen (1998) evaluate the forecast methodology suggested by J.P. Morgan’s (1995) RiskMetrics using Exchange rates data for different countries.

He considers the model

The RiskMetrics interval forecast is tested against two peers.

The interval forecast suggested by J.P. Morgan is a symmetric Gaussian interval $\pm c\,\sigma_{t+1|t}$, where

$\sigma_{t+1|t}^2=\lambda\sigma_{t|t-1}^2+(1-\lambda)Y_t^2$

Here $\lambda$ is fixed at 0.94.

The distribution $D(·)$ is Gaussian, and its c.d.f. is denoted by $\Phi(·)$.
Let $c_\alpha = \Phi^{−1} (\alpha)$, then $c_\alpha$ satisfies $P(Y_{t+1}/\sigma_{t+1}\leq c_\alpha)=\alpha$

The first peer is constructed from an estimated GARCH(1,1) model with Student's t innovations.

If the Student's t distribution has 4 degrees of freedom with c.d.f. $\tau$, then for coverage $\alpha = 95\%$ we have $\tau^{-1}((1-\alpha)/2) = −2.776$ and $\tau^{−1}((1+\alpha)/2) = 2.776$.

To compute the interval, the unknown parameters in $\hat\sigma_{t+1|t}^2$ need to be replaced by estimates.

The second peer is a simple static forecast, constructed assuming that $D(\cdot)$ is Gaussian.

Let $F(\cdot)$ be the unconditional, time-invariant c.d.f. of $Y_{t+1}$.

Comparing interval forecasting of daily exchange rates.

  • Tests for UC, IND, CC across coverage rates and exchange rates.
  • Nominal coverage rate.
  • Average width of the interval prediction.
  • See Christoffersen (1996) Section 5 for further details
