【Course】Modelling and Forecasting Financial Markets
Notes based on the Glasgow course ECON5022.
General information
The course offers an introduction to modelling and forecasting financial time series. The first part of the course will be mainly devoted to analysing univariate models for the conditional mean and the conditional variance (ARMA and GARCH models). These models will be used to produce forecasts. Additional topics, e.g. multiple time series analysis and nonlinear models, may be discussed if time allows. The second part of the course will discuss forecast evaluation, aimed at monitoring and improving forecast performance. The course will be complemented by practical sessions using statistical or econometric software.
Aims
The main aims of the course are to introduce the basic models widely used to analyse and forecast financial time series, and to evaluate the forecasts produced using these models.
ILOs
By the end of this course, students will be able to:
- Select and fit the appropriate model to analyse financial time series.
- Derive the main properties of the models used to analyse and forecast financial time series.
- Produce optimal forecasts for a given information set and forecast horizon.
- Evaluate critically the forecasts.
- Model and predict financial time series using statistical/econometric software.
- Work collaboratively in a group to produce a combined output, by liaising with other class members, allocating tasks and co-ordinating.
Unit 1: Time Series and Their Features
Reading:
- Brooks, Chapter 2, Section 2.7 (Essential)
- Brooks, Chapter 6, Sections 6.1-6.2 (Essential)
Students are supposed to be familiar with the topics covered in the sections below:
- Brooks, Chapter 1, Sections 1.5-1.6.
- Brooks, Chapter 2, Section 2.1-2.5.
Activities:
- Brooks, Chapter 1, Self-Study Questions 6, 10, 23.
- Brooks, Chapter 2, Self-Study Questions 1-6, 9.
Definition:
Most financial studies involve returns, rather than prices, of assets
- Return of an asset is a scale-free summary of an investment opportunity.
- Return series are easier to handle than prices because of the more attractive statistical properties.
Time Series Data
- The term time series is used to mean both the data $\{x_t\}$ and the process $\{X_t\}$ of which it is a realization.
- $\{x_t\}$ and $\{X_t\}$ are shorthand for $\{x_t , t \in \mathbb{T}_0\}$ and $\{X_t , t \in \mathbb{T}_0\}$ when it is not necessary to specify $\mathbb{T}_0$, where $\mathbb{T}_0$ is a discrete index set.
Mean function
$\mu(t)=\mathbb{E}(X_t)$
Covariance function
$\gamma(t,s)=\mathbb{Cov}(X_t,X_s)=\mathbb{E}[(X_t-\mu(t))(X_s-\mu(s))]$
Strict Stationarity:
The process is said to be strictly stationary if the joint distributions of $(X_{t_1}, X_{t_2}, \dots, X_{t_n})'$ and $(X_{t_1+h}, X_{t_2+h}, \dots, X_{t_n+h})'$ are the same for all positive integers $n$, all $t_1, t_2, \dots, t_n$ and all shifts $h$.
Weak Stationarity:
- $\mathbb{E}|X_t|^2<\infty$,
- $\mu(t)=\mu \forall t \in \mathbb{Z}$
- $\gamma(t+h,t)=\gamma(h) \forall t\in\mathbb{Z},h\in\mathbb{N}$
Autocovariance function (ACVF)
$\gamma(h)=\mathbb{Cov}(X_{t+h},X_t)$
Autocorrelation function (ACF)
$\rho(h)=\frac{\gamma(h)}{\gamma(0)}$
A weakly stationary time series $\{X_t\}$ is not serially autocorrelated if $\rho(h) = 0$ for all $h > 0$.
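For a given sample $\{x_t\}_{t=1}^T$, the sample autocorrelation is
$\hat\rho(h)=\frac{\hat\gamma(h)}{\hat\gamma(0)},\quad \hat\gamma(h)=\frac{1}{T}\sum_{t=1}^{T-h}(x_{t+h}-\bar{x})(x_t-\bar{x})$
If $\{X_t\}$ is a sequence of independent and identically distributed (iid) random variables satisfying $\mathbb{E}|X_t|^2<\infty$, then $\hat\rho(h)$ is asymptotically $N(0,1/T)$ for each $h>0$.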
The interval $[-\frac{1.96}{\sqrt{T}},\frac{1.96}{\sqrt{T}}]$ is the 95% non-rejection region for the null $\rho(h) = 0$. We can use the confidence interval to test individual ACFs.
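The band test on individual autocorrelations can be sketched in a few lines. This is a minimal illustration that assumes the usual sample ACF with divisor $T$; the function names and the toy series are hypothetical and not part of the course material.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag) of the list x."""
    T = len(x)
    mean = sum(x) / T
    dev = [v - mean for v in x]
    gamma0 = sum(d * d for d in dev) / T  # sample variance (divisor T)
    acf = []
    for h in range(1, max_lag + 1):
        gamma_h = sum(dev[t + h] * dev[t] for t in range(T - h)) / T
        acf.append(gamma_h / gamma0)
    return acf

def outside_band(acf, T, z=1.96):
    """Lags whose sample autocorrelation lies outside +/- z / sqrt(T)."""
    bound = z / math.sqrt(T)
    return [h for h, r in enumerate(acf, start=1) if abs(r) > bound]

# Hypothetical toy series that alternates in sign, so it is strongly autocorrelated.
x = [1.0, -1.0] * 10
acf = sample_acf(x, 3)
flagged = outside_band(acf, len(x))
```

Each lag in `flagged` is one for which the individual null $\rho(h)=0$ would be rejected at the 5% level.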
Box and Pierce (1970) proposed the statistic
$Q^*(m)=T\sum_{h=1}^m\hat\rho^2(h)$
as a test statistic for $H_0 : \rho(1) = \dots = \rho(m) = 0$ versus $H_1 : \rho(h) \ne 0$ for some $h \in \{1, \dots, m\}$. The null is rejected if $Q^*(m)$ lies in the upper tail of $\chi^2(m)$.
Ljung and Box (1978) modified the statistic to improve the small-sample properties of the test
$Q(m)=T(T+2)\sum_{h=1}^m\frac{\hat\rho^2(h)}{T-h}$
Self-Study Questions:
【CHP 2:Q5】Which is a more useful measure of central tendency for stock returns − the arithmetic mean or the geometric mean? Explain your answer.
The geometric return is always less than or equal to the arithmetic return, and so the geometric return is a downward-biased predictor of future performance.
If the objective is to summarise historical performance, the geometric mean is more appropriate, but if we want to forecast future returns, the arithmetic mean is the one to use.
【CHP 2:Q10】Real Return (Simple Return -Inflation)
Unit 2: Linear Models for Stationary Time Series
Reading:
- Brooks, Chapter 6, Sections 6.1-6.8 (Essential)
- Diebold, Chapter 6, Sections 6.1-6.6 (Recommended)
- Diebold, Chapter 7, Sections 7.1-7.2 (Recommended)
Activities:
- Brooks, Chapter 6, Self-Study Questions 1-9
Definition:
White noise
- mean 0 and variance $\sigma^2$, $\epsilon_t\sim WN(0,\sigma^2)$
- $\gamma(h)=\begin{cases}\sigma^2&\text{if } h= 0\\ 0&\text{if } h\ne 0 \end{cases}$
- If iid $\to$ a strong white noise
Moving Average Models (MA)
$X_t = \epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+\dots+\theta_q\epsilon_{t-q}, \epsilon_t\sim WN(0,\sigma^2)$
$\gamma(h)=\begin{cases}\sigma^2(\theta_h+\sum_{j=1}^{q-h}\theta_{h+j}\theta_j)&0\leq h\leq q\\0&h>q\end{cases}$, with $\theta_0\equiv 1$
- The order of an MA(q) is identified from the ACF, which cuts off after lag $q$
MA(1)
- $X_t=\mu+\epsilon_t+\theta\epsilon_{t-1}$
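- For $Y_{t-k}=X_{t-k}-\mu=\epsilon_{t-k}+\theta\epsilon_{t-k-1}$
By backward substitution of $\epsilon_{t-j}=Y_{t-j}-\theta\epsilon_{t-j-1}$
$\epsilon_t=\sum_{j=0}^{k}(-\theta)^jY_{t-j}+(-\theta)^{k+1}\epsilon_{t-k-1}$
- If $k\to\infty$ and $|\theta|<1$, $Y_t=\epsilon_t-\sum_{j=1}^\infty(-\theta)^jY_{t-j}$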
Autoregressive Models (AR)
- $X_t = \phi_0+\phi_1X_{t-1}+\phi_2X_{t-2}+\dots+\phi_pX_{t-p}+\epsilon_t, \epsilon_t\sim WN(0,\sigma^2)$
- The order of an AR(p) is identified from the PACF, which cuts off after lag $p$
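AR(1)
$X_t=\phi_0+\phi_1X_{t-1}+\epsilon_t$
By backward substitution
$X_t=\phi_0\sum_{j=0}^{k}\phi_1^j+\phi_1^{k+1}X_{t-k-1}+\sum_{j=0}^{k}\phi_1^j\epsilon_{t-j}$
Using the sum of a geometric sequence, $S_n=a_1+\dots+a_n=a_1\frac{1-q^n}{1-q}$,
we obtain
$X_t=\phi_0\frac{1-\phi_1^{k+1}}{1-\phi_1}+\phi_1^{k+1}X_{t-k-1}+\sum_{j=0}^{k}\phi_1^j\epsilon_{t-j}$
- If $k\to\infty$ and $|\phi_1|<1$, $X_t=\mu+\sum_{j=0}^\infty\phi_1^j\epsilon_{t-j}$, with $\mu=\frac{\phi_0}{1-\phi_1}$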
Yule-Walker
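For AR(p) we have, with $Y_t=X_t-\mu$,
$Y_t=\phi_1Y_{t-1}+\dots+\phi_pY_{t-p}+\epsilon_t$
Multiplying both sides by $Y_{t-h}$, $h\ge1$, and taking expectations, with $\mathbb{E}(Y_t)=0$, gives the Yule-Walker equations
$\gamma(h)=\phi_1\gamma(h-1)+\dots+\phi_p\gamma(h-p)$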
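Linear projection of $X$ onto $\{1, z_1, \dots, z_n\}$
$\mathscr{P}[X|1,z_1,\dots,z_n]=a_0+a_1z_1+\dots+a_nz_n$
where $a_0, \dots, a_n$ satisfy
$\mathbb{E}[(X-\mathscr{P}[X|1,z_1,\dots,z_n])z_j]=0,\quad j=0,1,\dots,n,\ z_0\equiv1$
This is the population counterpart of the OLS condition $\sum_{i=1}^n\hat{u}_ix_i=0$; here $X$ is regressed on the $z_j$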
Partial Autocorrelation Function (PACF)
- $\alpha(0)\equiv 1$
- $\alpha(1)\equiv\rho(1)=\mathbb{Corr}(X_2,X_1)$
- $\alpha(2)=\mathbb{Corr}(e_3,e_1)=\frac{\rho(2)-\rho^2(1)}{1-\rho^2(1)}$
- $\alpha(h)\equiv\mathbb{Corr}(e_{h+1},e_1), h>1$
- $e_{h+1}=X_{h+1}-\mathscr{P}[X_{h+1}|1, X_2,…, X_h]$
- $e_{1}=X_{1}-\mathscr{P}[X_1|1, X_2,…, X_h]$
- $\alpha(h)=\phi_{hh}$
- $\mathscr{P}[X_{h+1}|1, X_1,…, X_h]=\phi_{h0}+\phi_{h1}X_h+…+\phi_{hh}X_1$
- $\alpha=\Gamma_p^{-1}\gamma(p)$
If the ACF and the PACF both decline geometrically $\to$ ARMA, e.g. ARMA(1,1)
Box and Jenkins (1976) suggest the following approach:
- Identification
- Estimation
- Diagnostic Checking
Information criteria
Let $e_t$ denote the residuals of the fitted ARMA(p,q) model
Let $k=p+q+1$. The information criteria are often written as
- AIC = $ln(\hat{\sigma}^2)+2\frac{k}{T}$
- SBIC = $ln(\hat{\sigma}^2)+\frac{k}{T}ln(T)$
- HQIC = $ln(\hat{\sigma}^2)+\frac{2k}{T}ln(ln(T))$
where $\hat{\sigma}^2=T^{-1}\sum_{t=1}^Te_t^2$
Select p,q to minimize the information criteria.
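The three criteria above are simple enough to compute by hand from a residual series. The sketch below does so directly from the formulas; the function name and the residuals are hypothetical, and in practice the residuals would come from an estimated ARMA(p,q).

```python
import math

def information_criteria(residuals, p, q):
    """AIC, SBIC and HQIC as written in the notes, with k = p + q + 1."""
    T = len(residuals)
    k = p + q + 1
    sigma2_hat = sum(e * e for e in residuals) / T  # T^{-1} * sum of squared residuals
    log_s2 = math.log(sigma2_hat)
    return {
        "AIC": log_s2 + 2 * k / T,
        "SBIC": log_s2 + k / T * math.log(T),
        "HQIC": log_s2 + 2 * k / T * math.log(math.log(T)),
    }

# Hypothetical residuals from an AR(1) fit (p = 1, q = 0).
residuals = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
ic = information_criteria(residuals, p=1, q=0)
```

To select a model, the same function would be evaluated over a grid of (p,q) and the pair with the smallest value of the chosen criterion retained.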
lag operator $L$
$L^kX_t=X_{t-k}$
The AR(p) process can be rewritten as
- $\Phi(L)X_t=\phi_0+\epsilon_t, \Phi(L)=1-\phi_1L-\phi_2L^2-\dots-\phi_pL^p$
- The solutions of the equation $\Phi(z)=0$ are called the roots of the polynomial.
Stationarity of an AR(p) process
If $\Phi(z)\ne 0$ for $|z|\leq 1$, then $\{X_t\}$ is stationary and causal, and admits the causal representation $X_t=\mu+\Psi(L)\epsilon_t=\mu+\sum_{j=0}^\infty\psi_j\epsilon_{t-j}$
where
- $\mu = \frac{\phi_0}{\Phi(1)}$
- $\Psi(z)=1+\psi_1z+\psi_2z^2+…=\Phi^{-1}(z)=1/\Phi(z)$
- For AR(1), $\psi_j=\phi^j$ by Taylor Expansion
- For AR(2), $\psi_0=1,\psi_1=\phi_1,\psi_j=\phi_1\psi_{j-1}+\phi_2\psi_{j-2}$
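Stationarity only requires $\Phi(z)\ne 0$ for $|z|=1$, so we can find a stationary (but non-causal) solution of the AR(1) with $|\phi|>1$
The equation $Y_t=\phi Y_{t-1}+\epsilon_t$ can be rewritten as
$Y_{t-1}=\phi^{-1}Y_t-\phi^{-1}\epsilon_t$
and, substituting forward, we get
$Y_t=-\sum_{j=1}^\infty\phi^{-j}\epsilon_{t+j}$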
ARMA models
Autocorrelation Function
Causal Processes
Invertible Processes
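We have
$\Phi(L)Y_t=\Theta(L)\epsilon_t$
where
$\Phi(L)=1-\phi_1L-\dots-\phi_pL^p,\quad \Theta(L)=1+\theta_1L+\dots+\theta_qL^q$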
$\{Y_t\}$ is stationary and causal if and only if
$\Phi(z)\ne 0$ for all $|z|\leq1$
and the coefficients of the causal representation $Y_t=\sum_{j=0}^\infty\psi_j\epsilon_{t-j}$ are determined by the relationship
$\Psi(z)=\sum_{j=0}^\infty\psi_jz^j=\frac{\Theta(z)}{\Phi(z)}$
$\{Y_t\}$ is invertible if and only if
$\Theta(z)\ne 0$ for all $|z|\leq1$
and the coefficients of the representation $\epsilon_t=\sum_{j=0}^\infty\pi_jY_{t-j}$ are determined by the relationship
$\Pi(z)=\sum_{j=0}^\infty\pi_jz^j=\frac{\Phi(z)}{\Theta(z)}$
ARMA(p,q) models are often preferred for reasons of parsimony:
$\frac{\Theta(z)}{\Phi(z)}=1+\sum_{j=1}^\infty\psi_j^{(p,q)}z^j$
For a quadratic polynomial $az^2+bz+c=0$ the roots are $z=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$
On the representation of ARMA processes
We have $\Psi(z)=\frac{\Theta(z)}{\Phi(z)}$
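Rewrite as $\Psi(z)\Phi(z)=\Theta(z)$, where $\Phi(z)$ and $\Theta(z)$ have constant term 1, i.e.
$(\psi_0+\psi_1z+\psi_2z^2+\dots)(1-\phi_1z-\dots-\phi_pz^p)=1+\theta_1z+\dots+\theta_qz^q$
Evaluating at $z=0$, we find $\psi_0=1$
Taking the first derivative
$\Psi'(z)\Phi(z)+\Psi(z)\Phi'(z)=\Theta'(z)$
Evaluating at $z=0$, we find $\psi_1-\phi_1\psi_0=\theta_1$, so $\psi_1=\phi_1+\theta_1$
Taking the derivative again
$\Psi''(z)\Phi(z)+2\Psi'(z)\Phi'(z)+\Psi(z)\Phi''(z)=\Theta''(z)$
Evaluating at $z=0$, we find $2\psi_2-2\phi_1\psi_1-2\phi_2\psi_0=2\theta_2$, so $\psi_2=\theta_2+\phi_2+\phi_1^2+\phi_1\theta_1$
Repeating the step above, we can find
$\psi_j=\theta_j+\sum_{k=1}^{\min(j,p)}\phi_k\psi_{j-k}$, with $\theta_j=0$ for $j>q$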
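The coefficient matching above amounts to the recursion $\psi_0=1$, $\psi_j=\theta_j+\sum_{k=1}^{\min(j,p)}\phi_k\psi_{j-k}$ (with $\theta_j=0$ for $j>q$), which is easy to check numerically. The sketch below implements that recursion; the function name and the parameter values are hypothetical.

```python
def psi_weights(phi, theta, n):
    """Coefficients psi_0, ..., psi_n of Theta(z) / Phi(z).

    phi   = [phi_1, ..., phi_p]   with Phi(z)   = 1 - phi_1 z - ... - phi_p z^p
    theta = [theta_1, ..., theta_q] with Theta(z) = 1 + theta_1 z + ... + theta_q z^q
    """
    psi = [1.0]
    for j in range(1, n + 1):
        value = theta[j - 1] if j <= len(theta) else 0.0
        for k in range(1, min(j, len(phi)) + 1):
            value += phi[k - 1] * psi[j - k]
        psi.append(value)
    return psi

# Hypothetical ARMA(2,2): compare psi_2 with theta_2 + phi_2 + phi_1^2 + phi_1*theta_1.
phi, theta = [0.5, 0.2], [0.4, 0.1]
psi = psi_weights(phi, theta, n=4)
psi2_closed_form = theta[1] + phi[1] + phi[0] ** 2 + phi[0] * theta[0]
```

For an AR(1) with `theta=[]` the same function returns $\psi_j=\phi^j$, in line with the Taylor-expansion result stated earlier.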
Lab:
ACF/PACF confidence interval: $\pm 1.96/\sqrt{T}$
Persistence: how long a shock takes to die out (long when $\theta \to 1$)
Self-Study Questions:
【CHP 6:Q8】Comment on ‘Given that the objective of any econometric modelling exercise is to find the model that most closely ‘fits’ the data, then adding more lags to an ARMA model will almost invariably lead to a better fit. Therefore a large model is best because it will fit the data more closely.’
In most financial series, there is a substantial amount of ‘noise’. This can be interpreted as a number of random events that are unlikely to be repeated in any forecastable way. We want to fit a model to the data which will be able to ‘generalise’. In other words, we want a model which fits to features of the data which will be replicated in future; we do not want to fit to sample-specific noise.
This is why we need the concept of ‘parsimony’ – fitting the smallest possible model to the data. Otherwise we may get a great fit to the data in the sample, but any use of the model for forecasts could yield terrible results.
Another important point is that the larger the number of estimated parameters (i.e., the more variables we have), then the smaller will be the number of degrees of freedom, and this will imply that coefficient standard errors will be larger than they would otherwise have been. This could lead to a loss of power in hypothesis tests, and variables that would otherwise have been significant are now insignificant.
【CHP 6:Q9】You obtain the following sample autocorrelations and partial autocorrelations for a sample of 100 observations from actual data:
Lag | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
acf | 0.420 | 0.104 | 0.032 | −0.206 | −0.138 | 0.042 |
pacf | 0.632 | 0.381 | 0.268 | 0.199 | 0.205 | 0.101 |
MA(1): the significant lag-4 ACF is a typical wrinkle that one might expect with real data and should probably be ignored.
Use the Ljung–Box Q* test to determine whether the first three autocorrelation coefficients taken together are jointly significantly different from zero.
Unit 3: Forecasting with ARMA models
Reading:
- Brooks, Chapter 6, Sections 6.10.1 - 6.10.8 (Essential)
- Diebold, Chapter 6, Sections 6.7, 6.8, 7.3-7.5 (Essential)
Activities:
- Brooks, Chapter 6, Self-Study Questions 10(a)-(b); 11(a)-(c); 12(a)-(e).
Definition:
- h : forecast horizon
- Univariate information set : $\Omega_T \equiv\{X_t,t\leq T\}$
- The definition of the forecast horizon implies that we are dealing with out of sample forecasts
Forecast types
- Point Forecast: A single number.
- Interval Forecast: A range of values in which we expect the realized value of the series to fall with a given probability. The length of the interval depends on the uncertainty surrounding the point forecast.
- Density Forecast: The conditional probability distribution of the series at a given forecast horizon.
Loss Function
- $\{X_t\}^T_{t=1}$ : Series
- Point forecast : $f_{T,h}=\mathbb{E}(X_{T+h}|\Omega_T)$
- Forecast error : $e_{T,h}\equiv X_{T+h}-f_{T,h}$
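- Mean Squared Error (MSE) : $\mathbb{E}[e_{T,h}^2]$
- The conditional mean minimises the MSE: $\mathbb{E}[X_{T+h}-\mathbb{E}(X_{T+h}|\Omega_T )]^2\leq \mathbb{E}[X_{T+h}-g_{T,h}]^2$ for any other forecast $g_{T,h}$ based on $\Omega_T$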
- Loss function : $l(e_{T,h})$
We will consider $Y_t=X_t-\mu, \mathbb{E}(Y_t)=0$
Forecasting based on lagged $\epsilon$’s
Consider a stationary process $\{Y_t\}$ with Wold representation
$Y_t=\sum_{j=0}^\infty\psi_j\epsilon_{t-j},\quad \epsilon_t\sim WN(0,\sigma^2)$
with $\psi_0=1$ and $\sum_{j=0}^\infty \psi_j^2<\infty$
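Recall that $f_{T,h}=\mathscr{P}[Y_{T+h}|\epsilon_j,j\leq T]=\sum_{j=0}^\infty\beta_j\epsilon_{T-j}$, while $Y_{T+h}=\sum_{j=0}^\infty\psi_j\epsilon_{T+h-j}$
So $\psi_h=\beta_0,\psi_{h+1}=\beta_1,\dots$
The optimal linear forecast takes the form
$f_{T,h}=\sum_{j=0}^\infty\psi_{h+j}\epsilon_{T-j}$
The forecast error takes the form
$e_{T,h}=Y_{T+h}-f_{T,h}=\sum_{j=0}^{h-1}\psi_j\epsilon_{T+h-j}$
with $\mathbb{E}(e_{T,h})=0$ and $\mathbb{Cov}(e_{T,h},\epsilon_j)=0$ for $j\leq T$
We also note that $e_{T,h}$ is an MA($h-1$) process, and
$\mathbb{Var}(e_{T,h})=\sigma^2\sum_{j=0}^{h-1}\psi_j^2\to\sigma^2\sum_{j=0}^\infty\psi_j^2=\mathbb{Var}(Y_t)$
that is, the MSE (risk) approaches $\mathbb{Var}(Y_t)$ when $h \to \infty$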
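$Y_t$ can be rewritten as $Y_t=\Psi(L)\epsilon_t$ with
$\Psi(L)=\sum_{j=0}^\infty\psi_jL^j$
Divide $\Psi(L)$ by $L^h$
$\frac{\Psi(L)}{L^h}=L^{-h}+\psi_1L^{1-h}+\dots+\psi_{h-1}L^{-1}+\psi_h+\psi_{h+1}L+\psi_{h+2}L^2+\dots$
The annihilation operator $[\cdot]_+$ replaces the negative powers of $L$ with zero,
$\left[\frac{\Psi(L)}{L^h}\right]_+=\psi_h+\psi_{h+1}L+\psi_{h+2}L^2+\dots$
Then, the optimal linear forecast can be rewritten as
$f_{T,h}=\left[\frac{\Psi(L)}{L^h}\right]_+\epsilon_T$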
Forecasting based on lagged $Y_t$’s
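If $Y_t$ admits the $AR(\infty)$ representation
$\Pi(L)Y_t=\epsilon_t,\quad \Pi(L)=\Psi^{-1}(L)$
we can obtain the forecast of $Y_{T+h}$ as a function of $\Omega_T$
$f_{T,h}=\left[\frac{\Psi(L)}{L^h}\right]_+\Pi(L)Y_T=\left[\frac{\Psi(L)}{L^h}\right]_+\frac{1}{\Psi(L)}Y_T$
known as the Wiener-Kolmogorov prediction formula.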
Forecasting an AR(1) process
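Let $Y_t=\phi Y_{t-1}+\epsilon_t$ with $|\phi|<1$.
1-step-ahead forecast:
$f_{T,1}=\mathbb{E}(Y_{T+1}|\Omega_T)=\phi Y_T$
2-step-ahead forecast:
$f_{T,2}=\phi f_{T,1}=\phi^2Y_T$
h-step-ahead forecast:
$f_{T,h}=\phi f_{T,h-1}=\phi^hY_T$
We refer to this approach as the chain rule of forecasting
Noting that
$\Psi(L)=\frac{1}{1-\phi L}=\sum_{j=0}^\infty\phi^jL^j$
we have
$\left[\frac{\Psi(L)}{L^h}\right]_+=\phi^h+\phi^{h+1}L+\phi^{h+2}L^2+\dots=\frac{\phi^h}{1-\phi L}$
Hence, using the Wiener-Kolmogorov prediction formula
$f_{T,h}=\frac{\phi^h}{1-\phi L}(1-\phi L)Y_T=\phi^hY_T$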
Forecasting an AR(2) process
By backward substitution we get
The Wiener-Kolmogorov prediction formula gives
The same result can be obtained using the chain rule
Forecasting an MA(1) process
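Note that, for $Y_t=\epsilon_t+\theta\epsilon_{t-1}$ with $|\theta|<1$,
$\Psi(L)=1+\theta L,\quad \left[\frac{\Psi(L)}{L}\right]_+=\theta,\quad \left[\frac{\Psi(L)}{L^h}\right]_+=0$ for $h>1$
Hence,
$f_{T,1}=\frac{\theta}{1+\theta L}Y_T=\theta\epsilon_T,\quad f_{T,h}=0$ for $h>1$
and we can view $\theta\epsilon_T$ as the outcome of the infinite recursion $\epsilon_t=Y_t-\theta\epsilon_{t-1}$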
Interval and Density forecasts
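Let $\{Y_t\}$ be a stationary and invertible Gaussian ARMA process. Then,
$Y_{T+h}|\Omega_T\sim N(f_{T,h},\mathbb{Var}(e_{T,h}))$
By construction,
$\mathbb{Var}(e_{T,h})=\sigma^2\sum_{j=0}^{h-1}\psi_j^2$
The 95% h-step-ahead interval forecast is
$f_{T,h}\pm1.96\sqrt{\mathbb{Var}(e_{T,h})}$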
The h-step-ahead density forecast is $N(f_{T,h},\mathbb{Var}(e_{T,h}))$
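For a Gaussian AR(1) the point, interval and density forecasts reduce to a few lines. The sketch below assumes $\psi_j=\phi^j$, so that $f_{T,h}=\phi^hY_T$ and $\mathbb{Var}(e_{T,h})=\sigma^2\sum_{j=0}^{h-1}\phi^{2j}$; the function name and the numbers are hypothetical.

```python
import math

def ar1_interval_forecast(y_T, phi, sigma2, h, z=1.96):
    """Point forecast, forecast error variance and symmetric interval for a zero-mean AR(1)."""
    point = phi ** h * y_T
    var_e = sigma2 * sum(phi ** (2 * j) for j in range(h))  # sigma^2 * sum_{j<h} psi_j^2
    half_width = z * math.sqrt(var_e)
    return point, var_e, (point - half_width, point + half_width)

# Hypothetical inputs: last observation 2.0, phi = 0.5, innovation variance 1.
point, var_e, interval = ar1_interval_forecast(y_T=2.0, phi=0.5, sigma2=1.0, h=2)
```

As $h$ grows, `point` shrinks towards the unconditional mean (zero here) and `var_e` approaches $\sigma^2/(1-\phi^2)$, the unconditional variance.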
Making the forecasts operational
- For forecasting an AR(p) process, an optimal h-step-ahead forecast based on $\Omega_T$ makes use only of the p most recent values $\{Y_T , Y_{T-1}, \dots, Y_{T-p+1}\}$.
- For an MA or ARMA process we would need to know the whole of $\Omega_T$ to use the Wiener-Kolmogorov prediction formula.
In the MA(1) process, suppose $\theta$ is unknown and has to be estimated
We replace the optimal linear forecast $f_{T,1}=\theta\epsilon_T$ with $\hat{f}_{T,1}=\hat\theta\epsilon_T$
The forecast error will take the form
$\hat{e}_{T,1}=Y_{T+1}-\hat\theta\epsilon_T=\epsilon_{T+1}+(\theta-\hat\theta)\epsilon_T$
$\mathbb{Var}(\hat{e}_{T,1})$ will also account for the variability of $\hat\theta$
MSE
MAE
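Both measures can be computed directly from a set of out-of-sample forecast errors. The sketch below assumes the usual sample definitions (the average squared error and the average absolute error over $N$ forecasts); the function names and the data are hypothetical.

```python
def mse(actual, forecast):
    """Mean squared error: average of squared forecast errors."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(e * e for e in errors) / len(errors)

def mae(actual, forecast):
    """Mean absolute error: average of absolute forecast errors."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical realised values and the forecasts made for them.
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.5, 1.5, 3.0, 5.0]
mse_value = mse(actual, forecast)
mae_value = mae(actual, forecast)
```

Because the errors are squared, the MSE penalises the single large miss (the last forecast) more heavily than the MAE does.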
Lab :
or
Self-Study Questions:
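【CHP 6:Q12.a】Compared with financial data, a drawback of macroeconomic explanatory variables is that they are usually available only on a quarterly or at best monthly basis, i.e. at low frequency.
【CHP 6:Q12.c】If the PACF coefficients of a series are jointly significantly different from zero but the PACF shows no clear cut-off, while the ACF does show a clear cut-off, the real data may not belong to the ARMA family; we should nevertheless try to find the best model.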
Unit 4: ARIMA models for nonstationary time series
Reading:
- Brooks, Chapter 8, Sub-sections 8.1.1 - 8.1.4 (Essential)
- Brooks, Chapter 8, Sub-sections 8.1.5-8.1.7, Section 8.2(Further)
Activities:
- Brooks, Chapter 8, Self-Study Questions 1-3.
Definition:
To deal with such nonstationarity we begin by characterizing a time series as
$X_t=\mu_t+U_t$
- $U_t$: a zero-mean stationary process
- $\mu_t=\sum_{j=0}^m\beta_jt^j$: such a trend is said to be deterministic.
The deterministic trend can either be estimated or be removed by transformation
If $m=1, U_t=\epsilon_t$, then $X_t=\beta_0+\beta_1t+\epsilon_t$ and $\Delta X_t=\beta_1+\Delta\epsilon_t$
The MA process $\Delta\epsilon_t$ is stationary but not invertible
Stochastic Trends
Increasing trends alternate with decreasing trends
A process which is not stationary in mean could be modelled as an ARMA where $\Phi(z)=0$ for some $|z| \leq 1$.
- These processes are characterized by stochastic trends. More later (Beveridge-Nelson Decomposition)
Random Walk
The representation
$X_t=X_{t-1}+\epsilon_t=X_0+\sum_{j=1}^t\epsilon_j,\quad \epsilon_t\sim WN(0,\sigma^2)$
- $X_0\in\mathbb{R}$ denotes the initial value
The “memory” of the process is non-decreasing
- For $t$ large compared to $h$, $\mathbb{Corr}(X_t , X_{t+h})\simeq 1$.
The 1-step-ahead forecast of the random walk is
$f_{T,1}=X_T$
- Similarly, $f_{T,2}=X_T$
The h-step-ahead forecast error is
$e_{T,h}=X_{T+h}-X_T=\sum_{j=1}^h\epsilon_{T+j}$
- so that $\mathbb{Var}[e_{T,h}]=h\sigma^2$, which diverges to $\infty$ when $h\to\infty$.
Random Walk with Drift
If $X_t$ denotes the log price of a stock, the random walk hypothesis entails that
If $\Delta X_t=\mu+\epsilon_t$, then
$X_t=X_0+\mu t+\sum_{j=1}^t\epsilon_j$
- $\{X_t\}$ is said to be random walk with drift
Unit root processes
The random walk can be represented as
$(1-L)X_t=\Delta X_t=\epsilon_t$
- Because $\Delta=(1-L)$, the root of the characteristic equation is one
Consider the more general example
The lag polynomial can be factorized as
The process $\{\Delta X_t\}$ satisfying
Integrated processes
A time series $\{X_t\}$ is said to be integrated of order $d$, written $X_t\sim I(d)$, if $\Delta^dX_t$ is $I(0)$
- $\{X_t\}\sim I(d)$ is sometimes said to be difference stationary
Stochastic processes integrated of order 0
A time series satisfying $X_t-\mathbb{E}(X_t)=\sum_{j=0}^\infty\psi_j\epsilon_{t-j}\equiv\Psi(L)\epsilon_t$ is said to be integrated of order zero, if
$\sum_{j=0}^\infty|\psi_j|<\infty$ and $\Psi(1)=\sum_{j=0}^\infty\psi_j\ne0$
Let $|\theta|<1$
- If $X_t=X_{t-1}+\epsilon_t-\theta\epsilon_{t-1}$, then $\Delta X_t=(1-\theta L)\epsilon_t$ and
$\Psi(1)=1-\theta\ne0$
Hence, $\Delta X_t$ is an $I(0)$ process
- If $X_t=\epsilon_t-\theta\epsilon_{t-1}$, then $\Delta X_t=(1-L)(1-\theta L)\epsilon_t$ and
$\Psi(1)=(1-1)(1-\theta)=0$
Hence, $\Delta X_t$ is not an $I(0)$ process
ARIMA(p,d,q) process
A time series $\{X_t\}$ is called an ARIMA process of order (p,d,q), written $\{X_t\}\sim ARIMA(p,d,q)$, if
$\Delta^dX_t=(1-L)^dX_t$
is a stationary and invertible ARMA(p,q) process.
The Beveridge-Nelson Decomposition
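Let $X_t$ be an ARIMA(p,1,q) with $\Delta X_t=\mu+\Psi(L)\epsilon_t$ and $\Psi(L)\equiv\Phi^{-1}(L)\Theta(L)$. Then,
$X_t=\mu t+\Psi(1)\sum_{j=1}^t\epsilon_j+\Psi^*(L)\epsilon_t+k_0$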
where
- $\mu t$ is the deterministic (linear) trend
- $\Psi(1)\sum_{j=1}^t\epsilon_j$ is the stochastic trend
- $\Psi^*(L)=\sum_{j=0}^\infty\psi_j^* L^j$, $\psi_j^*=-\sum_{k=j+1}^\infty\psi_k$
- $k_0$ denotes the initial condition
Characteristics of I(0) processes
- $\mathbb{Var}(X_t)$ is finite and does not depend on $t$.
- The innovation $\epsilon_t$ has a temporary effect on $X_t$.
- The expected length of time between crossing of $\mu$ is finite, so that $X_t$ fluctuates around its mean $\mu$.
- The autocorrelation $\rho(h)$ decreases in magnitude for large enough $h$, so their sum is finite.
Characteristics of I(1) processes
- $\mathbb{Var}(X_t)$ goes to infinity as $t$ goes to infinity.
- The innovation $\epsilon_t$ has a permanent effect on $X_t$.
- The process is not mean-reverting.
- The autocorrelation $\rho(h)\to 1$ for all $h$ as $t\to\infty$.
Unit roots testing
Dickey Fuller Test
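Main idea: test of $\phi=1$ in the regression model
$X_t=\phi X_{t-1}+\epsilon_t$
against the one-sided alternative $\phi<1$.
If we rewrite the regression model as
$\Delta X_t=(\phi-1)X_{t-1}+\epsilon_t$
and assume that $|\phi|\leq1$ ($|\phi|>1$ ruled out),
we can test the familiar hypothesis
$H_0:\phi-1=0$ versus $H_1:\phi-1<0$
The standard t-test for statistical significance is based on
$\hat\tau=\frac{\hat\phi-1}{se(\hat\phi)}$
which under the null follows the Dickey-Fuller distribution rather than the usual $t$ distribution.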
The Augmented Dickey Fuller (ADF) Test
The tests above are only valid if $\{\epsilon_t\}\sim WN(0,\sigma^2)$. An “overly parsimonious” specification will result in autocorrelated errors.
If $X_t$ satisfies $(1-\sum_{j=1}^p\phi_jL^j)X_t=\epsilon_t$, the following always holds:
$\Delta X_t=-\lambda X_{t-1}+\sum_{j=1}^{p-1}\phi_j^*\Delta X_{t-j}+\epsilon_t$
where $\lambda=1-\phi_1-\dots-\phi_p, \phi_j^*=-\sum_{i=j+1}^p\phi_i$
- $p$ should be large enough to ensure that $\epsilon_t$ is white noise
Example
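If we have
$X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+\phi_3X_{t-3}+\epsilon_t$
This series is I(1) if $(1-\phi_1-\phi_2-\phi_3)=0$
We construct the regression
$\Delta X_t=-(1-\phi_1-\phi_2-\phi_3)X_{t-1}-(\phi_2+\phi_3)\Delta X_{t-1}-\phi_3\Delta X_{t-2}+\epsilon_t$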
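The ADF reparameterisation is pure algebra, so it can be checked numerically. The sketch below assumes an AR(3) in levels and the ADF form $\Delta X_t=-\lambda X_{t-1}+\phi_1^*\Delta X_{t-1}+\phi_2^*\Delta X_{t-2}+\epsilon_t$ with $\lambda=1-\phi_1-\phi_2-\phi_3$, $\phi_1^*=-(\phi_2+\phi_3)$ and $\phi_2^*=-\phi_3$; the function names and numbers are hypothetical.

```python
def ar3_levels(x1, x2, x3, phi, eps):
    """X_t from the AR(3) in levels, given X_{t-1}, X_{t-2}, X_{t-3}."""
    return phi[0] * x1 + phi[1] * x2 + phi[2] * x3 + eps

def ar3_adf_form(x1, x2, x3, phi, eps):
    """X_t rebuilt from the ADF form: X_{t-1} plus the implied first difference."""
    lam = 1 - sum(phi)
    phi1_star = -(phi[1] + phi[2])
    phi2_star = -phi[2]
    delta = -lam * x1 + phi1_star * (x1 - x2) + phi2_star * (x2 - x3) + eps
    return x1 + delta

# Hypothetical coefficients that sum to one, so lambda = 0 and the series has a unit root.
phi = [0.5, 0.3, 0.2]
levels_value = ar3_levels(1.0, 2.0, -1.0, phi, eps=0.1)
adf_value = ar3_adf_form(1.0, 2.0, -1.0, phi, eps=0.1)
unit_root_lambda = 1 - sum(phi)
```

The two functions return the same value for any inputs, which is why testing $\lambda=0$ in the ADF regression is equivalent to testing for a unit root in the original AR(p).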
Problems with Unit Root tests
- Reject the $I(1)$ null too often when it is true, if $\Delta Y_t$ is an ARMA(p,q) with a large and negative MA component.
- Low power against $I(0)$ alternatives that are close to being $I(1)$ .
- Fail to reject $I(1)$ when $Y_t$ is $I(0)$ around a trend function with a break.
Self-Study Questions:
【CHP 8:Q1.b】Why is it in general important to test for non-stationarity in time series data before attempting to build an empirical model?
If two series are non-stationary, we may experience the problem of ‘spurious’ regression. This occurs when we regress one non-stationary variable on a completely unrelated non-stationary variable, but obtain a reasonably high value of $R^2$, apparently indicating that the model fits well.
Most importantly therefore, we are not able to perform any hypothesis tests in models which inappropriately use non-stationary data since the test statistics will no longer follow the distributions which we assumed they would (e.g., t or F), so any inferences we make are likely to be invalid.
Unit 5: Conditional Heteroskedasticity
Reading:
- Brooks, Chapter 9, Sections 9.1-9.8, 9.10-9.14 (Essential)
- Diebold, Chapter 8 (Recommended)
- Robert Engle (2001), GARCH 101: The Use of ARCH/GARCH Models in Applied Econometrics (Further)
Activities:
- Brooks, Chapter 9, Self-Study Questions, Question 1 (except (f)), Question 3(a) and Question 5
Definition:
The Jarque-Bera statistic is a method for testing whether a sample can be regarded as coming from a normal population
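As a rough sketch of how the test works, the statistic combines sample skewness $S$ and kurtosis $K$ as $JB=\frac{T}{6}\left(S^2+\frac{(K-3)^2}{4}\right)$, which is asymptotically $\chi^2(2)$ under normality (5% critical value about 5.99). The function name and the toy sample below are hypothetical.

```python
def jarque_bera(x):
    """Sample skewness, kurtosis and the Jarque-Bera statistic of the list x."""
    T = len(x)
    mean = sum(x) / T
    m2 = sum((v - mean) ** 2 for v in x) / T  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / T  # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / T  # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = T / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return skew, kurt, jb

# Hypothetical toy sample; real applications use return or residual series.
skew, kurt, jb = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
```

With so few observations the asymptotic distribution is unreliable; the example only illustrates the arithmetic.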
Volatility models
Let $\mathbb{E}(r_t|\Omega_{t-1})\equiv \mu_t$ and define $\epsilon_t\equiv r_t-\mu_t$, with
$\epsilon_t=\sigma_t\eta_t$
where $\sigma_t$ is a deterministic function of $\Omega_{t-1}$, $\sigma_t>0$, $\{\eta_t\}$ is i.i.d.(0,1), $\sigma_t\in\Omega_{t-1}, \eta_t\perp \Omega_{t-1}$
So with $\eta_t\sim N(0,1)$, $\epsilon_t|\Omega_{t-1}\sim N(0,\sigma_t^2)$
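The conditional distribution of $\epsilon_t$ therefore changes over time: the larger $\sigma_t^2$, the flatter the normal density
Thickness of the tails is measured by the kurtosis
$\mathcal{K}_\epsilon=\frac{\mathbb{E}(\epsilon_t^4)}{[\mathbb{E}(\epsilon_t^2)]^2}=\mathcal{K}_\eta\frac{\mathbb{E}(\sigma_t^4)}{[\mathbb{E}(\sigma_t^2)]^2}\geq\mathcal{K}_\eta$
Kurtosis reflects how sharply peaked the distribution is
If $\sigma_t^2=\sigma^2$, then $\mathcal{K}_\epsilon=\mathcal{K}_\eta$
The normal distribution has kurtosis 3; if the kurtosis of the residuals is much higher, this points to heteroskedasticity
Fat tails can be modelled by a leptokurtic (sharp-peaked, fat-tailed) distribution of $\{\eta_t\}$ and/or variability of $\{\sigma_t^2\}$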
The ARCH(p) models
$\epsilon_t=\sigma_t\eta_t,\quad \sigma_t^2=\omega+\sum_{i=1}^p\alpha_i\epsilon_{t-i}^2$
- $\mathbb{E}(\epsilon_t|\Omega_{t-1})=\sigma_t\mathbb{E}(\eta_t|\Omega_{t-1})=0$, implying that, for $s\ne t$
- $\epsilon_t,\epsilon_s$ are uncorrelated, but NOT independent!
- $\mathbb{Var}(\epsilon_t|\Omega_{t-1})=\sigma_t^2\mathbb{E}(\eta_t^2|\Omega_{t-1})=\sigma_t^2$
Volatility Clustering
- Large past squared shocks $\epsilon_{t-i}^2$ $(i>0)$ imply a large conditional variance $\sigma_t^2$ for $\epsilon_t$
- The magnitude of the noise is a function of its past values
- If $\mathbb{Var}(\sigma_t^2)\ne 0$, then $\mathcal{K}_\epsilon>\mathcal{K}_\eta$ (excess kurtosis)
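ARCH(1)
$\sigma_t^2=\omega+\alpha\epsilon_{t-1}^2$
With $\omega,\alpha>0$
Defining $v_t=\epsilon_t^2-\sigma_t^2=\sigma_t^2(\eta_t^2-1)$, the squared shocks follow the AR(1)
$\epsilon_t^2=\omega+\alpha\epsilon_{t-1}^2+v_t$
so that, for $\alpha<1$,
$\mathbb{E}(\epsilon_t^2)=\frac{\omega}{1-\alpha}$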
If $\alpha<1$, then $\epsilon_t$ is weakly stationary.
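The GARCH(p,q) models
Let
$\epsilon_t=\sigma_t\eta_t,\quad \sigma_t^2=\omega+\sum_{i=1}^p\alpha_i\epsilon_{t-i}^2+\sum_{j=1}^q\beta_j\sigma_{t-j}^2$
Then
$\mathbb{E}(\epsilon_t|\Omega_{t-1})=0$
and
$\mathbb{Var}(\epsilon_t|\Omega_{t-1})=\sigma_t^2$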
ARMA representation of GARCH
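The original equation can be rewritten as
$\epsilon_t^2=\omega+\sum_{i=1}^{\max(p,q)}(\alpha_i+\beta_i)\epsilon_{t-i}^2+v_t-\sum_{j=1}^q\beta_jv_{t-j}$
- using the convention $\alpha_i=0$ ($\beta_i=0$) if $i>p$ ($i>q$)
The $\sigma_t^2$ terms have been replaced using $v_t=\epsilon_t^2-\sigma_t^2$
Bollerslev (1986), Theorem 1, shows that if
$\sum_{i=1}^p\alpha_i+\sum_{j=1}^q\beta_j<1$
then $\epsilon_t$ is weakly stationary, and $\{\epsilon_t^2\}\sim$ ARMA(max(p,q),q)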
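GARCH(1,1)
$\sigma_t^2=\omega+\alpha\epsilon_{t-1}^2+\beta\sigma_{t-1}^2$
and the autocorrelation function of $\epsilon_t^2$ is
$\rho(1)=\frac{\alpha(1-\alpha\beta-\beta^2)}{1-2\alpha\beta-\beta^2},\quad \rho(h)=(\alpha+\beta)^{h-1}\rho(1)$ for $h>1$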
GARCH models are often estimated using the maximum likelihood estimator based on the Gaussian Likelihood (GMLE)
- If $\eta_t\sim NID(0,1)$, then $\epsilon_t|\Omega_{t-1}\sim N(0,\sigma_t^2)$
Specification strategy for GARCH models
“Specific-to-general” approach
Specify an adequate model for $\mu_t=\mathbb{E}(r_t|\Omega_{t-1})$.
Test for the presence of conditional heteroskedasticity;
Select p,q and estimate GARCH models (use IC);
Evaluate the model by misspecification tests, e.g. $\epsilon_t/\hat\sigma_t$ should behave like an i.i.d. sequence.
Estimate a more suitable GARCH model (if necessary);
Extensions of the GARCH models
Drawbacks of GARCH(p,q) Models:
Non-negativity constraints may be violated.
Symmetric response to past shocks.
The leverage effect
Large price falls tend to be followed by higher volatility than price rises of the same size
- Negative shocks appear to contribute more to stock market volatility than do positive shocks. (stylized fact )
- A negative shock to the market value of equity increases the debt/equity ratio (other things the same), increasing leverage.
- The asymmetric response of the volatility to past shocks is known as leverage effect.
The EGARCH (Exponential GARCH) Models
where
$\mathbb E|\eta|=\sqrt{2/\pi}$
- $\sigma_t^2$ will be positive (without imposing non-negativeness restrictions on the parameters)
- The asymmetry property is taken into account through $\theta$.
- The leverage effect will imply that $\theta<0$.
- Innovation of large modulus should increase the volatility, entailing that we expect
An Example
If $\eta_{t-1}<0$ (i.e. $\epsilon_{t-1}<0$), the variable $\ln(\sigma_t^2)$ will be larger than its mean $\omega$.
If $\eta_{t-1}>0$ (i.e. $\epsilon_{t-1}>0$), the variable $\ln(\sigma_t^2)$ will be smaller than its mean $\omega$.
Thus, we obtain the typical asymmetry property of financial time series.
The Threshold GARCH (TGARCH) model
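The GJR-GARCH (Glosten, Jagannathan and Runkle) models
$\sigma_t^2=\omega+\sum_{i=1}^p(\alpha_i+\gamma_iI(\epsilon_{t-i}<0))\epsilon_{t-i}^2+\sum_{j=1}^q\beta_j\sigma_{t-j}^2$, where $I(\cdot)$ is the indicator function
The asymmetry is accounted for through the $\gamma_i$
$\omega>0,\alpha_i\ge0,(\alpha_i+\gamma_i)\ge0,\beta_j\ge0$ are sufficient for $\sigma_t^2>0$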
News Impact Curves (NIC)
- The news impact curve plots $\sigma_t^2$ against $\epsilon_{t-1}$, setting $\sigma_{t-h}^2=\sigma^2$, for $h>0$
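A news impact curve can be traced out by evaluating the variance equation over a grid of shocks while holding lagged variance fixed. The sketch below compares a GARCH(1,1) with a GJR-GARCH(1,1), assuming the GJR indicator switches on for negative shocks; the function names, parameter values and the fixed level `sigma2_bar` are all hypothetical.

```python
def nic_garch(eps, omega, alpha, beta, sigma2_bar):
    """sigma_t^2 as a function of eps_{t-1} for GARCH(1,1), lagged variance held at sigma2_bar."""
    return omega + alpha * eps ** 2 + beta * sigma2_bar

def nic_gjr(eps, omega, alpha, gamma, beta, sigma2_bar):
    """Same curve for GJR-GARCH(1,1): negative shocks get the extra weight gamma."""
    indicator = 1.0 if eps < 0 else 0.0
    return omega + (alpha + gamma * indicator) * eps ** 2 + beta * sigma2_bar

# Hypothetical parameter values chosen only for illustration.
omega, alpha, gamma, beta, sigma2_bar = 0.1, 0.05, 0.1, 0.8, 1.0
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
garch_curve = [nic_garch(e, omega, alpha, beta, sigma2_bar) for e in grid]
gjr_curve = [nic_gjr(e, omega, alpha, gamma, beta, sigma2_bar) for e in grid]
```

The GARCH curve is symmetric around zero, while the GJR curve is steeper on the negative side whenever `gamma > 0`, which is the leverage effect in graphical form.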
Forecasting
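Consider the AR(1)-GARCH(1,1) model
$X_t=\phi X_{t-1}+\epsilon_t,\quad \epsilon_t=\sigma_t\eta_t,\quad \sigma_t^2=\omega+\alpha\epsilon_{t-1}^2+\beta\sigma_{t-1}^2$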
with $\omega>0,\alpha,\beta\ge0,\alpha+\beta<1,|\phi|<1$
Recall that $f_{T,h}=\phi^hX_T$
and
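The 1-step-ahead forecast of $\sigma_{T+1}^2$
$f_{T,1}^{\sigma^2}=\sigma_{T+1}^2=\omega+\alpha\epsilon_T^2+\beta\sigma_T^2$
GARCH models are deterministic volatility models.
If the parameters are known, $\sigma_{T+1}^2$ is a deterministic function of $\Omega_T$.
The 2-step-ahead forecast
$f_{T,2}^{\sigma^2}=\mathbb{E}(\sigma_{T+2}^2|\Omega_T)=\omega+(\alpha+\beta)\sigma_{T+1}^2$
because
$\mathbb{E}(\epsilon_{T+1}^2|\Omega_T)=\sigma_{T+1}^2$
In general, the h-step-ahead forecast
$f_{T,h}^{\sigma^2}=\omega+(\alpha+\beta)f_{T,h-1}^{\sigma^2}$
By recursive substitutions it follows that
$f_{T,h}^{\sigma^2}=\omega\sum_{j=0}^{h-2}(\alpha+\beta)^j+(\alpha+\beta)^{h-1}\sigma_{T+1}^2$
- If $(\alpha+\beta)<1$, $\lim_{h\to\infty}f_{T,h}^{\sigma^2}=\sigma^2$, where
$\sigma^2=\frac{\omega}{1-\alpha-\beta}$
- If $(\alpha+\beta)=1$ (IGARCH)
$f_{T,h}^{\sigma^2}=(h-1)\omega+\sigma_{T+1}^2$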
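The variance forecasts can be generated recursively. The sketch below assumes the standard GARCH(1,1) recursion, $\sigma_{T+1}^2=\omega+\alpha\epsilon_T^2+\beta\sigma_T^2$ for the first step and $f_{T,h}^{\sigma^2}=\omega+(\alpha+\beta)f_{T,h-1}^{\sigma^2}$ thereafter; the function name and parameter values are hypothetical.

```python
def garch11_variance_forecasts(omega, alpha, beta, eps_T, sigma2_T, horizon):
    """Conditional variance forecasts for horizons 1, ..., horizon."""
    forecasts = [omega + alpha * eps_T ** 2 + beta * sigma2_T]  # known at time T
    for _ in range(horizon - 1):
        forecasts.append(omega + (alpha + beta) * forecasts[-1])
    return forecasts

# Hypothetical parameters with alpha + beta = 0.9 < 1, after a large shock at time T.
omega, alpha, beta = 0.1, 0.1, 0.8
forecasts = garch11_variance_forecasts(omega, alpha, beta, eps_T=2.0, sigma2_T=1.0, horizon=200)
long_run_variance = omega / (1 - alpha - beta)
```

With `alpha + beta < 1` the forecasts decay geometrically towards `long_run_variance`; setting `alpha + beta = 1` (IGARCH) instead makes them grow linearly in the horizon when `omega > 0`.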
Heteroskedasticity and interval forecasts
The $100 \times (1 - \alpha)\%$ interval forecast for $X_{T+h}$ takes the form
It is usual to construct interval forecasts that are symmetric around the conditional mean of $X_{T+h}$
Self-Study Questions:
【CHP 9:Q1.d】Describe two extensions to the original GARCH model. What additional characteristics of financial data might they be able to capture?
EGARCH, GJR or GARCH-M. The first two of these are designed to capture leverage effects, i.e. asymmetries in the response of volatility to positive and negative returns. The EGARCH model has the added benefit that it is expressed in terms of the log of the conditional variance ($h_t$ in Brooks's notation), so that even if the parameters are negative, the conditional variance will always be positive. The GARCH-M model allows the lagged value of the conditional variance to affect the return.
Unit 6: Multivariate Time Series
Reading:
- Brooks, Chapter 1, Section 1.7 (Essential - Week 6)
- Brooks, Chapter 7, Sections 7.10 - 7.16 (Essential - Week 7)
- Diebold, Chapter 16 (Recommended)
Activities:
- Brooks, Chapter 1, Self-Study Questions 21, 22.
Definition:
We consider $n$ possibly related time series $X_{1t}, \dots, X_{nt}$, and define the $n$-vector
$X_t\equiv[X_{1t},\dots,X_{nt}]'$
Weak Stationarity
- $\mathbb E(X_t)\equiv \mu < \infty$ for all $t$
- $\mathbb E[(X_t-\mu)(X_t-\mu)']\equiv\Gamma(0)<\infty$ for all $t$
- $\mathbb E[(X_t-\mu)(X_{t-h}-\mu)']\equiv\Gamma(h)$ for all $t$
The expectation of a vector
The variance-Covariance matrix
When $h=0$
When $h\ne0$
Diagonal matrix
Correlation Matrix
The multivariate white noise
Let $\epsilon_t\equiv[\epsilon_{1t},\dots,\epsilon_{nt}]'$
- $\mathbb E(\epsilon_t)=0$
- $\Gamma(h)=\mathbb E(\epsilon_t\epsilon_{t-h}')=\begin{cases}\Sigma&\text{for } h=0\\0&\text{for } h\ne 0\end{cases}$
For individual shocks
- $\mathbb E(\epsilon_{it})=0$
- $\mathbb E(\epsilon_{it}^2)=\sigma_i^2$
- $\mathbb E(\epsilon_{it}\epsilon_{i,t-h})=0$ for $h\ne0$
We do allow for contemporaneous correlation
- $\mathbb E(\epsilon_{it}\epsilon_{jt})=\sigma_{ij}$
- $\mathbb E(\epsilon_{it}\epsilon_{j,t-h})=0$ for $h\ne0$
Vector Autoregressive Processes (VAR)
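Let $Y_t\equiv X_t-\mu$. A VAR(p) process satisfies the equation
$\Phi(L)Y_t=\epsilon_t,\quad \Phi(L)\equiv I_n-\Phi_1L-\dots-\Phi_pL^p$
where $\epsilon_t$ is multivariate white noise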
and $I_n$ denotes the n−dimensional identity matrix.
【Example】
For $n = 2, p = 2$
Stationarity of a VAR
Consider the VAR(1) process
$Y_t=\Phi_1Y_{t-1}+\epsilon_t$
By recursively substituting
$Y_t=\Phi_1^tY_0+\sum_{j=0}^{t-1}\Phi_1^j\epsilon_{t-j}$
when $max_{j=1,…,n}|\lambda_j(\Phi_1)|<1$, $\Phi_1^t\to 0$ as $t\to\infty$
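The previous condition can be restated as
$\det(I_n-\Phi_1z)\ne0$ for all $|z|\le1$
which is equivalent to requiring that all roots of $\det(I_n-\Phi_1z)=0$ lie outside the unit circle.
If the above conditions are satisfied
$Y_t=\sum_{j=0}^\infty\Phi_1^j\epsilon_{t-j}$
A VAR(p) process is stationary if
$\det(I_n-\Phi_1z-\dots-\Phi_pz^p)\ne0$ for all $|z|\le1$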
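For a bivariate VAR(1) the eigenvalue condition $\max_j|\lambda_j(\Phi_1)|<1$ can be checked with the quadratic formula. The sketch below does this for two coefficient matrices; the function names and both matrices are hypothetical.

```python
import cmath

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)  # complex sqrt handles complex eigenvalues
    return (trace + root) / 2, (trace - root) / 2

def is_stationary_var1(phi1):
    """True if every eigenvalue of Phi_1 lies strictly inside the unit circle."""
    return max(abs(lam) for lam in eigenvalues_2x2(phi1)) < 1

# Hypothetical coefficient matrices.
stationary_phi = [[0.5, 0.1], [0.2, 0.3]]
unit_root_phi = [[0.5, 0.5], [0.25, 0.75]]  # eigenvalues 1 and 0.25

stationary_ok = is_stationary_var1(stationary_phi)
unit_root_ok = is_stationary_var1(unit_root_phi)
```

For a VAR(p) the same check would be applied to the companion matrix, which requires a general eigenvalue routine rather than the 2x2 shortcut used here.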
Remarks
The VAR(p) processes belong to the class of VARMA(p,q) processes
$\Phi(L)Y_t=\Theta(L)\epsilon_t$
where
$\Theta(L)\equiv I_n+\Theta_1L+\dots+\Theta_qL^q$
VARMA(p,q) processes are difficult to estimate when $q > 0$. VAR models are usually used in empirical applications.
Applying VAR models
Suppose that the appropriate order p for the VAR model has been found, that is, a VAR(p-1) is misspecified and a VAR(p+1) contains redundant parameters, and that the parameters have been estimated.
The resultant empirical model can be used for various purposes.
- Out-of-sample forecasting.
- Granger causality analysis.
- Structural analysis (impulse response function and error variance decomposition).
Forecasting
Consider the VAR(1) model. The h-step forecast at time T is given by
$f_{T,h}=\Phi_1^hY_T$
The forecast error covariance matrix equals
$\mathbb{E}(e_{T,h}e_{T,h}')=\sum_{j=0}^{h-1}\Phi_1^j\Sigma(\Phi_1^j)'$
【Example】
Granger Causality
An important statistical notion of causality that’s intimately related to forecasting is based on two key principles
Cause should occur before effect.
A causal series should contain information useful for forecasting that is not available in other series.
We can say that $X$ Granger-causes $Y$, “$X \Rightarrow Y$”, if
If $Z_t=[X_t,Y_t]'$ can be represented as a bivariate VAR
then $X\not\Rightarrow Y$ if $\phi_{12}(L)\equiv 1+\sum_{j=1}^p\phi_{12,j}L^j=0$
The equation for $Y_t$
then, the hypothesis $X\not\Rightarrow Y$ is equivalent to
If the VAR is stationary, an F-test can be used.
Impulse Response Functions
To understand the dynamic effects of the error process $\epsilon_t$ on $Y_{t+h}$, one can calculate the so-called impulse-response function.
with $\Sigma=diag[\sigma_1^2,…,\sigma_n^2]$
Suppose there is an interest in the effect of shocks corresponding to the first variable.
One can then calculate
and
The n elements of the $V_k$ vector series, $k = 0, 1, 2, . . . , h$ are called the impulse-response functions
The i-th impulse-response function is the trajectory $\{v_{ik},k=0,1,…,h\}$
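If we consider a stationary VAR(p) process with moving average representation
$Y_t=\sum_{k=0}^\infty\Psi_k\epsilon_{t-k}$
The impulse-response function is defined as
$v(i,j,k)=[\Psi_k]_{ij}=\frac{\partial Y_{it}}{\partial\epsilon_{j,t-k}}$
i.e. the effect of $j$ on $i$ at horizon $k$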
$v(i,j,k)$ represent the response of $Y_{it}$ to a unitary shock in $Y_{j,t-k}$ (produced by a unitary shock in $\epsilon_{j,t-k}$.)
$\Sigma$ is not required to be diagonal, so the component of $\epsilon_t$ may be contemporaneously correlated.
If the correlations are high, there is no way of separating the response of $Y_{it}$ to a shock on $\epsilon_{j,t−k}$ from its response to other shocks that are correlated to $\epsilon_{j,t−k}$.
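If we define the square invertible matrix $S$ such that
$\Sigma=SS'$
then, setting $\tilde\Psi_j\equiv\Psi_jS,\xi_t=S^{-1}\epsilon_t$
$Y_t=\sum_{j=0}^\infty\tilde\Psi_j\xi_{t-j},\quad \mathbb{E}(\xi_t\xi_t')=I_n$
This turns the originally strongly correlated residuals into uncorrelated white-noise shocks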
We can define the structural impulse-response function as
- Note that if $\epsilon_t=S\xi_t$, then $\Sigma=SS’$
$S$ is often defined as a lower triangular matrix (Choleski decomposition)
The fact that $S$ is triangular implies that $\epsilon_{1t}$ is a function of the first structural shocks $\xi_{1t}, \epsilon_{2t}$ is a function of the structural shocks $\xi_{1t}$ and $\xi_{2t}$ and so on.
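The orthogonalisation can be illustrated for a bivariate VAR(1), where $\Psi_k=\Phi^k$ and the structural responses are $\tilde\Psi_k=\Psi_kS$ with $S$ the lower-triangular Cholesky factor of $\Sigma$. The sketch below is self-contained; the function names and the numerical values of $\Phi$ and $\Sigma$ are hypothetical.

```python
import math

def cholesky_2x2(sigma):
    """Lower-triangular S with S S' = sigma, for a symmetric positive definite 2x2 matrix."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def orthogonalised_irf(phi, sigma, horizon):
    """Structural responses Psi_k S for k = 0, ..., horizon, with Psi_k = phi^k."""
    s = cholesky_2x2(sigma)
    psi = [[1.0, 0.0], [0.0, 1.0]]  # Psi_0 is the identity
    responses = []
    for _ in range(horizon + 1):
        responses.append(matmul(psi, s))
        psi = matmul(phi, psi)
    return responses

# Hypothetical VAR(1) coefficients and error covariance matrix.
phi = [[0.5, 0.0], [0.2, 0.3]]
sigma = [[4.0, 2.0], [2.0, 5.0]]
irf = orthogonalised_irf(phi, sigma, horizon=2)
```

Entry `irf[k][i][j]` is the response of variable `i` after `k` periods to a one-unit structural shock `j`. Because $S$ is lower triangular, the second structural shock has no impact effect on the first variable, so the results depend on the ordering of the variables.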
【Example】
Let $Y_t=\Phi Y_{t-1}+\epsilon_t,\epsilon_t\sim(0,\Sigma)$, with
Hence,
and so on.
Note that $\Sigma=SS’$, with
which is equivalent to $\tilde\Psi_0$. Moreover
and so on.
In practice $\Sigma$ is estimated by
and the Cholesky decomposition is computed on $ \hat\Sigma$, that is
Forecast error variance decomposition
The orthogonalized error variance decomposition is defined as
Expressed as a percentage, this allows one to examine the relative importance of shocks to $Y_j$ for the h-step-ahead forecast error variance of $Y_i$.
Nonstationary VAR
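Consider the n-variate VAR
$\Phi(L)Y_t=\epsilon_t,\quad \Phi(L)=I_n-\Phi_1L-\dots-\Phi_pL^p$
The VAR process is stationary if
$\det\Phi(z)\ne0$ for all $|z|\le1$
Recall that
$\Delta Y_t=-\Pi Y_{t-1}-\sum_{j=1}^{p-1}\Phi_j^*\Delta Y_{t-j}+\epsilon_t$
where $\Phi_j^*=\sum_{k=j+1}^p\Phi_k$ and $\Pi\equiv\Phi(1)$
Beveridge-Nelson Decomposition for the I(1) case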
Augmented Dickey Fuller (ADF) Test
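If $\Pi$ is a matrix of zeros, $\Delta Y_t$ is a VAR(p-1) in first differences and there is no cointegration.
If $\Pi\ne 0$ and $det(\Pi)\ne0$ the process is stationary.
If $\Pi\ne 0$ but $det(\Pi)=0$, that is $\Pi$ has reduced rank, then $Y_t$ is a cointegrated process.
The purpose of a cointegration test is to determine whether a linear combination of a set of nonstationary series has a stable equilibrium relationship
If $rank(\Pi)=r<n$, then
$\Pi=\alpha\beta'$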
where $\alpha$ and $\beta$ are $(n \times r)$ full column rank matrices.
Hence, the VAR admits the representation
where $\beta'Y_t$ needs to be stationary!
Unit 7: Cointegration
Reading:
- Brooks, Chapter 8, Sections 8.3 - 8.11 (Essential)
- Appendix C (Essential)
Activities:
- Brooks Chapter 8, Self-Study Questions 5,6,7
Definition:
Cointegration
An n-variate time series $\{X_t\}$ is called a cointegrated system of order ($d,b$), written $\{X_t\} \sim CI(d, b)$, if all the components are I($d$) and there exists an $n \times r$ ($r < n$) matrix $\beta \ne 0$ such that $\beta'X_t \sim I(d - b)$, with $d \ge b > 0$. The columns $\beta_k, k = 1, \dots, r$, of $\beta$ are called the cointegrating vectors.
Examples of possible Cointegrating Relationships in finance:
- Spot and futures prices, spot and forward prices.
- Bid and ask prices.
- Ratio of relative prices and an exchange rate (law of one price)
- Equity prices and dividends
Market forces arising from no arbitrage conditions should ensure an equilibrium relationship
The Engle-Granger Method (cointegration test)
- If the null of a unit root in the residuals is not rejected (ADF test), $\beta=[1,-\gamma']$ is not a cointegrating vector.
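A minimal sketch of the two Engle-Granger steps on simulated data may help. It assumes a single regressor, a data-generating process chosen here so that the series are cointegrated, and a Dickey-Fuller regression on the residuals without lagged differences. The statistic has to be compared with Engle-Granger critical values (roughly -3.34 at the 5% level for two variables with a constant), not with the standard Dickey-Fuller ones.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
x = np.cumsum(rng.standard_normal(T))         # I(1) regressor (random walk)
y = 1.0 + 2.0 * x + rng.standard_normal(T)    # cointegrated with x by construction

# Step 1: OLS of y on a constant and x gives the candidate cointegrating vector
X = np.column_stack([np.ones(T), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ coef                              # estimated equilibrium errors

# Step 2: Dickey-Fuller regression on the residuals, du_t = rho * u_{t-1} + error
du, ulag = np.diff(u), u[:-1]
rho = (ulag @ du) / (ulag @ ulag)
resid = du - rho * ulag
se = np.sqrt(resid @ resid / (len(du) - 1) / (ulag @ ulag))
t_stat = rho / se

print(f"gamma_hat = {coef[1]:.3f}, DF t-statistic on residuals = {t_stat:.2f}")
```

A strongly negative statistic rejects the unit root in the residuals, so the estimated vector is treated as a cointegrating vector. In applied work lagged differences would usually be added to the second regression.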
Stock prices and Dividends
The dividend-price ratio $\Lambda_t$ is usually measured as $\Lambda_t=D_t/S_t$, where
$D_t$: the sum of dividends paid on the stock/index over the previous year
$S_t$: the current price of the stock/ current level of the index
Hence, with lower-case letters denoting logs, $\lambda_t=d_t-s_t$
Assume $\lambda_t=\lambda+u_t$, with $u_t\sim I(0)$
Let
which can be rewritten as
and we have
This model can be generalized to
known as an Error Correction Model (ECM)
In an ECM, the change in a variable depends on the deviations from some equilibrium relation.
Let $x_t=(s_t,d_t)'$ and assume that
Then,
Consider 3 situations:
$d_{t-1}-s_{t-1}-\lambda=0$ (equilibrium)
where $\mu$ represents the long-run growth rate of stock prices (and dividends).
$d_{t-1}-s_{t-1}-\lambda>0$ (positive disequilibrium error)
If $\alpha<0$, the model predicts that $d_t$ will grow more slowly than its long-run rate to restore the dividend yield to its long-run mean.
$d_{t-1}-s_{t-1}-\lambda<0$ (negative disequilibrium error)
If $\alpha<0$, the model predicts that $d_t$ will grow faster than its long-run rate to restore the dividend yield to its long-run mean.
VARs with Integrated Variables
we have
can be re-written as
$\Phi_j^*=-\sum_{k=j+1}^p\Phi_k$ and $\Pi \equiv -\Phi(1)$
$\Pi$ is known as the long-run matrix
The representation is the multivariate counterpart of the ADF regression
If $\det(\Pi) = 0$ then the $VAR(p)$ is said to have at least one unit root
If $\det(\Pi) = 0$, then $rank(\Pi) = r < n$.
If $r = 0$, then $\Pi = 0$ ($n$ unit roots), and $\Delta X_t$ is a $VAR(p-1)$.
If $0< r < n$, then $\Pi=\alpha\beta'$, where
$\alpha$ and $\beta$ are $(n \times r)$ matrices.
The columns of $\beta$ are known as the cointegrating vectors
The system will contain only $(n-r)$ unit roots
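The link between the rank of $\Pi$ and the number of unit roots can be checked on a small example. The sketch below assumes a bivariate VAR(1) built from hypothetical $\alpha$ and $\beta$, and uses the convention $\Pi\equiv-\Phi(1)$ with $\Phi(1)=I-\Phi_1$.

```python
import numpy as np

alpha = np.array([[-0.2], [0.1]])      # hypothetical adjustment coefficients (n x r)
beta = np.array([[1.0], [-1.0]])       # hypothetical cointegrating vector (n x r)
Phi1 = np.eye(2) + alpha @ beta.T      # VAR(1) coefficient matrix

Pi = -(np.eye(2) - Phi1)               # Pi = -Phi(1), with Phi(1) = I - Phi_1
r = np.linalg.matrix_rank(Pi)

# For a VAR(1), unit roots show up as eigenvalues of Phi_1 equal to one
eigvals = np.linalg.eigvals(Phi1)
n_unit_roots = int(np.sum(np.isclose(np.abs(eigvals), 1.0)))

print("rank(Pi) =", r, "| unit roots =", n_unit_roots)
```

Here $n=2$ and $r=1$, so the system has $n-r=1$ unit root; the other eigenvalue of $\Phi_1$ is 0.7.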
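Substituting $\Pi=\alpha\beta'$, we obtain the Vector Error Correction Model (VECM)
Let $A$ be an $(r \times r)$ non-singular matrix. Then $\Pi=\alpha\beta'=(\alpha A)(A^{-1}\beta')$, so $\alpha$ and $\beta$ are not uniquely identified.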
Let
and assume that $\beta_1$ is invertible. To identify the matrices, the following normalization is often imposed
Testing for Cointegration
Johansen (1988) proposed estimating the model by maximum likelihood, assuming that $\{\epsilon_t\}$ is Gaussian.
Maximum likelihood estimation is based upon a known value of the cointegrating rank $r$, which is usually unknown and therefore has to be tested.
The maximum eigenvalue statistic, $\lambda_{max}$, has been proposed to test the hypothesis $H_0: rank(\Pi)=r_0$ against $H_1: rank(\Pi)=r_0+1$.
The statistic $\lambda_{max}$ is constructed as follows:
- A positive semi-definite matrix $\Xi$ having the same rank as $\Pi$ is defined. Let $\lambda_k\equiv\lambda_k(\Xi)$, $k = 1, \dots, n$, denote its eigenvalues in decreasing order. By definition of the rank, exactly $r$ of them are strictly positive and the remaining $n-r$ are zero.
- Obtain a consistent estimate $\hat{\Xi}$. Then $\lambda_k(\hat{\Xi})$ are consistent estimates of $\lambda_k$, $k = 1, \dots, n$.
- Test the null hypothesis $H_0 : -T \ln(1 - \lambda_{r_0+1}) = 0$, for $r_0 = 0, \dots, n-1$, until we fail to reject.
- Let $r^*_0$ be the value of $r_0$ for which the null hypothesis cannot be rejected for the first time. Then, $\hat r = r^*_0$.
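The trace statistic $\lambda_{trace}$ has been proposed to test the hypothesis $H_0: rank(\Pi)=r_0$ against $H_1: rank(\Pi)>r_0$.
The statistic $\lambda_{trace}$ is constructed as
$\lambda_{trace}(r_0)=-T\sum_{k=r_0+1}^{n}\ln(1-\hat\lambda_k)$, where $\hat\lambda_k=\lambda_k(\hat\Xi)$
Test the sequence of null hypotheses $H_0 : \lambda_{trace}(r_0) = 0$, from $r_0 = 0$ to $r_0 = n-1$, until we fail to reject
$r_0$ takes these values one at a time, in increasing order.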
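Once the eigenvalues have been estimated, both statistics and the sequential rank selection are simple to compute. The sketch below takes the eigenvalues and the sample size as given; the numbers are hypothetical, and the critical values must come from the appropriate Johansen tables, so they are left as an input. The function names are illustrative.

```python
import math

def johansen_stats(eigenvalues, T):
    """Return the lists lambda_max(r0) and lambda_trace(r0) for r0 = 0, ..., n-1."""
    lam = sorted(eigenvalues, reverse=True)   # lam[r0] is lambda_{r0+1} in the notes
    n = len(lam)
    max_eig = [-T * math.log(1.0 - lam[r0]) for r0 in range(n)]
    trace = [-T * sum(math.log(1.0 - lam[k]) for k in range(r0, n))
             for r0 in range(n)]
    return max_eig, trace

def select_rank(stats, critical_values):
    """First r0 for which the null is not rejected (n if every null is rejected)."""
    for r0, (stat, cv) in enumerate(zip(stats, critical_values)):
        if stat <= cv:
            return r0
    return len(stats)

# Hypothetical estimated eigenvalues for n = 3 variables and T = 200 observations
max_eig, trace = johansen_stats([0.25, 0.08, 0.01], T=200)
print([round(s, 2) for s in max_eig])
print([round(s, 2) for s in trace])
```

By construction the trace statistic for $r_0$ is the sum of the maximum eigenvalue statistics from $r_0$ onwards.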
Forecasting, Causality and IRFs
Causality Testing in VECMs
Assume $n = 2$ and $X_t = (Y_t , Z_t)'$. The VECM takes the form
where
The hypothesis that $Z$ does not Granger-cause $Y$ may be formalized as
Testing the hypothesis is not straightforward.
Alternative approaches (as the lag-augmented VAR) have been proposed.
Forecasting Integrated and Cointegrated systems
Forecasting can be discussed in the framework of the levels VAR representation.
A VECM may be rewritten in VAR form, or forecasting can be obtained directly from the VECM.
If a variable enters the system in differenced form only, it is, of course, still possible to generate forecasts of the levels. See Brooks, Rew and Ritson (2001), Section 4.
Impulse Response Analysis
In principle, impulse response analysis in cointegrated systems can be conducted in the same way as for stationary systems.
Self-Study Questions:
【CHP 8:Q6.a】Suppose that a researcher has a set of three variables and wishes to test for the existence of cointegrating relationships using the Johansen procedure. What is the implication of finding that the rank of the appropriate matrix takes on a value of 0, 1, 2 or 3?
If the rank of the matrix $\Pi$ is zero, this implies that there is no cointegration, i.e. no common stochastic trends between the series. A rank of one or two would imply that there are, respectively, one or two linearly independent cointegrating vectors, i.e. combinations of the series that are stationary. A rank of 3 would imply that the matrix is of full rank and hence that the original series were stationary; provided that unit root tests had been conducted on each series beforehand, this case would effectively have been ruled out.
Unit 8:Forecasts Evaluation (Part I)
Reading:
- Brooks, Chapter 6, Section 6.10 (Essential)
- Diebold, Chapter 10 (Essential, Subsections 10.2.2 and 10.2.4 excluded)
Activities:
- Brooks Chapter 6, Self-Study Questions 11(d), 12(f).
- Diebold, Section 10.4, Exercises, Problems and Complements 1-3. The solutions can be found here (See Chapter 12, Exercises 1.2.6).
Definition:
Rolling vs Recursive Samples
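- Rolling: the estimation window has a fixed length and moves forward one period at a time (M1-M12; M2-M13; M3-M14; …)
- Recursive: the start of the estimation window is fixed and one observation is added each time (M1-M12; M1-M13; M1-M14; …)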
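The two sampling schemes differ only in where each estimation window starts. A minimal sketch, with zero-based indices and a hypothetical helper name `estimation_windows`:

```python
def estimation_windows(n_obs, first_window, scheme):
    """Yield (start, end) index pairs, end exclusive, of each estimation sample.

    The sample [start, end) is used to forecast observation `end`.
    """
    for end in range(first_window, n_obs):
        # Rolling: fixed length, so the start moves; recursive: start stays at 0
        start = end - first_window if scheme == "rolling" else 0
        yield start, end

rolling = list(estimation_windows(15, 12, "rolling"))
recursive = list(estimation_windows(15, 12, "recursive"))
print("rolling:  ", rolling)
print("recursive:", recursive)
```

With 15 monthly observations and a first window of 12 months, both schemes produce three 1-step-ahead forecasts, but only the recursive sample grows over time.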
Testing Properties of Optimal Forecasts
Optimal forecast errors satisfy four properties:
- They have zero mean (unbiasedness).
- 1-step-ahead optimal forecast errors are white noise.
- h-step-ahead optimal forecast errors are at most MA($h-1$).
- The h-step-ahead optimal forecast error variance is non-decreasing in h.
Hypothesis 1
Regress $e_{t,h}$ on a constant and use the reported t-statistic to test that the constant is zero.
Multi-step-ahead forecast errors will be serially correlated because the forecast periods overlap.
- Fitting a MA(h − 1) is a good initial guess to model autocorrelation in the regression error.
- Robust standard errors can be used (e.g. Newey-West estimator).
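Regressing the errors on a constant amounts to testing that their mean is zero, so the robust t-statistic can be computed directly. The sketch below uses a Newey-West long-run variance with Bartlett weights. The forecast errors are made up, and the choice of $h-1$ lags for an $h$-step forecast is an assumption in line with the MA($h-1$) remark above.

```python
import math

def newey_west_t_stat(errors, lags):
    """t-statistic for H0: E(e) = 0 with a Bartlett-kernel long-run variance."""
    n = len(errors)
    mean = sum(errors) / n
    dev = [e - mean for e in errors]

    def autocov(k):
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    lrv = autocov(0)
    for k in range(1, lags + 1):
        lrv += 2.0 * (1.0 - k / (lags + 1)) * autocov(k)   # Bartlett weights
    return mean / math.sqrt(lrv / n)

# Hypothetical 3-step-ahead forecast errors
errors = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2, 0.0, -0.1]
t_plain = newey_west_t_stat(errors, lags=0)    # ignores serial correlation
t_robust = newey_west_t_stat(errors, lags=2)   # h - 1 = 2 lags
print(round(t_plain, 3), round(t_robust, 3))
```

With `lags=0` the statistic reduces to the usual mean over standard error ratio, computed with the variance divided by $n$.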
Hypothesis 2
1-step-ahead optimal forecast errors are white noise.
- Regress $e_{t,1}$ on a constant and test the null hypothesis that the residuals are white noise.
Hypothesis 3
The h-step ahead optimal forecast errors are at most MA($h-1$).
Examine the statistical significance of the sample ACF($k$), $k > h-1$, using Bartlett's standard errors.
Regress $e_{t,h}$ on a constant, allowing for MA($q$) disturbances, with $q > h-1$, and test whether the MA coefficients beyond lag $h-1$ are zero.
Hypothesis 4
The h-step ahead optimal forecast error variance is non-decreasing in h.
Is {$e_{t,h}$} orthogonal to available information?
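Mincer and Zarnowitz (1969) proposed to test partial optimality with respect to $f_{t+h,t}$ using the regression
$Y_{t+h}=\beta_0+\beta_1 f_{t+h,t}+u_t$
or, equivalently, with $e_{t+h,t}=Y_{t+h}-f_{t+h,t}$,
$e_{t+h,t}=\alpha_0+\alpha_1 f_{t+h,t}+u_t$
If the forecast is optimal with respect to the information set used to construct it, then we'd expect $(\beta_0,\beta_1)=(0,1)$, i.e. $(\alpha_0,\alpha_1)=(0,0)$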
Ramsey (1969)'s test accounts for various sorts of nonlinearity
If the forecast is optimal with respect to the information set used to construct it, then we’d expect
In general $\Omega_t$ doesn’t include all information available at the time the forecast was made. We will deal with forecasts that will be at most partially optimal.
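The Mincer-Zarnowitz idea can be sketched as an OLS regression of realizations on forecasts, followed by a joint test that the intercept is zero and the slope is one. The example assumes simulated 1-step-ahead forecasts that are optimal by construction and a homoskedastic OLS covariance matrix. For $h>1$ a robust (e.g. Newey-West) covariance would be needed instead.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(300)                  # hypothetical forecasts
y = f + 0.5 * rng.standard_normal(300)        # realizations = forecast + unpredictable noise

# OLS of the realization on a constant and the forecast
X = np.column_stack([np.ones_like(f), f])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
cov = (resid @ resid / (len(y) - 2)) * np.linalg.inv(X.T @ X)

# Wald statistic for the joint null (intercept, slope) = (0, 1),
# approximately chi-square with 2 degrees of freedom under the null
diff = beta - np.array([0.0, 1.0])
wald = float(diff @ np.linalg.inv(cov) @ diff)

print("intercept, slope:", np.round(beta, 3), "| Wald:", round(wald, 2))
```

A large Wald statistic relative to the $\chi^2(2)$ critical value would indicate that the forecast errors are predictable from the forecast itself.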
Relative standards for point forecasts
Accuracy ranking via expected loss
- Forecast accuracy is measured with respect to a loss function, $\mathscr L(e_{t,h})$, and the forecast horizon h
Forecast error
Accuracy measures are defined on the forecast errors $e_{t,h}=Y_{t+h}-f_{t,h}$
or the percent forecast errors $p_{t,h}=(Y_{t+h}-f_{t,h})/Y_{t+h}$
Mean error
- The ME measures the bias
- If ME > 0, then on average we are “under-forecasting” (“over-forecasting” if ME < 0)
- Other things the same, we prefer a forecast whose errors have small bias.
Error Variance
- EV measures the dispersion of the forecast error.
- Other things the same, we prefer a forecast whose errors have small variance.
- Although ME and EV are components of accuracy, neither provides an overall accuracy measure.
Mean Square Error
The most common overall accuracy measure is the mean squared error
In sample we write
Mean Square Error Decomposition
$MSE = EV + ME^2$
Bias-Variance trade-off: a small increase in bias is acceptable in exchange for a large reduction in variance.
Root Mean Square Error
Often the square root of the MSE is used to preserve units.
Suppose that the forecast errors are measured in dollars, then the MSE is measured in dollars squared.
Taking the square root brings the unit back to dollars.
Mean Absolute Error
A somewhat less popular, but nevertheless common, measure is the mean absolute error
A quadratic loss function puts more weight on large errors.
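The sample versions of these measures take only a few lines. The sketch below uses made-up realizations and forecasts, defines the error as realization minus forecast, and divides by the number of forecasts throughout. That convention is an assumption; it is the one under which the decomposition of the MSE into error variance plus squared mean error holds exactly in sample.

```python
import math

y = [1.2, 0.8, 1.5, 1.1, 0.9]     # hypothetical realizations
f = [1.0, 1.0, 1.2, 1.3, 0.7]     # hypothetical forecasts

e = [yi - fi for yi, fi in zip(y, f)]           # forecast errors
n = len(e)

me = sum(e) / n                                 # mean error (bias)
ev = sum((ei - me) ** 2 for ei in e) / n        # error variance
mse = sum(ei ** 2 for ei in e) / n              # mean squared error
rmse = math.sqrt(mse)                           # same units as the data
mae = sum(abs(ei) for ei in e) / n              # mean absolute error

print(f"ME={me:.4f} EV={ev:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f}")
```

Since ME is positive here, the forecasts are on average too low ("under-forecasting").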
Statistical comparison of forecast accuracy
Suppose that two different forecasts of the same object are available, say $f_{t,h}^{(1)}$ and $f_{t,h}^{(2)}$.
Suppose that $\hat{MSE}^{(1)}(h) < \hat{MSE}^{(2)}(h)$
One is tempted to conclude that $f_{t,h}^{(1)}$ “wins” under the quadratic loss function ($\mathscr L$). Whether the difference is statistically significant is the question that the Diebold-Mariano test answers.
In hypothesis testing terms, we might want to test the null of equal predictive accuracy, $H_0:\mathbb E(d_{12,t})=0$,
against the alternative that one forecast is better, i.e.
against $H_1:\mathbb E(d_{12,t})\ne 0$, where $d_{12,t}=\mathscr L(e_{t,h}^{(1)})-\mathscr L(e_{t,h}^{(2)})$
The Diebold-Mariano Test
Diebold and Mariano assume that $d_{12,t}$ is covariance stationary, i.e., for every $t$, $\mathbb E(d_{12,t})=\mu$ and $cov(d_{12,t},d_{12,t-k})=\gamma_d(k)$,
and short memory, i.e. $\sum_{k=-\infty}^\infty|\gamma_d(k)|<\infty$
Under $H_0:\mathbb E(d_{12,t})=0$
$DM=\bar d_{12}/\hat\sigma_{\bar d_{12}}\xrightarrow{d} N(0,1)$
where $\bar d_{12} = (N - h + 1)^{-1}\sum^{N-h}_{t=0} d_{12,t}$ has an estimated standard deviation equal to $\hat\sigma_{\bar d_{12}}$
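A minimal sketch of the Diebold-Mariano statistic under quadratic loss follows. It assumes the long-run variance of the loss differential is estimated with the first $h-1$ sample autocovariances and equal weights. This is one common choice, and it can give a non-positive estimate in small samples, in which case a Bartlett-weighted estimator is a usual alternative. The error series are made up.

```python
import math

def diebold_mariano(e1, e2, h=1):
    """DM statistic for equal predictive accuracy under quadratic loss."""
    d = [a * a - b * b for a, b in zip(e1, e2)]      # loss differential d_12,t
    n = len(d)
    d_bar = sum(d) / n
    dev = [x - d_bar for x in d]

    def autocov(k):
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    # Long-run variance using autocovariances up to lag h - 1 (equal weights)
    lrv = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    return d_bar / math.sqrt(lrv / n)

# Hypothetical 1-step-ahead forecast errors of two competing forecasts
e1 = [0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1]
e2 = [0.4, -0.5, 0.3, -0.6, 0.2, 0.5, -0.4, 0.3]
dm = diebold_mariano(e1, e2, h=1)
print(round(dm, 3))   # compare with N(0, 1) critical values
```

A negative value means the first forecast has the smaller average squared error; swapping the two forecasts only flips the sign.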
Benchmark Comparisons: the predictive $R^2$
To assess the forecasts, one might compare a forecast to a “naive” competitor.
The Predictive $R^2$
considers $\bar Y$ as a benchmark for $f_{t,1}$.
- The (estimate of) 1-step-ahead out-of-sample forecast error variance is compared to an estimate of unconditional variance.
- The predictive $R^2$ should be close to 1 if the forecast is far more accurate than $\bar Y$.
- The h-step-ahead version of the predictive $R^2$ is defined replacing $e_{t,1}$ with $e_{t,h}$.
Benchmark Comparison: Theil's U-Statistic
The Theil U-Statistic is similar to a predictive $R^2$, but the benchmark changes from $\bar Y$ to a “no change” forecast
Many economic variables may in fact be nearly random walks.
In this case the forecaster will have great difficulty beating the random walk (RW), for which $f_{t,1} = Y_{t}$.
This method cannot be applied to GARCH forecasts, because of the $Y$ term.
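Both benchmark comparisons can be sketched as ratios of sums of squared errors. The exact formulas are assumptions here: the predictive $R^2$ is taken as one minus the ratio of the squared forecast errors to the squared deviations from $\bar Y$, and Theil's U as the ratio of the squared forecast errors to those of the no-change forecast (some texts take a square root or use percentage errors). Function names and data are illustrative.

```python
def predictive_r2(y, f, y_bar):
    """1 - SSE(forecast) / SSE(benchmark mean y_bar)."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst

def theil_u(y, f, y_prev):
    """SSE(forecast) / SSE(no-change forecast), where y_prev[t] is the last observed value."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    sse_rw = sum((yi - pi) ** 2 for yi, pi in zip(y, y_prev))
    return sse / sse_rw

# Hypothetical out-of-sample realizations, forecasts and previous-period values
y = [1.0, 1.4, 1.1, 1.6, 1.3]
f = [1.1, 1.2, 1.2, 1.4, 1.4]
y_prev = [0.9, 1.0, 1.4, 1.1, 1.6]
y_bar = 1.2                        # hypothetical in-sample mean

print(round(predictive_r2(y, f, y_bar), 3), round(theil_u(y, f, y_prev), 3))
```

Under this definition a U below one means the forecast beats the random walk, and a U of exactly one is obtained by the no-change forecast itself.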
Evaluating direction-of-change-forecasts
- In terms of profitability of a trading strategy, a forecast can be assessed in terms of the ability to predict direction changes irrespective of their magnitude.
- The accuracy of the forecast is measured in terms of % correct sign prediction.
- We can also test the null hypothesis of no predictive power.
Introduce the indicator variables
Let $P_y = Pr(Y_{t+1} > 0)$ and $P_f = Pr(f_{t,1} > 0)$. Define
Denote by $\hat P$ the proportion of times that the sign of $Y_{t+1}$ is predicted correctly, i.e.
The assessment is based on a test for the null that $Z_t^y$ and $Z_t^f$ are independent (no predictive power).
Under the null, $N\hat P$ has a binomial distribution with mean $NP_*$, where
and
with
A nonparametric test of predictive performance of $f_{t,1}$ can be based on
If they are independent, $P(xy)=P(x)P(y)$, i.e. $\hat P-\hat P_*=0$
where
and
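The test described above appears to be the Pesaran-Timmermann (1992) sign test, so the sketch below uses its standard formulas; treating the two as the same test is an assumption. The indicators are one when the value is strictly positive, and the data are made up so that 8 out of 10 signs are predicted correctly.

```python
import math

def pesaran_timmermann(y, f):
    """Return (P_hat, P_star_hat, test statistic) for sign predictability."""
    n = len(y)
    zy = [1 if v > 0 else 0 for v in y]          # indicator of a positive realization
    zf = [1 if v > 0 else 0 for v in f]          # indicator of a positive forecast
    p_hat = sum(1 for a, b in zip(zy, zf) if a == b) / n
    p_y, p_f = sum(zy) / n, sum(zf) / n
    # Probability of a correct sign if realization and forecast were independent
    p_star = p_y * p_f + (1 - p_y) * (1 - p_f)
    var_p = p_star * (1 - p_star) / n
    var_p_star = ((2 * p_y - 1) ** 2 * p_f * (1 - p_f) / n
                  + (2 * p_f - 1) ** 2 * p_y * (1 - p_y) / n
                  + 4 * p_y * p_f * (1 - p_y) * (1 - p_f) / n ** 2)
    stat = (p_hat - p_star) / math.sqrt(var_p - var_p_star)
    return p_hat, p_star, stat

y = [1, -1, 2, -2, 1, -1, 3, -1, 1, -2]                       # hypothetical realizations
f = [0.5, -0.5, 1, -1, -0.2, -0.3, 0.4, 0.1, 0.2, -0.1]       # hypothetical forecasts
p_hat, p_star, stat = pesaran_timmermann(y, f)
print(p_hat, p_star, round(stat, 3))   # the statistic is compared with N(0, 1)
```

The statistic is undefined when all realizations or all forecasts share the same sign, because the variance difference in the denominator is then zero.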
Unit 8:Forecasts Evaluation (Part II)
Activities:
- Brooks Chapter 6, Self-Study Questions 11(d), 12(f).
- Diebold, Section 10.4, Exercises, Problems and Complements 1-3. The solutions can be found here (See Chapter 12, Exercises 1.2.6).
Definition:
Evaluating Volatility Forecasts
We can compute the forecasts $f_{t,1}^{\sigma^2}$ but we don’t observe $\sigma_{t+1}^2$
If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim iid(0,1)$, a popular proxy is $\epsilon_{t+1}^2$, because $\mathbb E(\epsilon_{t+1}^2|\Omega_t)=\sigma_{t+1}^2$.
$\sigma_{t+1}^2$ is determined by $\Omega_t$
Assume we have generated the series of 1-step-ahead-point forecasts $\{f_{T+j,1}^{\sigma^2}\}_{j=1}^N$
To simplify the notation we write
For example, the MSE is computed as
$\epsilon_t^2$ is an unbiased proxy of $\sigma_t^2$
Alternative volatility proxies have been proposed (see Section 9.18 Brooks).
If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim NID(0,1)$, then $Pr(\epsilon_t^2<\frac12\sigma_t^2)=Pr(|\eta_t|<1/\sqrt2)\approx0.52$
- i.e. $\epsilon_t^2<\frac12\sigma_t^2$ more than fifty percent of the time.
Hence $\epsilon_t^2$ is not a good proxy.
Even if one is willing to accept a proxy that is up to 50% different from $\sigma_t^2$, $\epsilon_t^2$ would fulfil this condition only about 25% of the time
That is, $\epsilon_t^2$ lies within 50% of $\sigma_t^2$ with a probability of only about 25%.
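Both probabilities follow from the standard normal distribution of $\eta_t$ and can be verified directly, since $\epsilon_t^2/\sigma_t^2=\eta_t^2$. The helper name below is illustrative.

```python
import math

def prob_abs_normal_below(a):
    """Pr(|eta| < a) for eta ~ N(0, 1)."""
    return math.erf(a / math.sqrt(2.0))

# Pr(eps^2 < 0.5 * sigma^2) = Pr(|eta| < sqrt(0.5))
p_below_half = prob_abs_normal_below(math.sqrt(0.5))

# Pr(0.5 * sigma^2 <= eps^2 <= 1.5 * sigma^2) = Pr(sqrt(0.5) <= |eta| <= sqrt(1.5))
p_within_50pct = (prob_abs_normal_below(math.sqrt(1.5))
                  - prob_abs_normal_below(math.sqrt(0.5)))

print(round(p_below_half, 4), round(p_within_50pct, 4))
```

The first probability is about 0.52 and the second about 0.26, which is the "only 25% of the time" figure quoted above, up to rounding.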
Transforming Volatility Forecasts into Probability Forecasts
Lopez (2001) proposed an alternative forecast evaluation framework.
If $\epsilon_t=\sigma_t\eta_t$, $\eta_t\sim D(0,1)$, and $\sigma_t^2$ is a predictable function of $\Omega_{t-1}$, then
volatility forecasts can be readily transformed into probability forecasts.
Out-of-sample $P_{t|t-1}$ for $t = 1, . . . , N$ is the one-step-ahead probability forecast conditional on $\Omega_{t-1}$
Probability Forecasts
Let $Y_t=\mu_t+\epsilon_t$
Suppose for example that a Central Bank is interested in forecasting whether the exchange rate ($Y_t$) will remain within a target zone
In such a case, the event of interest is
where $L_t$ and $U_t$ are fixed by the Central Bank (forecast user).
Assuming that D is continuous and $\mu_t$ and $\sigma_t$ are known
- Our aim is to forecast $Pr (L_t \leq Y_t \leq U_t)$ at time $t-1$.
Then, the one-step-ahead probability forecast $P_t$ is computed as
Remark
To avoid cumbersome notation, we will use $X_{t|t-1}$ to denote the one-step-ahead forecast of a variable $X_t$ , conditional on $\Omega_{t-1}$.
As an example of probability forecast evaluation, we will consider the Brier score.
- The Brier score is a rough analogue of the MSE for probability forecasts.
Accuracy measures for probability forecasts are commonly called scores.
The most common is the Brier quadratic probability score
$QPS=\frac1N\sum_{t=1}^N 2(P_t-R_t)^2$
- where $R_t$ takes value one if the event occurs and zero otherwise.
$QPS\in [0, 2]$ and smaller values indicate more accurate forecasts.
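The two steps, turning a mean and volatility forecast into a probability forecast and then scoring it, can be sketched as follows. The example assumes that $D$ is the standard normal distribution and that the QPS averages $2(P_t-R_t)^2$ over the forecasts, which is consistent with the $[0,2]$ range stated above. The target zone and the forecasts are made up.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def interval_probability(lower, upper, mu, sigma):
    """Pr(lower <= Y <= upper) when (Y - mu) / sigma is standard normal."""
    return normal_cdf((upper - mu) / sigma) - normal_cdf((lower - mu) / sigma)

def brier_qps(probs, outcomes):
    """Quadratic probability score: mean of 2 * (P_t - R_t)^2."""
    n = len(probs)
    return sum(2.0 * (p - r) ** 2 for p, r in zip(probs, outcomes)) / n

# Hypothetical target zone [-1, 1] with three one-step-ahead (mu, sigma) forecasts
forecasts = [(0.0, 0.5), (0.2, 1.0), (-0.1, 2.0)]
probs = [interval_probability(-1.0, 1.0, mu, s) for mu, s in forecasts]
outcomes = [1, 1, 0]               # R_t: 1 if Y_t ended up inside the zone
print([round(p, 3) for p in probs], round(brier_qps(probs, outcomes), 3))
```

Higher forecast volatility lowers the probability assigned to a fixed zone, which is how a volatility forecast translates into a probability forecast.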
Evaluating Interval Forecasts
The Lopez approach to volatility forecast evaluation is based on time-varying probabilities assigned to fixed intervals.
Alternatively, one may fix the probabilities and vary the widths of the intervals, as in the traditional confidence intervals construction.
The objective is to construct a sequence of out-of-sample interval forecasts $\{[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\}_{t=1}^N$.
Each ex-ante interval forecast for time $t$, made at time $t-1$, has coverage probability $(1-\alpha)$.
We are going to consider the approach proposed by Christoffersen (1998)
Definition 1 (Indicator variable)
The indicator variable $I_t$ for a given interval forecast $[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]$ is defined as
$I_t=1$ if $Y_t\in[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]$, and $I_t=0$ otherwise
- $I_t$ is a function of $Y_t$, so $I_t$ is a random variable
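Definition 2 (Testing Criteria)
We say that the sequence of interval forecasts $\{[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\}_{t=1}^N$
is efficient with respect to information set $\Omega_{t-1}$, if $\mathbb E(I_t|\Omega_{t-1})=1-\alpha$ for all $t$
If $\Omega_{t-1} = \{I_{t-1}, I_{t-2}, \dots , I_1\}$, it can be shown that $\mathbb E(I_t|\Omega_{t-1})=1-\alpha$ for all $t$
is equivalent to $\{I_t\} \sim^{iid} Bern(1-\alpha)$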
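Definition 3 (Conditional coverage)
We say that a sequence of interval forecasts $\{[L_{t|t-1}(\alpha),U_{t|t-1}(\alpha)]\}_{t=1}^N$
has correct conditional coverage if $\{I_t\} \sim^{iid} Bern(p_\alpha)$, with $p_\alpha=1-\alpha$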
Standard evaluation methods for interval forecasts compare the nominal coverage
to the true coverage
The interval forecast might be correct on average, but the conditional coverage might be characterized by clustered outliers.
LR test of unconditional coverage
Consider the indicator sequence $\{I_t\}^N_{t=1}$ constructed from a given interval forecast.
To test the unconditional coverage, the hypothesis $\mathbb E(I_t)=p_\alpha$
should be tested against $\mathbb E(I_t)\ne p_\alpha$, given independence
Let $n_j$ be the number of observations for which $I_t = j$, $j = 0, 1$. By construction $n_0 + n_1 = N$.
The likelihood under the null hypothesis is $L(p_\alpha)=(1-p_\alpha)^{n_0}p_\alpha^{n_1}$,
with $p_\alpha=1-\alpha$, and under the alternative $L(\pi)=(1-\pi)^{n_0}\pi^{n_1}$
Testing for unconditional coverage can be formulated as a likelihood ratio test, $LR_{uc}=-2\ln[L(p_\alpha)/L(\hat\pi)]\sim^{asy}\chi^2(1)$
where $\hat\pi = n_1/(n_0 + n_1)$ is the MLE of $\pi$
- The test has no power against the alternative that zeros and ones come clustered in a time-dependent fashion.
Testing for correct unconditional coverage is insufficient when dynamics are present in the higher-order moments.
Intervals should be narrow in tranquil times and wide in volatile times, so that occurrences of observations outside the interval forecast would be spread out over the sample and not come in clusters.
An interval that fails to account for higher-order dynamics may be correct on average, but in any given period it will have incorrect conditional coverage characterized by clustered outliers.
The two tests presented next address this issue.
LR test of independence
Recall that a sequence of interval forecasts has correct conditional coverage if $\{I_t\} \sim^{iid} Bern(p_\alpha)$.
The “independence part” will be tested against an explicit first-order Markov alternative.
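Consider a binary first-order Markov chain $\{I_t\}$, with transition probability matrix
$\Pi=\begin{pmatrix}1-\pi_{01}&\pi_{01}\\1-\pi_{11}&\pi_{11}\end{pmatrix}$
where $\pi_{ij}=Pr(I_t=j|I_{t-1}=i)$
The likelihood function for this process is
$L(\Pi)=(1-\pi_{01})^{n_{00}}\pi_{01}^{n_{01}}(1-\pi_{11})^{n_{10}}\pi_{11}^{n_{11}}$
where $n_{ij}$ is the number of observations with value $i$ followed by $j$.
The MLE (maximum likelihood estimate) of $\Pi$ is
$\hat\Pi=\begin{pmatrix}\frac{n_{00}}{n_{00}+n_{01}}&\frac{n_{01}}{n_{00}+n_{01}}\\\frac{n_{10}}{n_{10}+n_{11}}&\frac{n_{11}}{n_{10}+n_{11}}\end{pmatrix}$
Under the null of independence, $\Pi$ simplifies to
$\Pi_{ind}=\begin{pmatrix}1-\pi_{ind}&\pi_{ind}\\1-\pi_{ind}&\pi_{ind}\end{pmatrix}$
The likelihood function under the null becomes
$L(\Pi_{ind})=(1-\pi_{ind})^{n_{00}+n_{10}}\pi_{ind}^{n_{01}+n_{11}}$
The MLE of $\pi_{ind}$ is
$\hat\pi_{ind}=(n_{01}+n_{11})/(n_{00}+n_{10}+n_{01}+n_{11})$
The LR test of independence is
$LR_{ind}=-2\ln[L(\hat\Pi_{ind})/L(\hat\Pi)]\sim^{asy}\chi^2(1)$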
Next we will test jointly for independence and correct probability parameter $\pi_{ind}=p_\alpha$.
Combining the two test we get a complete test for correct conditional coverage (cc).
The Joint Test of Coverage and Independence
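The main idea is to test the null of the unconditional coverage test against the alternative of the independence test.
Christoffersen (1998) defines the test for correct conditional coverage as
$LR_{cc}=-2\ln[L(p_\alpha)/L(\hat\Pi)]\sim^{asy}\chi^2(2)$, where $L(p_\alpha)$ is the likelihood under correct coverage and $L(\hat\Pi)$ is the maximized first-order Markov likelihood
If the likelihood is computed conditional on the first observation (as we did), then
$LR_{cc}=LR_{uc}+LR_{ind}$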
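The three likelihood ratio statistics can be computed from the indicator sequence alone. The sketch below conditions on the first observation, so all counts are taken over the transitions $I_{t-1}\to I_t$. It assumes both states are visited, since otherwise a transition probability is undefined. The hit sequence is made up and the function names are illustrative.

```python
import math

def _loglik(p, n_zero, n_one):
    """Bernoulli log-likelihood with Pr(I = 1) = p; empty cells are skipped."""
    ll = 0.0
    if n_zero:
        ll += n_zero * math.log(1.0 - p)
    if n_one:
        ll += n_one * math.log(p)
    return ll

def christoffersen_tests(hits, p_alpha):
    """Return (LR_uc, LR_ind, LR_cc), conditional on the first observation."""
    n = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for prev, cur in zip(hits[:-1], hits[1:]):
        n[(prev, cur)] += 1                      # n_ij: value i followed by j
    n0 = n[(0, 0)] + n[(1, 0)]
    n1 = n[(0, 1)] + n[(1, 1)]
    pi_hat = n1 / (n0 + n1)
    pi01 = n[(0, 1)] / (n[(0, 0)] + n[(0, 1)])
    pi11 = n[(1, 1)] / (n[(1, 0)] + n[(1, 1)])
    ll_null = _loglik(p_alpha, n0, n1)           # correct coverage imposed
    ll_indep = _loglik(pi_hat, n0, n1)           # independence, free probability
    ll_markov = (_loglik(pi01, n[(0, 0)], n[(0, 1)])
                 + _loglik(pi11, n[(1, 0)], n[(1, 1)]))
    lr_uc = -2.0 * (ll_null - ll_indep)
    lr_ind = -2.0 * (ll_indep - ll_markov)
    lr_cc = -2.0 * (ll_null - ll_markov)
    return lr_uc, lr_ind, lr_cc

# Hypothetical indicator sequence for a nominal 90% interval (1 = inside)
hits = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
lr_uc, lr_ind, lr_cc = christoffersen_tests(hits, p_alpha=0.9)
print(round(lr_uc, 3), round(lr_ind, 3), round(lr_cc, 3))
```

The statistics are compared with $\chi^2(1)$, $\chi^2(1)$ and $\chi^2(2)$ critical values respectively, and the joint statistic equals the sum of the other two by construction.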
Example
Christoffersen (1998) evaluates the forecast methodology suggested by J.P. Morgan's (1995) RiskMetrics using exchange rate data for different countries.
He considers the model
The RiskMetrics interval forecast is tested against two peers.
The interval forecast suggested by J.P Morgan is
where
Here $\lambda$ is fixed at 0.94
The distribution $D(·)$ is Gaussian, and its c.d.f. is denoted by $\Phi(·)$.
Let $c_\alpha = \Phi^{-1} (\alpha)$, then $c_\alpha$ satisfies $P(Y_{t+1}/\sigma_{t+1}\leq c_\alpha)=\alpha$
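A minimal sketch of a RiskMetrics-style interval forecast is given below. It assumes the usual exponentially weighted recursion $\hat\sigma^2_{t+1|t}=\lambda\hat\sigma^2_{t|t-1}+(1-\lambda)Y_t^2$ with $\lambda=0.94$, a zero conditional mean and Gaussian quantiles. The initial variance, the function name and the returns are illustrative choices, not taken from Christoffersen's application.

```python
import math
from statistics import NormalDist

def riskmetrics_intervals(returns, coverage=0.95, lam=0.94, sigma2_init=None):
    """One-step-ahead Gaussian interval forecasts from an EWMA variance.

    intervals[i] is the forecast for the observation following returns[i].
    """
    c = NormalDist().inv_cdf(0.5 + coverage / 2.0)     # upper Gaussian quantile
    sigma2 = returns[0] ** 2 if sigma2_init is None else sigma2_init
    intervals = []
    for y in returns:
        sigma2 = lam * sigma2 + (1.0 - lam) * y * y    # sigma^2_{t+1|t}
        half_width = c * math.sqrt(sigma2)
        intervals.append((-half_width, half_width))
    return intervals

# Hypothetical daily returns, in percent
returns = [0.3, -0.5, 0.2, 1.4, -1.1, 0.4]
for lower, upper in riskmetrics_intervals(returns):
    print(f"[{lower:.3f}, {upper:.3f}]")
```

The intervals widen after large absolute returns and narrow again slowly, because $\lambda=0.94$ puts most of the weight on the previous variance.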
The first peer is constructed from an estimated GARCH(1,1) model with a Student’s t−innovation.
If the Student's t-distribution has 4 degrees of freedom with c.d.f. $\tau$, for a coverage of $1-\alpha = 95\%$ we have $\tau^{-1} (\alpha/2) = -2.776$ and $\tau^{-1}(1 - \alpha/2) = 2.776$.
To compute the interval, the unknown parameters in $\hat\sigma_{t+1|t}^2$ need to be replaced by estimates.
The second peer is a simple static forecast, constructed assuming that $D(·)$ is Gaussian.
Let $F(\cdot)$ be the unconditional, time-invariant c.d.f. of $Y_{t+1}$
Comparing interval forecasting of daily exchange rates.
- Tests for UC, IND, CC across coverage rates and exchange rates.
- Nominal coverage rate.
- Average width of the interval prediction.
- See Christoffersen (1996) Section 5 for further details