Markov-Switching-Models
In this blogpost I will discuss the underlying theory behind Markov-Switching models. This includes a short review of Markov-Chains, as well as in-depth derivations of the Hamilton-Filter, the Kim-Algorithm (for smoothed inference) and the EM-Algorithm. The derivations and explanations are mostly taken from my Bachelor thesis, which discussed these topics as well. The equation numbers are the same as in the Bachelor thesis, so you can find them easily in the original work. My thesis can be downloaded at the end of my “About Me” section and additionally discusses the development of my R package (MSARM) and how MSARM outperforms MSwM regarding estimation robustness.
Brief Remarks on Notation
We start with a short discussion of the notation used in this post. The density function of a continuous random variable \(Y\) will be denoted as \(f_Y(y)\); the conditional density function given another random variable \(X\) will be denoted as \(f_{Y|X}(y|x)\). We want to emphasize that in this notation, the subscript denotes the arguments of the density function, and the values in the parentheses denote the point at which the density function is evaluated. Furthermore, we define a generalized density function for a pair of a discrete random variable \(S\) and a continuous random variable \(Y\) as \(f_{Y,S}(y,s) = f_{Y|S}(y|s)\cdot P(S = s)\). Last but not least, it should be noted that we will denote a parameter vector \(\theta\) that influences the distribution of a random variable with a semicolon after the arguments in the density function’s subscript. For probability functions, the parameter vector will also be denoted with a subscript. For example, for a continuous random variable \(Y\), we could write \(f_{Y;\theta}(y)\), or for a discrete random variable \(S\), \(P_{\theta}(S = s)\). It should be noted that both could be seen as functions in \(\theta\).
Markov-Chains and Autoregressive Processes
Markov-Chains
With the general notation out of the way, we can start to build the foundation of Markov-Switching AR models. For that, we start with Markov-Chains in general. Markov-Chains are stochastic processes satisfying the so-called Markov property. More precisely, we consider a sequence of random variables \(S_t, S_{t-1},...\) that can take values in the set \(\{1, \dots, N\}\), and fulfill the property: \[\begin{equation} P(S_t = j \mid S_{t-1} = i, \dots, S_1 = a_1) = P(S_t = j \mid S_{t-1} = i). \end{equation}\] This means that the next state is conditionally independent of all previous states given the current state. The transition probabilities between the states of such a Markov-Chain are usually summarized in a so-called transition matrix \(\Pi\), where the ith row and jth column element of \(\Pi\) is given by \((\Pi)_{i,j} = \pi_{i,j} = P(S_t = j \mid S_{t-1} = i)\). Thereby \((\cdot)_{i,j}\) indicates the ith row and jth column element. Additionally, we want to point out that we will denote the transpose of a matrix \(A\) as \(A'\) throughout this post. Therefore, \(\Pi\) is defined such that the rows indicate the previous state and the columns represent the state being transitioned into. Furthermore, the sum of the row elements must equal 1. It has to be noted that we can represent a Markov-Chain as a Vector-Autoregressive-Process (VAR). The following discussion of representing Markov-Chains as a VAR closely follows Hamilton (1994, page 678-680). Suppose we have an underlying Markov-Chain \(S_t, S_{t-1},...\), and the state space is \(\{1, \dots, N\}\). We then define: \[\begin{equation} \zeta_{t} = \begin{cases} (1,0,...,0)', & \text{if } S_t = 1\\ (0,1,...,0)', & \text{if } S_t = 2\\ ...\\ (0,0,...,1)', & \text{if } S_t = N\\ \end{cases}. \end{equation}\] Thus for \(S_t = i\), \(\zeta_{t}\) equals the ith column of \(I_N\). If \(S_t = i\), then the jth element of \(\zeta_{t+1}\) is a random variable with \(P((\zeta_{t+1})_j = 1|S_t = i) = \pi_{i,j}\). Thus \[\begin{equation} E(\zeta_{t+1}|S_t = i) = \begin{pmatrix} \pi_{i,1}\\ \pi_{i,2}\\ ...\\ \pi_{i,N} \end{pmatrix}. \end{equation}\] Furthermore, one should note that \(E(\zeta_{t+1}|S_t = i)\) is the ith column of \(\Pi'\). Knowing that for \(S_t = i\), \(\zeta_{t}\) is equal to the ith column of \(I_N\), it follows that \(E(\zeta_{t+1}|\zeta_{t}) = \Pi'\zeta_{t}\). Due to (1) it holds that \(E(\zeta_{t+1}|\zeta_{t},\zeta_{t-1},...) = E(\zeta_{t+1}|\zeta_{t}) = \Pi'\zeta_{t}\), therefore we can write: \[\begin{equation} \zeta_{t+1} = \Pi'\zeta_{t} + v_{t+1}; \quad \text{where} \quad v_{t+1} = \zeta_{t+1} - E(\zeta_{t+1}|\zeta_{t},\zeta_{t-1},...). \end{equation}\] Equation (4) implies that: \[\begin{equation} \zeta_{t+m} = (\Pi')^{m}\zeta_t + (\Pi')^{m-1}v_{t+1} + ... + \Pi'v_{t+m-1} + v_{t+m}. \end{equation}\] This is due to the following derivation: \[\begin{align*} \zeta_{t+m} &= \Pi'\zeta_{t+m-1} + v_{t+m} \\ &= \Pi'(\Pi'\zeta_{t+m-2} + v_{t+m-1}) + v_{t+m} \\ &= \Pi'(\Pi'(\Pi'\zeta_{t+m-3} + v_{t+m-2}) + v_{t+m-1}) + v_{t+m} \\ &= ... \\ &= (\Pi')^{m}\zeta_t + (\Pi')^{m-1}v_{t+1} + ... + \Pi'v_{t+m-1} + v_{t+m}. \end{align*}\]
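To make the indicator representation \(\zeta_t\) more concrete, here is a minimal R sketch that simulates a two-state Markov-Chain and compares the model-implied conditional expectation \(\Pi'\zeta_t\) with empirical transition frequencies. The transition matrix and all other values are invented for illustration only.

```r
# Minimal sketch (R): a two-state Markov-Chain and its indicator representation.
# The transition matrix and all other values are invented for illustration.
set.seed(1)
Pi <- matrix(c(0.9, 0.1,   # row i: current state, column j: next state
               0.2, 0.8), nrow = 2, byrow = TRUE)

T_len <- 10000
s <- integer(T_len); s[1] <- 1
for (t in 2:T_len) {
  s[t] <- sample(1:2, size = 1, prob = Pi[s[t - 1], ])
}

zeta <- function(state, N = 2) diag(N)[, state]   # zeta_t: one-hot vector of the state

# E(zeta_{t+1} | zeta_t) = Pi' zeta_t versus empirical transition frequencies out of state 1
t(Pi) %*% zeta(1)
prop.table(table(s[-1][s[-T_len] == 1]))
```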
An m-period ahead forecast for a Markov-Chain can therefore be constructed in the following way: \[\begin{equation} E(\zeta_{t+m}|\zeta_t,\zeta_{t-1},...) = (\Pi')^{m}\zeta_t. \end{equation}\] This holds because: \[\begin{align*} E(\zeta_{t+m}|\zeta_{t},\zeta_{t-1},...) &= E(\Pi'\zeta_{t+m-1} + v_{t+m}|\zeta_{t},\zeta_{t-1},...) \\ &= E((\Pi')^{m}\zeta_t + (\Pi')^{m-1}v_{t+1} + ... + \Pi'v_{t+m-1} + v_{t+m}|\zeta_{t},\zeta_{t-1},...) \\ &= (\Pi')^{m}E(\zeta_t|\zeta_{t},\zeta_{t-1},...) + (\Pi')^{m-1}E(\zeta_{t+1} - E(\zeta_{t+1}|\zeta_t,\zeta_{t-1},...)|\zeta_t,\zeta_{t-1},...) + \\ & \hspace{0.5cm}... \\ & \hspace{0.5cm}+ \Pi'E(\zeta_{t+m-1} - E(\zeta_{t+m-1}|\zeta_{t+m-2},\zeta_{t+m-3},...)|\zeta_t,\zeta_{t-1},...) \\ & \hspace{0.5cm}+ E(\zeta_{t+m} - E(\zeta_{t+m}|\zeta_{t+m-1},\zeta_{t+m-2},...)|\zeta_t,\zeta_{t-1},...) \\ &= (\Pi')^{m}\zeta_t + (\Pi')^{m-1}[E(\zeta_{t+1}|\zeta_t,\zeta_{t-1},...) - E(E(\zeta_{t+1}|\zeta_t,\zeta_{t-1},...)|\zeta_t,\zeta_{t-1},...)] + \\ & \hspace{0.5cm}... \\ & \hspace{0.5cm}+ \Pi'[E(\zeta_{t+m-1}|\zeta_t,\zeta_{t-1},...) - E(E(\zeta_{t+m-1}|\zeta_{t+m-2},\zeta_{t+m-3},...)|\zeta_t,\zeta_{t-1},...)] \\ & \hspace{0.5cm}+ E(\zeta_{t+m}|\zeta_t,\zeta_{t-1},...) - E(E(\zeta_{t+m}|\zeta_{t+m-1},\zeta_{t+m-2},...)|\zeta_t,\zeta_{t-1},...) \\ &= (\Pi')^{m}\zeta_t. \end{align*}\] We could now also condition on other random variables, like \((Y_t, Y_{t-1},...)\). We summarize these random variables in a vector \(\mathcal{Y}_t\): \[\begin{equation} \mathcal{Y}_t = (Y_t,Y_{t-1},...), \end{equation}\] the realization of \(\mathcal{Y}_t\) will be denoted as: \[\begin{equation} \vec{y}_t = (y_t,y_{t-1},...). \end{equation}\] Furthermore we define: \[\begin{equation} \mathcal{Y}_{t:\tau} = (Y_t,Y_{t-1},...,Y_{\tau}), \end{equation}\] the realization of \(\mathcal{Y}_{t:\tau}\) will be denoted as: \[\begin{equation} \vec{y}_{t:\tau} = (y_t,y_{t-1},...,y_{\tau}). \end{equation}\] Generally speaking, \(\mathcal{Y}_T\) will be the total time series of interest and \(\vec{y}_T\) the observed realization. If the process is governed by regime \(S_t = j\) at date \(t\), then the conditional density of \(Y_t\) is assumed to be given by \(f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|j,\vec{y}_{t-1})\). Thereby \(\alpha\) is a vector of parameters characterizing the conditional density function. Furthermore it is assumed that the conditional density depends only on the current regime \(S_t\) and not on past regimes; more precisely, it shall hold: \[\begin{equation} f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|s_t,\vec{y}_{t-1}) = f_{Y_t|S_t,S_{t-1},...,\mathcal{Y}_{t-1};\alpha}(y_t|s_t,s_{t-1},...,\vec{y}_{t-1}). \end{equation}\] Additionally, \(S_{t+m}\) shall be conditionally independent of \(\mathcal{Y}_t\) given \(S_t\), for \(m \geq 1\), therefore it shall hold that: \[\begin{equation} P_{\theta}(S_{t+m} = j|S_t = i) = P_{\theta}(S_{t+m} = j|S_t = i, \mathcal{Y}_t = \vec{y}_t). \end{equation}\] One could now condition on this random vector \(\mathcal{Y}_t\), given a parameter vector \(\theta\), which includes the transition probabilities of the Markov-Chain, as well as the parameters of the distribution of \(Y_t\), therefore \(\theta = (\Pi,\alpha)\). It should be noted that \(\Pi\) has to be understood in this notation, as part of \(\theta\), as the vector of transition probabilities, instead of the matrix of transition probabilities. Still, we believe that this notation improves the readability and understanding of what \(\theta\) is compared to other notations.
We can now write: \[\begin{equation} \hat{\zeta}_{t|t} = E_{\theta}(\zeta_{t}|\mathcal{Y}_t = \vec{y}_t) = \begin{pmatrix} P_{\theta}(S_t = 1|\mathcal{Y}_t = \vec{y}_t)\\ P_{\theta}(S_t = 2|\mathcal{Y}_t = \vec{y}_t)\\ ...\\ P_{\theta}(S_t = N|\mathcal{Y}_t = \vec{y}_t) \end{pmatrix}. \end{equation}\] We now want to estimate \(\zeta_{t+m}\) with \(\hat{\zeta}_{t+m|t} = E_{\theta}(\zeta_{t+m}|\mathcal{Y}_t = \vec{y}_t)\). From earlier we know that: \[\begin{align*} \displaystyle (E_{\theta}(\zeta_{t+m}|\mathcal{Y}_t = \vec{y}_t))_j &= P_{\theta}(S_{t+m} = j|\mathcal{Y}_t = \vec{y}_t) \\&= \sum_{i = 1}^{N}P_{\theta}(S_{t+m} = j,S_t = i|\mathcal{Y}_t = \vec{y}_t) \\&= \sum_{i = 1}^{N}P_{\theta}(S_{t+m} = j|S_t = i,\mathcal{Y}_t = \vec{y}_t) P_{\theta}(S_{t} = i|\mathcal{Y}_t = \vec{y}_t) \\ &= \sum_{i = 1}^{N}P_{\theta}(S_{t+m} = j|S_t = i)P_{\theta}(S_t = i|\mathcal{Y}_t = \vec{y}_t) \\ &= ((\Pi')^m)_{j,}\hat{\zeta}_{t|t}. \end{align*}\] Applying this to all elements we end up with: \[\begin{equation} E_{\theta}(\zeta_{t+m}|\mathcal{Y}_t = \vec{y}_t) = (\Pi')^m\hat{\zeta}_{t|t}, \end{equation}\] compare Hamilton (1994, page 693). This property will be essential later, therefore it is important to keep in mind that this indeed holds true.
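Equation (14) is easy to check numerically. The following R sketch, with made-up values for \(\Pi\) and \(\hat{\zeta}_{t|t}\), computes the m-period-ahead regime probabilities via the matrix power \((\Pi')^m\).

```r
# Sketch (R): m-period-ahead regime probabilities, equation (14).
# Pi and the filtered probabilities zeta_hat_tt are made up.
Pi <- matrix(c(0.9, 0.1,
               0.2, 0.8), nrow = 2, byrow = TRUE)
zeta_hat_tt <- c(0.7, 0.3)                 # P(S_t = j | Y_t)

mat_pow <- function(A, m) {                # naive matrix power, fine for small m
  out <- diag(nrow(A))
  for (k in seq_len(m)) out <- out %*% A
  out
}

zeta_hat_5 <- mat_pow(t(Pi), 5) %*% zeta_hat_tt   # P(S_{t+5} = j | Y_t)
zeta_hat_5
sum(zeta_hat_5)                            # the probabilities still sum to 1
```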
Autoregressive (AR) Processes
Working with time series can be tricky, as one always deals with stochastic processes. The idea is the following: if one observes a time series \(\vec{y}_T = (y_1,...,y_T)\), then the observations are the realizations of the random variables \(\mathcal{Y}_T = (Y_1,...,Y_T)\). These random variables are connected to each other, since they all stem from the same underlying stochastic process. The goal is now to estimate said stochastic process. Before estimating the parameters of a process, it is necessary to decide which process to assume as the underlying (or at least sufficiently similar) process. A frequently utilized family of processes are AR processes; an AR(m) has the form: \[\begin{align*} Y_t = c + \phi_1Y_{t-1} + ... + \phi_mY_{t-m} + U_t; \quad \text{where}\quad U_t \sim WN(0,\sigma^2). \end{align*}\] Thereby \(WN(0,\sigma^2)\) denotes a zero-mean white noise process with variance \(\sigma^2\). The white noise distribution that is most often used is the normal distribution. Therefore \[\begin{align*} Y_t = c + \phi_1Y_{t-1} + ... + \phi_mY_{t-m} + U_t; \quad \text{where}\quad U_t \sim N(0,\sigma^2), \end{align*}\] would qualify as an AR(m). This framework of an AR(m) with Gaussian white noise will be the foundation on which Markov-Switching AR models are built in the next section.
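As a small illustration, the following R sketch simulates a Gaussian AR(2) with invented parameter values; this is the kind of single-regime process that the Markov-Switching models below generalize.

```r
# Sketch (R): simulating a Gaussian AR(2); all parameter values are invented.
set.seed(1)
T_len <- 500
c0 <- 0.5; phi <- c(0.4, 0.2); sigma <- 1

y <- numeric(T_len)
y[1:2] <- c0 / (1 - sum(phi))              # start near the unconditional mean
for (t in 3:T_len) {
  y[t] <- c0 + phi[1] * y[t - 1] + phi[2] * y[t - 2] + rnorm(1, 0, sigma)
}
plot.ts(y)
```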
Markov-Switching Models (MSM)
Introduction to Markov-Switching Models
The basic idea of Markov-Switching models is that the stochastic process \(Y_1,..., Y_T\) is itself influenced by another, underlying stochastic process, in this specific case by an underlying Markov-Chain. Therefore, a Markov-Switching AR(1) could take the following form: \[\begin{equation} Y_t = c_{s_t} + \phi_{s_t}Y_{t-1} + U_{t}; \quad \text{where} \quad U_{t} \overset{i.i.d.}{\sim} N(0,\sigma^2). \end{equation}\] We could alternatively write: \[\begin{align*} Y_t = X_t'\beta_{s_t} + U_t \quad \text{with} \quad X_t = \begin{pmatrix} 1 \\Y_{t-1} \end{pmatrix} \quad \text{and} \quad \beta_{s_t} = \begin{pmatrix} c_{s_t} \\\phi_{s_t} \end{pmatrix}. \end{align*}\] Here \(S_t\) follows a first-order Markov-Chain and \(s_t\) is the value of the Markov-Chain at time \(t\). Furthermore, important assumptions are that (11) and (12) hold true, that there is a maximum lag order (in this case 1), that \(U_t\) is conditionally independent of \(S_{t-1},S_{t-2},...\) given \(S_t\), and that \(S_{t+m}\) is conditionally independent of \(U_t,U_{t-1},...\) given \(S_t\), for all \(m \geq 1\). To put it more formally, it shall hold that: \[\begin{equation} f_{U_t|S_t;\alpha}(u_t|s_t) = f_{U_t|S_t,S_{t-1},...;\alpha}(u_t|s_t, s_{t-1},...), \end{equation}\] \[\begin{equation} P_{\theta}(S_{t+m} = s_{t+m}|S_t =s_t) = P_{\theta}(S_{t+m} = s_{t+m}|S_t =s_t, U_t = u_t, U_{t-1} = u_{t-1},...). \end{equation}\] Additionally, a vector of the form of (7) would represent the vector of all observable variables until \(t\). It has to be emphasized that the density of \(Y_t\) conditioned on \(S_t = s_t\) and \(\mathcal{Y}_{t-1} = \vec{y}_{t-1}\) has the following form: \[\begin{equation} \displaystyle f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|s_t,\vec{y}_{t-1}) = \cfrac{1}{\sqrt{2\pi}\sigma}\exp\left(\cfrac{-(y_t - c_{s_t} - \phi_{s_t}y_{t-1})^2}{2\sigma^2} \right). \end{equation}\] Here we can also see that (11) is indeed true for Markov-Switching AR(m) models with Gaussian white noise, as long as states of the Markov-Chain earlier than the current state \(s_t\) do not influence the parameters that describe the generation of \(Y_t\), i.e. the intercept, the coefficients and the error-term variance. For this specific AR(1), \(\alpha\) would consist of \(c_1,...,c_N,\phi_1,...,\phi_N\) and \(\sigma^2\). We summarize the values of the conditional density functions for all potential states of the Markov-Chain in the following vector: \[\begin{equation} \eta_t = \begin{pmatrix} f_{Y_t|S_t,\mathcal{Y}_{t-1}; \alpha}(y_t|1,\vec{y}_{t-1})\\ ...\\ f_{Y_t|S_t, \mathcal{Y}_{t-1}; \alpha}(y_t|N,\vec{y}_{t-1}) \end{pmatrix}. \end{equation}\]
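To illustrate the model class, here is a minimal R sketch that simulates the Markov-Switching AR(1) from (15) and evaluates the density vector \(\eta_t\) from (18). All parameter values (transition matrix, intercepts, coefficients) are invented for the example.

```r
# Sketch (R): simulating the Markov-Switching AR(1) of (15) and evaluating eta_t from (18).
# All parameter values are invented.
set.seed(1)
T_len <- 300
Pi    <- matrix(c(0.95, 0.05,
                  0.10, 0.90), nrow = 2, byrow = TRUE)
c_s   <- c(0.5, -0.5)      # regime-dependent intercepts
phi_s <- c(0.6,  0.2)      # regime-dependent AR(1) coefficients
sigma <- 1                 # non-switching error standard deviation

s <- integer(T_len); y <- numeric(T_len)
s[1] <- 1; y[1] <- c_s[1]
for (t in 2:T_len) {
  s[t] <- sample(1:2, size = 1, prob = Pi[s[t - 1], ])
  y[t] <- c_s[s[t]] + phi_s[s[t]] * y[t - 1] + rnorm(1, 0, sigma)
}

# eta_t: conditional density of y_t under each regime, given y_{t-1}
eta <- function(t) dnorm(y[t], mean = c_s + phi_s * y[t - 1], sd = sigma)
eta(10)                              # one entry per regime
eta_mat <- t(sapply(2:T_len, eta))   # (T-1) x N matrix, used by the filter sketch below
```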
Now that this model class has been introduced, we want to further investigate the question of optimal inference regarding the states of the Markov-Chain, often called regimes. In the following sections, the goal is to estimate the parameter vector \(\theta = (\Pi,\alpha)\) given \(\mathcal{Y}_T = \vec{y}_{T}\).
Optimal Inference of the Regimes and Derivation of the Log-Likelihood
But before we follow this endeavour, we peek into a world where we assume that \(\theta\) is known. Given \(\theta\) we want to get inference regarding the regimes of the time series. We start by summarizing the \(P_{\theta}(S_t = j|\mathcal{Y}_t = \vec{y}_{t})\) and \(P_{\theta}(S_{t+1} = j|\mathcal{Y}_t = \vec{y}_{t})\) for all \(j = 1,...,N\), similarly to (13), in: \[\begin{equation} \hat{\zeta}_{t|t} = \begin{pmatrix} P_{\theta}(S_t = 1|\mathcal{Y}_t = \vec{y}_{t})\\ ...\\ P_{\theta}(S_t = N|\mathcal{Y}_t = \vec{y}_{t})\\ \end{pmatrix} \quad \text{and} \quad \hat{\zeta}_{t+1|t} = \begin{pmatrix} P_{\theta}(S_{t+1} = 1|\mathcal{Y}_t = \vec{y}_{t})\\ ...\\ P_{\theta}(S_{t+1} = N|\mathcal{Y}_t = \vec{y}_{t})\\ \end{pmatrix}. \end{equation}\] The claim Hamilton makes now is that the optimal inference can be derived by iterating over the following equations: \[\begin{equation} \displaystyle \hat{\zeta}_{t|t} = \cfrac{(\hat{\zeta}_{t|t-1} \odot \eta_t)}{\mathbf{1'}(\hat{\zeta}_{t|t-1} \odot \eta_t)}, \end{equation}\] \[\begin{equation} \displaystyle \hat{\zeta}_{t+1|t} = \Pi'\hat{\zeta}_{t|t}. \end{equation}\] Here \(\odot\) denotes element-by-element multiplication. In addition, the value of the Log-Likelihood function for the vector of all observations \(\vec{y}_{T}\) at the point \(\theta\) is gained as a side product: \[\begin{equation} \displaystyle \mathcal{L}(\theta) = \sum_{t=1}^{T}\ln(f_{Y_t|\mathcal{Y}_{t-1};\theta}(y_t|\vec{y}_{t-1})), \end{equation}\] \[\begin{equation} \displaystyle f_{Y_t|\mathcal{Y}_{t-1};\theta}(y_t|\vec{y}_{t-1}) = \mathbf{1'}(\hat{\zeta}_{t|t-1} \odot \eta_t), \end{equation}\] see Hamilton (1994, page 692). That this is indeed true is shown in the Appendix, section 9.1. Based on this system of two equations and a given \(\theta\), we start with a random \(\hat{\zeta}_{1|0}\) and iterate over all \(t\) until we reach \(T\) (\(T\) is the number of periods for which observations of the time series exist). This gives us the regime probabilities conditional on the data up to \(t\). Additionally, we get the Log-Likelihood function, which can be optimized in \(\theta\) to derive the Maximum Likelihood estimate of \(\theta\). It is important to note that a direct optimization of \(\mathcal{L}(\theta)\) can be computationally expensive and often leads to suboptimal results. Therefore, Hamilton introduced, in his paper “Analysis of Time Series subject to Changes in Regime” from 1990, an iterative optimization algorithm for \(\mathcal{L}(\theta)\), which is an application of the EM algorithm. Applying a specific variant of the EM algorithm to this optimization problem, instead of more general optimization algorithms, can lead to better and computationally less expensive results, see Hamilton (1990, page 40-41). The details of this specific application of the EM algorithm for the optimization of \(\mathcal{L}(\theta)\) are shown throughout the following sections; this includes a derivation of an algorithm developed by Kim (1994), which can improve some aspects of the application of the EM algorithm as shown in Hamilton (1990). The derivation of the “Kim-Algorithm” shown here is based on Hamilton (1994, page 700-702).
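The filter itself is only a few lines of code. The following R sketch iterates over (20) and (21) and accumulates the log-likelihood via (22) and (23); `eta_mat` is assumed to be a \(T \times N\) matrix whose t-th row is \(\eta_t\) (for instance built as in the simulation sketch above), and `zeta_init` plays the role of \(\hat{\zeta}_{1|0}\).

```r
# Sketch (R): the Hamilton filter, iterating (20)-(21) and accumulating the
# log-likelihood via (22)-(23). eta_mat is a T x N matrix with eta_t in row t.
hamilton_filter <- function(eta_mat, Pi, zeta_init = rep(1 / ncol(Pi), ncol(Pi))) {
  T_len <- nrow(eta_mat); N <- ncol(Pi)
  filtered  <- matrix(NA_real_, T_len, N)  # row t: zeta_hat_{t|t}
  predicted <- matrix(NA_real_, T_len, N)  # row t: zeta_hat_{t+1|t}
  loglik <- 0
  pred <- zeta_init                        # plays the role of zeta_hat_{1|0}
  for (t in 1:T_len) {
    joint <- pred * eta_mat[t, ]           # zeta_hat_{t|t-1} (.) eta_t
    denom <- sum(joint)                    # f(y_t | y_{t-1}; theta), equation (23)
    filtered[t, ]  <- joint / denom        # equation (20)
    predicted[t, ] <- as.vector(t(Pi) %*% filtered[t, ])   # equation (21)
    pred <- predicted[t, ]
    loglik <- loglik + log(denom)          # equation (22)
  }
  list(filtered = filtered, predicted = predicted, loglik = loglik)
}

# e.g. filt <- hamilton_filter(eta_mat, Pi) with the objects from the simulation sketch
```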
Smoothed Inference over the Regimes
Before further investigating the optimization of the log-likelihood and the associated parameter estimation, we want to first discuss how to get inference on \(P_{\theta}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T})\), as this will be essential for the application of the aforementioned EM algorithm. Estimating \(P_{\theta}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T})\) is often called “smoothed inference” for the regimes. To get this smoothed inference, we apply the algorithm from Kim (1994). The proposition is that, assuming \(S_t\) follows a first order Markov-Chain and that the conditional density, \(f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|j,\vec{y}_{t-1})\), depends only on \(S_t,S_{t-1},...\) through \(S_t\), i.e. that (11) holds true, an assumption we have already made throughout section 3.1, it shall hold that: \[\begin{equation} \displaystyle \hat{\zeta}_{t|T} = \hat{\zeta}_{t|t} \odot (\Pi(\hat{\zeta}_{t+1|T}(\div)\hat{\zeta}_{t+1|t})), \end{equation}\] where \((\div)\) is the symbol for element-by-element division, see Hamilton (1994, page 694). To get the values of \(P_{\theta}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T})\) for \(t\) in \(1,...,T\) one starts with \(t = T-1\) and iterates backwards. The derivation can be found in the Appendix, section 9.2.
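Given the filtered and predicted probabilities from the Hamilton filter sketch above, the backward recursion (25) can be implemented as follows; this is a sketch under the convention that row t of `predicted` stores \(\hat{\zeta}_{t+1|t}\).

```r
# Sketch (R): the Kim smoother, iterating equation (25) backwards from t = T-1.
# filtered and predicted are the matrices returned by hamilton_filter() above,
# where row t of predicted stores zeta_hat_{t+1|t}.
kim_smoother <- function(filtered, predicted, Pi) {
  T_len <- nrow(filtered); N <- ncol(filtered)
  smoothed <- matrix(NA_real_, T_len, N)
  smoothed[T_len, ] <- filtered[T_len, ]            # zeta_hat_{T|T}
  for (t in (T_len - 1):1) {
    ratio <- smoothed[t + 1, ] / predicted[t, ]     # zeta_hat_{t+1|T} (/) zeta_hat_{t+1|t}
    smoothed[t, ] <- filtered[t, ] * as.vector(Pi %*% ratio)   # equation (25)
  }
  smoothed
}

# e.g. smo <- kim_smoother(filt$filtered, filt$predicted, Pi)
```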
Optimization of the Conditional Log-Likelihood
General EM Algorithm Theory
We now turn to the EM algorithm. Assuming we observe \(\vec{y}_{T} = (y_1,...,y_T)\), a trick that is often utilized is to optimize a density function of the form \(f_{Y_T,...,Y_{m+1}|Y_{m},...,Y_1;\lambda}(y_T,...,y_{m+1}|y_m,...,y_1)\) in \(\lambda\) instead of \(f_{Y_T,...,Y_1;\theta}(y_T,...,y_1)\) in \(\theta\). We have to optimize in \(\lambda\) because, if we choose such a conditional likelihood function, we have to make assumptions about how the initial states \((Y_{m},...,Y_1)\) are distributed. The simplest approach is to assume that they are separately drawn from a distribution with the parameters \(\rho\). Thereby \(\rho\) shall be unrelated to \(\Pi\) and \(\alpha\). The new parameter vector \(\lambda\) is therefore defined as \(\lambda = (\Pi,\alpha,\rho)\). The conditional likelihood function is primarily chosen due to practical reasons. Optimizing the likelihood function instead is often more challenging and yields next to no additional gain. Choosing the conditional likelihood enables the application of the EM algorithm, as described by Hamilton (1990), which estimates the parameters with relatively low computational demands, at least compared to other numerical methods, see Hamilton (1990, page 40). Still, it has to be emphasized that the EM algorithm that Hamilton introduces only leads to a local maximum of the conditional likelihood function, as will be shown throughout this section. Generally speaking, this is not problematic, as one can start the algorithm with several different values to see whether the results are robust. We start by defining \[\begin{equation} P_{\lambda}(S_m = s_m,S_{m-1} = s_{m-1},...,S_1 = s_1|Y_m = y_m,...,Y_1 = y_1) = \rho_{s_m,...,s_1}, \end{equation}\] and \[\begin{equation} \rho = (\rho_{1,1,..,1},\rho_{1,1,..,2},...,\rho_{N,N,..,N}). \end{equation}\] Thereby (26) defines the population probabilities, which are collected in the vector (27). With this we can now derive the general EM algorithm: We assume we know nothing about \(\lambda\), and it should be chosen such that the conditional likelihood \[\begin{equation} f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_{m};\lambda}(\vec{y}_{T:(m+1)}|\vec{y}_{m}) = f_{Y_T,...,Y_{m+1}|Y_m,...,Y_1;\lambda}(y_T,...,y_{m+1}|y_m,...,y_1), \end{equation}\] is maximized; the maximizing \(\lambda\) is called \(\lambda_{MLE}\).
Furthermore we define \[\begin{equation} \displaystyle \mathcal{S} = (S_T,S_{T-1},...,S_1), \end{equation}\] \[\begin{equation} \displaystyle \vec{s}_T = (s_T,s_{T-1},...,s_1), \end{equation}\] \[\begin{equation} \displaystyle Z_t = (S_t,S_{t-1},...,S_{t-m},Y_{t-1},...,Y_{t-m}), \end{equation}\] \[\begin{equation} \displaystyle z_t = (s_t,s_{t-1},...,s_{t-m},y_{t-1},...,y_{t-m}), \end{equation}\] \[\begin{equation} \displaystyle f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) = f_{Y_T,...,Y_{m+1},S_T,...,S_1|Y_m,...,Y_1;\lambda}(y_T,...,y_{m+1},s_T,...,s_1|y_m,...,y_1), \end{equation}\] \[\begin{equation} \displaystyle \sum_{\vec{s}_T} P(\mathcal{S} = \vec{s}_T) \, = \sum_{s_T = 1}^{N}\cdots\sum_{s_1 = 1}^{N}P(S_T = s_T,...,S_1 = s_1), \end{equation}\] \[\begin{equation} \displaystyle f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_{m};\lambda}(\vec{y}_{T:(m+1)}|\vec{y}_{m}) = \sum_{\vec{s}_T} f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m), \end{equation}\] and \[\begin{equation} \displaystyle Q_{\lambda_l,\vec{y}_T}(\lambda_{l+1}) = \sum_{\vec{s}_T} \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). \end{equation}\] Where we call \(Q_{\lambda_l,\vec{y}_T}(\lambda_{l+1})\) the expected log-likelihood. With that remark we finish the necessary notation, which is based on Hamilton (1990, page 42-44) and Hamilton (1990, page 46-47). Now we want to show that the EM algorithm works. It is noteworthy that the EM algorithm can be seen from two different perspectives.
First we want to discuss the EM algorithm as a sequence of optimization problems. \(\hat{\lambda}_l\) is the result of the lth optimization problem, and we start with a random \(\hat{\lambda}_0\) for the first optimization problem. We choose \(\hat{\lambda}_{l+1}\) such that \(Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda_{l+1})\) is maximized. We remember: \[\begin{align*} \displaystyle Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda_{l+1}) = \sum_{\vec{s}_T} \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). \end{align*}\] Therefore \(\hat{\lambda}_{l+1}\) fulfills: \[\begin{equation} \displaystyle \sum_{\vec{s}_T} \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial\lambda_{l+1}}\Big|_{\lambda_{l+1} = \hat{\lambda}_{l+1}} \cdot f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) = 0. \end{equation}\] It holds that \(\hat{\lambda}_{l+1}\) is associated with a higher value of the conditional likelihood function than \(\hat{\lambda}_l\), i.e. \(f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)}|\vec{y}_m) \geq f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)\). In the following, we closely follow the derivation in Hamilton (1990, page 48-49). By construction \(\hat{\lambda}_{l+1}\) maximizes \(Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda_{l+1})\), thus: \[\begin{align*} Q_{\hat{\lambda}_l,\vec{y}_T}(\hat{\lambda}_{l+1}) \geq Q_{\hat{\lambda}_l,\vec{y}_T}(\hat{\lambda}_{l}) ; \quad \text{equal if} \quad \hat{\lambda}_{l+1} = \hat{\lambda}_l. \end{align*}\] We also note that \(\forall x \in \mathbb{R}^{+}: \ln(x) \leq (x-1)\). This is because we can show that \(h(x) = x - 1 - \ln(x)\) has a minimum at \(x = 1\) and \(h(1) = 0\). The first order condition is given by: \[\begin{align*} h'(x) &= 1 - \cfrac{1}{x} = 0 \\& \Leftrightarrow x = 1. \end{align*}\] That this is indeed a minimum can be shown by checking the second derivative at the point \(x = 1\): \[\begin{align*} h''(1) = \cfrac{1}{1^2} > 0, \end{align*}\] thus \(h(x)\) has a minimum at the point \(x = 1\) and \(\forall x \in \mathbb{R}^{+}: \ln(x) \leq (x-1)\) is therefore a true statement.
We can apply this now and show: \[\begin{align*} \displaystyle 0 &\leq Q_{\hat{\lambda}_l,\vec{y}_T}(\hat{\lambda}_{l+1}) - Q_{\hat{\lambda}_l, \vec{y}_T}(\hat{\lambda}_l) \\&= \sum_{\vec{s}_T}\ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\& \hspace{0.5cm} - \sum_{\vec{s}_T}\ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_T} \ln\left[\cfrac{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}\right]f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &\leq \sum_{\vec{s}_T} \left[\cfrac{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)} - 1\right]f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_T} (f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)-f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)) \\ &= f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l+1}}(\vec{y}_{T:(m+1)}|\vec{y}_m)-f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m). \end{align*}\] With that we have shown that the algorithm indeed leads to an increase of the conditional likelihood function with each step. Now we want to show that if \(\hat{\lambda}_{l+1} = \hat{\lambda}_{l}\), then the first order condition for maximizing \(f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_{m};\lambda}(\vec{y}_{T:(m+1)}|\vec{y}_{m})\) is fulfilled by \(\lambda = \hat{\lambda}_l\). This is the case because if: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda_{l+1})}{\partial \lambda_{l+1}}\Big|_{\lambda_{l+1} = \hat{\lambda}_l} = 0. 
\end{align*}\] Then: \[\begin{align*} \displaystyle \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_{m};\lambda}(\vec{y}_{T:(m+1)}|\vec{y}_{m})}{\partial \lambda}\Big|_{\lambda = \hat{\lambda}_l} = 0, \end{align*}\] this holds true because: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda_{l+1})}{\partial \lambda_{l+1}}\Big|_{\lambda_{l+1} = \hat{\lambda}_l} &= \sum_{\vec{s}_T} \left(\cfrac{1}{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)} \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{\partial \lambda_{l+1}}\right)\Big|_{\lambda_{l+1} = \hat{\lambda}_l} \\& \hspace{0.5cm} \cdot f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_T} \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{\partial \lambda_{l+1}}\Big|_{\lambda_{l+1} = \hat{\lambda}_l} \\ &= \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)}|\vec{y}_m)}{\partial \lambda_{l+1}}\Big|_{\lambda_{l+1} = \hat{\lambda}_l}. \end{align*}\] With that in mind, we proceed to consider the second perspective. One could say that the EM algorithm replaces the unobserved scores with their conditional expectation. Assuming we know \(\vec{s}_T\), then \(\hat{\lambda}_{MLE}(\vec{s}_T)\) is characterized by the first-order condition: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \lambda}\Big|_{\lambda = \hat{\lambda}_{MLE}(\vec{s}_T)} = 0. \end{align*}\] Even though we have no data regarding \(\mathcal{S}\), we still have inference regarding \(\mathcal{S}\), at least for the lth step, based on \(\hat{\lambda}_l\) and \(\mathcal{Y}_T = \vec{y}_T\). We can formulate: \[\begin{equation*} \displaystyle P_{\hat{\lambda}_l}(\mathcal{S} = \vec{s}_T|\mathcal{Y}_T = \vec{y}_T) = \cfrac{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_{m};\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)}. \end{equation*}\] For all possible values of \(\mathcal{S}\), which amounts to \(N^T\) possibilities, there exists such a first-order condition. If one weights all these FOCs with the probability \(P_{\hat{\lambda}_l}(\mathcal{S} = \vec{s}_T|\mathcal{Y}_T = \vec{y}_T)\) then we choose \(\lambda\) such that: \[\begin{align*} \displaystyle & \sum_{\vec{s}_T} \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \lambda} \cdot \cfrac{f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_{m};\hat{\lambda}_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)} = 0 \\ &\Leftrightarrow \cfrac{1}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\hat{\lambda}_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)}\cfrac{\partial Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda)}{\partial \lambda} = 0 \\ &\Leftrightarrow \cfrac{\partial Q_{\hat{\lambda}_l,\vec{y}_T}(\lambda)}{\partial \lambda} = 0. \end{align*}\] This again is the characterizing condition for \(\hat{\lambda}_{l+1}\) in the first perspective!
Application of the EM Algorithm to Markov-Switching AR Models
The next essential step in the theoretical exposition is to demonstrate the application of the EM algorithm to models of the type we are using, that is, models with an underlying Markov-Chain and a dependence on a maximum lag order. Hamilton states that, when using the EM algorithm to maximize the conditional log-likelihood, one obtains three equations to iterate over, see Hamilton (1990, page 51). The equations given by Hamilton are: \[\begin{equation} \displaystyle \pi_{i,j}^{(l+1)} = \cfrac{\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j,S_{t-1} = i|\mathcal{Y}_T = \vec{y}_T)}{\sum_{t = m+1}^{T}P_{\lambda_l}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_T)}, \end{equation}\] \[\begin{equation} \displaystyle \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \alpha}\Big|_{\alpha = \alpha_{l+1}} \cdot P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0, \end{equation}\] \[\begin{equation} \displaystyle \rho_{i_m,...,i_1}^{(l+1)} = P_{\lambda_l}(S_m = i_m,...,S_1 = i_1|\mathcal{Y}_T = \vec{y}_T). \end{equation}\]
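As an illustration of (38), the following R sketch updates the transition matrix from smoothed probabilities. The joint smoothed probabilities \(P_{\lambda_l}(S_t = j, S_{t-1} = i|\mathcal{Y}_T = \vec{y}_T)\) are computed with the identity \(P(S_{t-1} = i, S_t = j|\mathcal{Y}_T) = \pi_{i,j}\,P(S_{t-1} = i|\mathcal{Y}_{t-1})\,P(S_t = j|\mathcal{Y}_T)/P(S_t = j|\mathcal{Y}_{t-1})\), which holds under the same assumptions that underlie the Kim smoother; the function names and matrix layout follow the earlier sketches.

```r
# Sketch (R): the transition update (38). filtered, predicted and smoothed are the
# matrices from the filter/smoother sketches; row t of predicted stores zeta_hat_{t+1|t}.
update_transition <- function(filtered, predicted, smoothed, Pi, m = 1) {
  T_len <- nrow(filtered); N <- ncol(Pi)
  num <- matrix(0, N, N)    # running sums of P(S_t = j, S_{t-1} = i | Y_T)
  den <- numeric(N)         # running sums of P(S_{t-1} = i | Y_T)
  for (t in (m + 1):T_len) {
    # joint smoothed probabilities of (S_{t-1} = i, S_t = j), see the identity above
    joint <- Pi * outer(filtered[t - 1, ], smoothed[t, ] / predicted[t - 1, ])
    num <- num + joint
    den <- den + smoothed[t - 1, ]
  }
  num / den                 # row i is divided by den[i], giving pi_{i,j}^{(l+1)}
}
```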
We show that these equations are indeed true in the Appendix, section 9.3. Given this exposition one can apply the EM algorithm to a very broad class of models, even broader than just Markov-Switching AR models, but to actually estimate a specific model, one has to specify the assumed process in more detail. Hamilton (1990) shows only one potential setup for Markov-Switching AR models; we will call this setup “Example 0” throughout this post. Deriving the results for Example 0 will be the subject of the next section. After this is done, we will show five more examples, establishing broader AR setups, until the theory for estimating any potential Markov-Switching AR model has been shown. Thereby, Example 1 to Example 3 are specific examples that are introduced to improve the readability of the general cases, Example 4 and Example 5.
Example 0: Switching Coefficients and Intercept, Non-Switching \(\sigma^2\)
Here we assume that the underlying process is given by a Markov-Switching AR(m), which fulfills all assumptions formulated in section 4.1, the only difference is that the assumed underlying AR process now has the following form. \[\begin{equation} \displaystyle Y_t = c_{s_t}+\phi_{1,s_t}Y_{t-1} +...+ \phi_{m,s_t}Y_{t-m} + U_t \quad \text{where} \quad U_{t} \overset{i.i.d.}{\sim} N(0,\sigma^2). \end{equation}\] We could alternatively write: \[\begin{align*} Y_t = X_t'\beta_{s_t} + U_t \quad \text{with} \quad X_t = \begin{pmatrix} 1 \\Y_{t-1} \\... \\Y_{t-m} \end{pmatrix} \quad \text{and} \quad \beta_{s_t} = \begin{pmatrix} c_{s_t} \\\phi_{1,s_t} \\... \\\phi_{m,s_t} \end{pmatrix}. \end{align*}\] The following derivation closely follows Hamilton (1990, page 56-58). First, we see that: \[\begin{align*} \displaystyle f_{Y_t|Z_t;\alpha}(y_t|z_t) = \cfrac{1}{\sqrt{2\pi}\sigma} \exp\left(\cfrac{-(y_t-x_t'\beta_{s_t})^2}{2\sigma^2}\right). \end{align*}\] It is important to note that \(x_t\) denotes the vector of realizations for \(X_t\). Then it holds that: \[\begin{equation} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j} = \begin{cases} \cfrac{(y_t - x_t'\beta_j)x_t}{\sigma^2}, & \text{if $S_t = j$}\\ 0, & \text{otherwise} \end{cases}, \end{equation}\] \[\begin{equation} \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \sigma^{-2}} = \cfrac{\sigma^2}{2} - \cfrac{(y_t -x_t'\beta_{s_t})^2}{2}. \end{equation}\] (43) is indeed true, because: \[\begin{align*} \displaystyle \ln\left[\frac{1}{\sqrt{2\pi}\sigma} \exp\left(\frac{-(y_t-x_t'\beta_{s_t})^2}{2\sigma^2}\right)\right] &= \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\right)-\frac{1}{2}\left(\frac{(y_t-x_t'\beta_{s_t})^2}{\sigma^2}\right) \\ &= \ln(1) - \ln(\sqrt{2\pi}\sigma) - \cfrac{1}{2}(y_t - x_t'\beta_{s_t})^2\cfrac{1}{\sigma^2} \\ &= -\ln(\sqrt{2\pi}) -\ln(\sigma) - \cfrac{1}{2}(y_t-x_t'\beta_{s_t})^2\cfrac{1}{\sigma^2} \\ &= -\ln(\sqrt{2\pi}) - \ln\left(\left(\frac{1}{\sigma^2}\right)^{-\frac{1}{2}}\right) - \cfrac{1}{2}(y_t-x_t'\beta_{s_t})^2\cfrac{1}{\sigma^2}. \end{align*}\] And thus: \[\begin{align*} \cfrac{\partial \ln\left[\frac{1}{\sqrt{2\pi}\sigma} \exp\left(\frac{-(y_t-x_t'\beta_{s_t})^2}{2\sigma^2}\right)\right]}{\partial \left(\frac{1}{\sigma^2}\right)} &= \left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}}\left(\frac{1}{2}\left(\frac{1}{\sigma^2}\right)^{-\frac{3}{2}}\right)-\frac{1}{2}(y_t-x_t'\beta_{s_t})^2 \\ &= \cfrac{\sigma^2}{2} - \cfrac{(y_t-x_t'\beta_{s_t})^2}{2}. \end{align*}\] We insert these results into (39), which leads to: \[\begin{equation} \displaystyle \sum_{t = m+1}^{T}\cfrac{(y_t-x_t'\beta_j^{(l+1)})x_t}{\sigma_{(l+1)}^2} P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0, \end{equation}\] and \[\begin{equation} \displaystyle \sigma_{(l+1)}^2 = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t\sqrt{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)} - x_t'\sqrt{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}\beta_{s_t}^{(l+1)})^2}{(T-m)}. \end{equation}\]
(44) is true because (42) can be understood as a function of \(S_t\), we could write \[\begin{align*} \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j} = g_{j}(S_t) = \begin{cases} \cfrac{(y_t - x_t'\beta_j)x_t}{\sigma^2}, & \text{if $S_t = j$}\\ 0, & \text{otherwise} \end{cases}, \end{align*}\] thus we can write: \[\begin{align*} & \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\sum_{s_{t-1} = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_{j}}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\ & \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\sum_{s_{t-1} = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N} g_j(s_t) P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\ & \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}g_j(s_t)\sum_{s_{t-1} = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\ & \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}g_j(s_t) P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\ & \Leftrightarrow \sum_{t = m+1}^{T}\cfrac{(y_t-x_t'\beta_{j}^{(l+1)})x_t}{\sigma_{(l+1)}^2} P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\]
And (45) is true because: \[\begin{align*} &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\left(\cfrac{\sigma_{(l+1)}^2}{2}-\cfrac{(y_t-x_t'\beta_{s_t}^{(l+1)})^2}{2}\right)P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\ &\Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\left(\cfrac{\sigma_{(l+1)}^2}{2} - \cfrac{(y_t -x_t'\beta_{s_t}^{(l+1)})^2}{2}\right)P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\ &\Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)\cfrac{\sigma_{(l+1)}^2}{2} = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N} \cfrac{(y_t -x_t'\beta_{s_t}^{(l+1)})^2}{2}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow (T-m)\cfrac{\sigma_{(l+1)}^2}{2} = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t - x_t'\beta_{s_t}^{(l+1)})^2}{2}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\ &\Leftrightarrow \sigma_{(l+1)}^2 = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t - x_t'\beta_{s_t}^{(l+1)})^2}{(T-m)}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\ &\Leftrightarrow \sigma_{(l+1)}^2 = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t\sqrt{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)} - x_t'\sqrt{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}\beta_{s_t}^{(l+1)})^2}{(T-m)}. \end{align*}\] Conveniently, in this specific case we can estimate all parameters via an OLS regression. Let us assume we have the parameter vector from the previous iteration, \(\lambda_l\) (to start the algorithm, one starts with a random \(\lambda_0\)); we first define: \[\begin{align*} \displaystyle y_t^* = y_t\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} \quad \text{and} \quad x_t^* = x_t\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}. \end{align*}\] Then we regress \(y_t^*\) on \(x_t^*\). Thus \(\sum_{t = m+1}^{T}(y_t^* - (x_t^*)'\beta_j)^2\) should be minimized, which leads to the following FOC that characterizes \(\beta_j^{(l+1)}\): \[\begin{align*} &\sum_{t = m+1}^{T}2(y_t^* - (x_t^*)'\beta_j^{(l+1)})(-1)x_t^* = 0 \\ &\Leftrightarrow \sum_{t = m+1}^{T}(y_t\sqrt{P_{\lambda_l}(S_t = j |\mathcal{Y}_T = \vec{y}_T)} - x_t'\sqrt{P_{\lambda_l}(S_t = j |\mathcal{Y}_T = \vec{y}_T)}\beta_j^{(l+1)})x_t\sqrt{P_{\lambda_l}(S_t = j |\mathcal{Y}_T = \vec{y}_T)} = 0 \\ &\Leftrightarrow \sum_{t = m+1}^{T}(y_t - x_t'\beta_j^{(l+1)}) x_t P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] This is equivalent to condition (44). This is now done \(N\) times to estimate \(\beta_1^{(l+1)},...,\beta_N^{(l+1)}\). By squaring and summing the residuals we get our estimate of \(\sigma^2\): \[\begin{align*} \displaystyle \sigma_{(l+1)}^2 = \cfrac{1}{(T-m)}\sum_{j = 1}^{N}\sum_{t = m+1}^{T}(y_t^* - (x_t^*)'\beta_j^{(l+1)})^2. \end{align*}\]
Summarizing, one can say that we obtain our estimate for the \(\beta_j\) of the next iteration by solving the following optimization problem: \[\begin{equation} \displaystyle \arg\min_{\beta_j} \sum_{t = m+1}^{T}(y_t^* - (x_t^*)'\beta_j)^2. \end{equation}\] And then calculate our estimate of the \(\sigma^2\) of the next iteration with: \[\begin{equation} \displaystyle \sigma^2 = \cfrac{1}{(T-m)}\sum_{j = 1}^{N}\sum_{t = m+1}^{T}(y_t^* - (x_t^*)'\beta_j)^2. \end{equation}\] Therefore, we now have a specific algorithm for estimating the parameters of a Markov-Switching AR model where all coefficients switch and the error term variance does not switch. This is the case presented in Hamilton (1990, page 56-58). Sadly, Hamilton does not show how to apply the EM algorithm to broader structures of AR models; therefore, the following examples apply the EM algorithm to more broadly defined AR processes. To the best of the author's knowledge, this presentation of applying the EM algorithm to these more broadly defined AR processes is novel.
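A compact R sketch of this M-step (conditions (44) to (47)) could look as follows; `X` is assumed to be the regressor matrix with a leading column of ones and `smoothed` the matrix of smoothed probabilities \(P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)\).

```r
# Sketch (R): the M-step of Example 0 as N weighted OLS regressions with a pooled
# variance, i.e. conditions (44)-(47). X has a leading column of ones; smoothed
# holds P(S_t = j | Y_T) in column j.
m_step_example0 <- function(y, X, smoothed, m) {
  T_len <- length(y); N <- ncol(smoothed); idx <- (m + 1):T_len
  beta <- matrix(NA_real_, ncol(X), N)
  rss  <- 0
  for (j in 1:N) {
    w <- sqrt(smoothed[idx, j])
    fit <- lm.fit(X[idx, , drop = FALSE] * w, y[idx] * w)  # solves (44) for beta_j
    beta[, j] <- fit$coefficients
    rss <- rss + sum(fit$residuals^2)                      # feeds into (45)/(47)
  }
  list(beta = beta, sigma2 = rss / (T_len - m))
}
```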
Example 1: Non-Switching Intercept
From the previous derivations, we know that the equations (38), (39), and (40) hold. The five applications of the EM algorithm that we now specifically present are within the class of models for which these three equations were originally derived, i.e. an autoregressive process with a maximum lag order. As opposed to Example 0 we vary the parameters influenced by the underlying Markov-Chain. It is important to emphasize that all models presented are still Markov-Switching AR(m) models with Gaussian white noise that fulfill the assumptions made in section 4.1, i.e. that fulfill (11), (12), (16) and (17). Assumption (11) is fulfilled because the parameters that describe the generation of \(Y_t\) are only directly influenced by the current state of the Markov-Chain \(s_t\) and not by earlier states of the Markov-Chain \(s_{t-1},s_{t-2},...\), as described in section 4.1. With that short general discussion out of the way we now turn to Example 1. This time we assume that the coefficients switch, while the intercept and the error term variance do not switch. The assumed process therefore has the following form: \[\begin{align*} Y_t = c + \phi_{1,s_t}Y_{t-1}+...+\phi_{m,s_t}Y_{t-m} + U_t; \quad \text{where} \quad U_{t} \overset{i.i.d.}{\sim} N(0,\sigma^2). \end{align*}\] We could alternatively write: \[\begin{align*} Y_t = c + X_t'\phi_{s_t} + U_t \quad \text{with} \quad X_t = \begin{pmatrix} Y_{t-1} \\Y_{t-2} \\... \\Y_{t-m} \end{pmatrix}, \quad \phi_{s_t} = \begin{pmatrix} \phi_{1,s_t} \\\phi_{2,s_t} \\... \\\phi_{m,s_t} \end{pmatrix} \quad \text{and} \quad \beta_{s_t} = \begin{pmatrix} c \\\phi_{1,s_t} \\\phi_{2,s_t} \\... \\\phi_{m,s_t} \end{pmatrix}. \end{align*}\]
We start again with the conditional log-likelihood, which has the following form: \[\begin{align*} \displaystyle \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t)) = \ln(\cfrac{1}{\sqrt{2\pi}\sigma}) - \cfrac{(y_t - c - x_t'\phi_{s_t})^2}{2\sigma^2}. \end{align*}\] Now we proceed very similarly to Example 0: we take the derivative with respect to \(\alpha\), only that now there is a switching and a non-switching part in \(\beta\), so we have to take the derivative with respect to each. It holds that: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial c} = \cfrac{(y_t - c -x_t'\phi_{s_t})}{\sigma^2} \quad \forall s_t \in \{1,...,N\}. \end{align*}\] We substitute our result in (39): \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{(y_t - c_{(l+1)} -x_t'\phi_{s_t}^{(l+1)})}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t - c_{(l+1)} -x_t'\phi_{s_t}^{(l+1)})}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{(l+1)} -x_t'\phi_{s_t}^{(l+1)})P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - x_t'\phi_{s_t}^{(l+1)})P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}c_{(l+1)}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - x_t'\phi_{s_t}^{(l+1)})P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = (T-m)c_{(l+1)} \\& \Leftrightarrow c_{(l+1)} = \cfrac{1}{(T-m)}\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - x_t'\phi_{s_t}^{(l+1)})P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] The last result: \[\begin{equation} \displaystyle c_{(l+1)} = \cfrac{1}{(T-m)}\sum_{t = m+1}^{T}\sum_{j = 1}^{N}(y_t - x_t'\phi_{j}^{(l+1)})P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) \end{equation}\] can be understood as a constraint. Next, we take the derivative in \(\phi_{j}\): \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \phi_j} = \cfrac{(y_t - c - x_t'\phi_j)x_t}{\sigma^2}, \quad \text{if $S_t = j$, else $0$}. \end{align*}\] We insert in (39): \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \phi_j}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\cfrac{(y_t - c_{(l+1)} - x_t'\phi_j^{(l+1)})x_t}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}(y_t - c_{(l+1)} - x_t'\phi_j^{(l+1)})x_tP_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] This is equivalent to the FOC of the following optimization problem: \[\begin{equation} \displaystyle \arg\min_{\phi_j} \sum_{t = m+1}^{T}\left((y_t - c)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - x_t'\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\phi_j\right)^2. \end{equation}\] Finally, it remains to differentiate with respect to \(\sigma^2\). In this special case, the differentiation proceeds exactly as in Hamilton’s example, since \(\sigma^2\) still does not switch.
Thus, we have: \[\begin{align*} \displaystyle \sigma_{(l+1)}^2 = \cfrac{1}{(T-m)}\sum_{t = m+1}^T\sum_{j = 1}^{N}(y_t - c_{(l+1)} - x_t'\phi_j^{(l+1)})^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] It thus becomes apparent that \(\sigma^2\) does not affect the conditions for \(c\) and \(\phi_j\), while these in turn do influence the condition for \(\sigma^2\). Accordingly, as in Hamilton’s example, one can first solve for \(\beta_j\) before determining \(\sigma^2\). However, what changed is that finding \(\beta_j\) is no longer a simple optimization step, since two conditions must now be satisfied simultaneously, namely those for \(\phi_j\) and \(c\). It turns out that simultaneously satisfying both conditions is equivalent to optimizing expression (49) subject to the constraint given by (48). We therefore need to solve the following optimization problem for \(\beta_j\): \[\begin{align*} \displaystyle &\arg\min_{\phi_j} \sum_{t = m+1}^{T}\left((y_t - c)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - x_t'\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\phi_j\right)^2 \quad s.t. \\&c = \cfrac{1}{(T-m)}\sum_{t = m+1}^{T}\sum_{j = 1}^{N}(y_t - x_t'\phi_{j})P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] Now we can use this \(\beta_j\), similarly to Hamilton’s example, and get: \[\begin{align*} \displaystyle \sigma^2 = \cfrac{1}{(T-m)}\sum_{t = m+1}^T\sum_{j = 1}^{N}(y_t - c - x_t'\phi_j)^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] This completes our \(\alpha\) vector for the next iteration.
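One simple way to satisfy the two conditions jointly in practice is to alternate between the weighted \(\phi_j\) regressions and the constraint (48) for \(c\) until the intercept stops changing. The following R sketch does exactly that; convergence of this back-and-forth is assumed here without a formal argument, and `X` contains only the lags (no column of ones), matching the definition of \(X_t\) in this example.

```r
# Sketch (R): Example 1, alternating between the weighted phi_j regressions and the
# constraint (48) for c until the intercept stops changing. Convergence of this
# back-and-forth is assumed, not proven. X contains only the lags (no column of ones).
m_step_example1 <- function(y, X, smoothed, m, tol = 1e-8, max_iter = 500) {
  T_len <- length(y); N <- ncol(smoothed); idx <- (m + 1):T_len
  c_hat <- 0
  phi <- matrix(0, ncol(X), N)
  for (iter in 1:max_iter) {
    c_old <- c_hat
    for (j in 1:N) {                     # phi_j: weighted OLS of (y - c) on x, see (49)
      w <- sqrt(smoothed[idx, j])
      phi[, j] <- lm.fit(X[idx, , drop = FALSE] * w, (y[idx] - c_hat) * w)$coefficients
    }
    resid_mat <- y[idx] - X[idx, , drop = FALSE] %*% phi     # column j: y_t - x_t' phi_j
    c_hat <- sum(resid_mat * smoothed[idx, ]) / (T_len - m)  # constraint (48)
    if (abs(c_hat - c_old) < tol) break
  }
  sigma2 <- sum((resid_mat - c_hat)^2 * smoothed[idx, ]) / (T_len - m)
  list(c = c_hat, phi = phi, sigma2 = sigma2)
}
```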
Example 2: Switching Intercept and Non-Switching Coefficients
This time, we reverse the roles; instead of letting the coefficients switch, now only the intercept switches. Therefore, the assumed process would have the following form: \[\begin{align*} Y_t = c_{s_t} + \phi_{1}Y_{t-1}+...+\phi_{m}Y_{t-m} + U_t; \quad \text{where} \quad U_{t} \overset{i.i.d.}{\sim} N(0,\sigma^2), \end{align*}\] we could alternatively write: \[\begin{align*} Y_t = c_{s_t} + X_t'\phi + U_t \quad \text{with} \quad X_t = \begin{pmatrix} Y_{t-1} \\Y_{t-2} \\... \\Y_{t-m} \end{pmatrix}, \quad \phi = \begin{pmatrix} \phi_{1} \\\phi_{2} \\... \\\phi_{m} \end{pmatrix} \quad \text{and} \quad \beta_{s_t} = \begin{pmatrix} c_{s_t} \\\phi_{1} \\\phi_{2} \\... \\\phi_{m} \end{pmatrix}. \end{align*}\]
In this case, we differentiate with respect to \(\phi\), \(c_j\), and \(\sigma^2\). It is important to note that the error term variance still does not switch. For this setup, we can write: \[\begin{align*} \displaystyle \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t)) = \ln(\cfrac{1}{\sqrt{2\pi}\sigma}) - \cfrac{(y_t - c_{s_t} - x_t'\phi)^2}{2\sigma^2}. \end{align*}\] First we differentiate with respect to \(\phi\) and get: \[\begin{align*} \displaystyle &\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \phi} = \cfrac{(y_t - c_{s_t} -x_t'\phi)x_t}{\sigma^2} \quad \forall s_t \in \{1,...,N\}. \end{align*}\] We insert this in (39), which leads us to: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{(y_t - c_{s_t}^{(l+1)} -x_t'\phi^{(l+1)})x_t}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t - c_{s_t}^{(l+1)} - x_t'\phi^{(l+1)})x_t}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{s_t}^{(l+1)} - x_t'\phi^{(l+1)})x_tP_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{s_t}^{(l+1)})x_tP_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}x_t' \phi^{(l+1)} x_t P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{s_t}^{(l+1)})x_tP_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) x_t x_t' \phi^{(l+1)} \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{s_t}^{(l+1)})x_tP_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \left(\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) x_tx_t'\right) \phi^{(l+1)} \\& \Leftrightarrow \phi^{(l+1)} = \left(\sum_{t = m+1}^{T}x_tx_t'\right)^{-1}\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - c_{s_t}^{(l+1)})x_tP_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] Next we differentiate with respect to \(c_j\): \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial c_j} = \cfrac{(y_t - c_{j} -x_t'\phi)}{\sigma^2} \quad \text{if $S_t = j$, else 0}. \end{align*}\] Similarly, we insert the result in (39): \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial c_j}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\cfrac{(y_t - c_j^{(l+1)} - x_t'\phi^{(l+1)})}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}(y_t - c_j^{(l+1)} - x_t'\phi^{(l+1)})P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] It should be noted that the last equation is the FOC of the following optimization problem: \[\begin{equation} \displaystyle \arg\min_{c_j} \sum_{t = m+1}^{T}\left((y_t - x_t'\phi)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - c_j\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\right)^2. \end{equation}\] For \(\sigma^2\), the same condition applies as already formulated by Hamilton, because \(\sigma^2\) again does not switch.
Thus, we are in a very similar situation as in the previous example, because the conditions resulting from the derivative with respect to \(c_j\), as well as the condition resulting from the derivative with respect to \(\phi\), must be satisfied simultaneously and affect the condition for \(\sigma^2\), whereas \(\sigma^2\) does not affect the former conditions. Therefore, one can first satisfy the first two conditions for all \(j\) in order to obtain \(\beta_j\) for all \(j\), and then determine \(\sigma^2\) for the next iteration. Furthermore, it follows again that \(\beta_j\), for a given \(j\), can be found by solving an optimization problem with an equality constraint. In order to obtain \(\beta_j\), the following optimization problem must be solved: \[\begin{align*} \displaystyle &\arg\min_{c_j} \sum_{t = m+1}^{T}\left((y_t - x_t'\phi)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - c_j\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\right)^2 \quad s.t. \\&\phi = \left(\sum_{t = m+1}^{T}x_tx_t'\right)^{-1}\sum_{t = m+1}^{T}\sum_{j = 1}^{N}(y_t - c_j)x_tP_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T). \end{align*}\] As usual we compute: \[\begin{align*} \displaystyle \sigma^2 = \cfrac{1}{(T-m)}\sum_{t = m+1}^T\sum_{j = 1}^{N}(y_t - c_j - x_t'\phi)^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T), \end{align*}\] and with that we get our \(\alpha\) vector for the next iteration.
Example 3: All Parameters Switch
As a third example, we now consider the case in which all parameters are allowed to switch. That is, we now allow not only the coefficients and the intercept to switch, but also the variance of the error term. It is important to note that in the setups discussed earlier one could have made stronger assumptions; namely, in Example 0 to Example 2 it would have been easier to assume that \(S_t\) is independent of \(U_{\tau}\) for all \(t\) and \(\tau\). Instead we made the weaker assumptions (16) and (17) so that we can now introduce models where the error term variance is allowed to switch, which would not have been possible with the stronger set of assumptions. That is the reason why all derivations earlier were done with this weaker set of assumptions. That said, we want to point out that we still make the same assumptions as in section 4.1; this is possible due to the weaker set of assumptions, since a stronger set of assumptions would not allow for models like Example 3 and Example 5. Additionally, it should be noted that, in terms of notation, we now again include a \(1\) as the first element of \(X_t\) to represent the intercept. This leads to the following process formulation: \[\begin{align*} Y_t = c_{s_t} + \phi_{1,s_t}Y_{t-1}+...+\phi_{m,s_t}Y_{t-m} + U_t; \quad \text{where} \quad U_{t} \sim N(0,\sigma_{s_t}^2), \end{align*}\] we could alternatively write: \[\begin{align*} Y_t = X_t'\beta_{s_t} + U_t \quad \text{with} \quad X_t = \begin{pmatrix} 1 \\Y_{t-1} \\Y_{t-2} \\... \\Y_{t-m} \end{pmatrix} \quad \text{and} \quad \beta_{s_t} = \begin{pmatrix} c_{s_t} \\\phi_{1,s_t} \\\phi_{2,s_t} \\... \\\phi_{m,s_t} \end{pmatrix}. \end{align*}\]
As usual we start with the conditional log-likelihood: \[\begin{align*} \displaystyle \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t)) = \ln(\cfrac{1}{\sqrt{2\pi}\sigma_{s_t}}) - \cfrac{(y_t - x_t'\beta_{s_t})^2}{2\sigma_{s_t}^2}. \end{align*}\] We now need to take the derivative once with respect to \(\beta_j\) and once with respect to \(\sigma_{j}^2\) for a given \(j\). If we first take the derivative with respect to \(\beta_j\), we obtain: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j} = \cfrac{(y_t - x_t'\beta_j)x_t}{\sigma_{j}^2} \quad \text{if $S_t = j$, else $0$}. \end{align*}\] We now substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\cfrac{(y_t - x_t'\beta_j^{(l+1)})x_t}{\sigma_{j,(l+1)}^2}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] We again observe that this corresponds to the FOC of the following optimization problem: \[\begin{align*} \arg\min_{\beta_j} \sum_{t = m+1}^{T}\left(\cfrac{y_t}{\sigma_j}\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - \cfrac{x_t'}{\sigma_j}\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\beta_j\right)^2. \end{align*}\] Next we take the derivative with respect to \(\sigma_{j}^{-2}\) and get: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \sigma_{j}^{-2}} = \cfrac{\sigma_j^{2}}{2} - \cfrac{(y_t - x_t'\beta_j)^2}{2} \quad \text{if $S_t = j$, else $0$}. \end{align*}\] We substitute the result into (39), this leads to: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \sigma_{j}^{-2}}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\left(\cfrac{\sigma_{j,(l+1)}^{2}}{2} - \cfrac{(y_t - x_t'\beta_j^{(l+1)})^2}{2}\right)P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}(\sigma_{j,(l+1)}^{2} - (y_t - x_t'\beta_j^{(l+1)})^2)P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sigma_{j,(l+1)}^{2}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}(y_t - x_t'\beta_j^{(l+1)})^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sigma_{j,(l+1)}^{2}\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}(y_t - x_t'\beta_j^{(l+1)})^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sigma_{j,(l+1)}^{2} = \cfrac{\sum_{t = m+1}^{T}(y_t - x_t'\beta_j^{(l+1)})^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}. \end{align*}\] As a result, since \(\sigma_j\) now affects the condition for \(\beta_j\) and vice versa, we once again arrive at a constrained optimization problem, which leads to \(\alpha\) of the next iteration.
The constrained optimization problem is given by: \[\begin{align*} &\arg\min_{\beta_j} \sum_{t = m+1}^{T}\left(\cfrac{y_t}{\sigma_j}\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - \cfrac{x_t'}{\sigma_j}\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\beta_j\right)^2 \quad s.t. \\&\sigma_j = \sqrt{\cfrac{\sum_{t = m+1}^{T}(y_t - x_t'\beta_j)^2P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}. \end{align*}\]
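To make this M-step concrete, the following R sketch iterates the two coupled conditions, assuming the smoothed probabilities \(P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)\) are already available as the columns of a matrix `p_sm`, and that `y` and the regressor matrix `X` (with a leading column of ones) are already restricted to \(t = m+1,\dots,T\). Function and variable names are illustrative; this is not MSARM's actual implementation.

```r
# Hedged sketch of the M-step for Example 3 (all parameters switch).
# p_sm[t, j] is assumed to hold the smoothed probability P(S_t = j | Y_T).
m_step_example3 <- function(y, X, p_sm, n_iter = 20) {
  N <- ncol(p_sm); k <- ncol(X)
  beta   <- matrix(0, nrow = k, ncol = N)
  sigma2 <- rep(var(y), N)
  for (iter in 1:n_iter) {          # iterate the coupled conditions for beta_j and sigma_j^2
    for (j in 1:N) {
      w <- p_sm[, j]                # smoothed regime probabilities as weights
      # weighted least squares step for beta_j (weights P(S_t = j | Y_T))
      beta[, j] <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
      res       <- y - X %*% beta[, j]
      sigma2[j] <- sum(w * res^2) / sum(w)   # variance update from the sigma_j^2 condition
    }
  }
  list(beta = beta, sigma2 = sigma2)
}
```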
Example 4: Arbitrary Subset-Switching of \((c,\phi)\) and Non-Switching \(\sigma^2\)
The three previous examples are essentially special cases of the two model formulations that follow. Therefore, the next two examples represent the most general forms of Markov-Switching AR(m) models that will appear in this paper. The model class introduced next assumes that \(\sigma^2\) does not switch, and that an arbitrary subset of \(\beta\) is allowed to switch. Accordingly, we divide the parameter vector \(\beta\) into the switching components, denoted by \(\beta^S\), and the non-switching components, denoted by \(\beta^F\). The underlying process is therefore formulated as follows: \[\begin{align*} Y_t = (X_t^F)'\beta^F + (X_t^S)'\beta_{s_t}^S +U_t; \quad \text{where} \quad U_{t} \overset{i.i.d.}{\sim} N(0,\sigma^2) \quad \text{and} \quad X_t = \begin{pmatrix} 1 \\Y_{t-1} \\Y_{t-2} \\... \\Y_{t-m} \end{pmatrix}. \end{align*}\] Thereby \(X_t^F\) and \(X_t^S\) are defined such that their elements do not overlap and that the elements of both vectors together are the elements of \(X_t\). To put it more formally: Let \(I^F, I^S \subset \{1, \dots, m+1\}\) be disjoint index sets such that \(I^F \cap I^S = \emptyset\) and \(I^F \cup I^S = \{1, \dots, m+1\}\), where \(m+1\) is the number of coefficients plus intercept. Then we define \[ X_t^F = ((X_{t})_i)_{i \in I^F}, \quad X_t^S = ((X_{t})_i)_{i \in I^S}. \] Again we start with the conditional log-likelihood: \[\begin{align*} \displaystyle \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t)) = \ln(\cfrac{1}{\sqrt{2\pi}\sigma}) - \cfrac{(y_t - (x_t^S)'\beta_{s_t}^S - (x_t^F)'\beta^F)^2}{2\sigma^2}. \end{align*}\] Accordingly, for this model class, we need to take the derivative with respect to \(\sigma^2\), \(\beta_j^S\), and \(\beta^F\). We will start with \(\beta_j^S\): \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j^S} = \cfrac{(y_t - (x_t^S)'\beta_j^S - (x_t^F)'\beta^F)x_{t}^{S}}{\sigma^2} \quad \text{if $S_t = j$, else $0$}. \end{align*}\] We can now substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j^S}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^S P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] This, in turn, corresponds to the FOC of the following optimization problem: \[\begin{align*} \arg\min_{\beta_j^S} \sum_{t = m+1}^{T}\left((y_t - (x_t^F)'\beta^F)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - (x_t^S)'\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\beta_j^S\right)^2. \end{align*}\] Next, if we take the derivative with respect to \(\beta^F\), we obtain: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta^F} = \cfrac{(y_t - (x_t^S)'\beta_{s_t}^S - (x_t^F)'\beta^F)x_t^F}{\sigma^2} \quad \forall s_t \in \{1,...,N\}. 
\end{align*}\] We substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^F}{\sigma_{(l+1)}^2}P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(x_t^F)'\beta_{(l+1)}^Fx_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}(x_t^F)'\beta_{(l+1)}^Fx_t^F \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}x_t^F(x_t^F)'\beta_{(l+1)}^F \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\right)\beta_{(l+1)}^F \\& \Leftrightarrow \beta_{(l+1)}^F = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\right)^{-1}\left(\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)\right). \end{align*}\] For this generalized model class, where arbitrary parameters can switch except for the error term variance, which cannot switch, we thus obtain the following constrained optimization problem to determine \(\beta^F\) and \(\beta_j^S\) for all \(j\): \[\begin{align*} &\arg\min_{\beta_j^S} \sum_{t = m+1}^{T}\left((y_t - (x_t^F)'\beta^F)\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)} - (x_t^S)'\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}\beta_j^S\right)^2 \quad s.t. \\& \beta^F = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\right)^{-1}\left(\sum_{t = m+1}^{T}\sum_{j = 1}^{N}(y_t - (x_t^S)'\beta_j^S)x_t^F P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)\right). \end{align*}\] We then calculate, as usual: \[\begin{align*} \displaystyle \sigma^2 = \cfrac{1}{(T-m)}\sum_{t = m+1}^T\sum_{j = 1}^{N}(y_t - (x_t^S)'\beta_j^S - (x_t^F)'\beta^F)^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T), \end{align*}\] and thus obtain our new \(\alpha\) vector for the next iteration.
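A hedged R sketch of this M-step is given below. It assumes that `XF` and `XS` contain the non-switching and switching regressors, that `y`, `XF`, `XS` and the smoothed probabilities `p_sm` are already restricted to \(t = m+1,\dots,T\), and it simply iterates the constrained problem; the names and the fixed number of iterations are illustrative, not MSARM's implementation.

```r
# Hedged sketch of the M-step for Example 4 (subset of beta switches, sigma^2 does not).
m_step_example4 <- function(y, XF, XS, p_sm, n_iter = 50) {
  N <- ncol(p_sm); T_eff <- length(y)   # T_eff corresponds to T - m after trimming
  betaS <- matrix(0, nrow = ncol(XS), ncol = N)
  betaF <- rep(0, ncol(XF))
  for (iter in 1:n_iter) {
    # beta^F from its closed-form FOC, averaging over regimes with smoothed weights
    rhs <- Reduce(`+`, lapply(1:N, function(j)
      t(XF) %*% (p_sm[, j] * (y - XS %*% betaS[, j]))))
    betaF <- solve(t(XF) %*% XF, rhs)
    # beta_j^S via weighted least squares on the residual y - XF %*% betaF
    r <- y - XF %*% betaF
    for (j in 1:N) {
      w <- p_sm[, j]
      betaS[, j] <- solve(t(XS) %*% (w * XS), t(XS) %*% (w * r))
    }
  }
  # pooled variance update, dividing by T - m as in the text
  sigma2 <- sum(sapply(1:N, function(j)
    sum(p_sm[, j] * (y - XF %*% betaF - XS %*% betaS[, j])^2))) / T_eff
  list(betaF = betaF, betaS = betaS, sigma2 = sigma2)
}
```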
Example 5: Arbitrary Subset-Switching of \((c,\phi)\) and Switching \(\sigma^2\)
With Example 5, we complete the generalization of the application of the EM algorithm to underlying AR(m) models, as Example 4 and Example 5 together allow for selecting any arbitrary subset of parameters in an AR(m) context for switching. In this final example, we assume the following underlying process: \[\begin{align*} Y_t = (X_t^F)'\beta^F + (X_t^S)'\beta_{s_t}^S +U_t; \quad \text{where} \quad U_{t} \sim N(0,\sigma_{s_t}^2) \quad \text{and} \quad X_t = \begin{pmatrix} 1 \\Y_{t-1} \\Y_{t-2} \\... \\Y_{t-m} \end{pmatrix}. \end{align*}\] Thereby, \(X_t^F\) and \(X_t^S\) are defined such that their elements do not overlap and that the elements of both vectors together are the elements of \(X_t\). To put it more formally: Let \(I^F, I^S \subset \{1, \dots, m+1\}\) be disjoint index sets such that \(I^F \cap I^S = \emptyset\) and \(I^F \cup I^S = \{1, \dots, m+1\}\), where \(m+1\) is the number of coefficients plus intercept. Then we define \[ X_t^F = ((X_{t})_i)_{i \in I^F}, \quad X_t^S = ((X_{t})_i)_{i \in I^S}. \] Again, we start with the conditional log-likelihood, which in this case is: \[\begin{align*} \displaystyle \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t)) = \ln\left(\cfrac{1}{\sqrt{2\pi}\sigma_{s_t}}\right) - \cfrac{(y_t - (x_t^S)'\beta_{s_t}^S - (x_t^F)'\beta^F)^2}{2\sigma_{s_t}^2}. \end{align*}\] Accordingly, we need to take the derivative with respect to \(\sigma_j^2\), \(\beta_j^S\), and \(\beta^F\). We begin with the derivative with respect to \(\beta_j^S\): \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j^S} = \cfrac{(y_t - (x_t^S)'\beta_j^S - (x_t^F)'\beta^F)x_{t}^{S}}{\sigma_j^2} \quad \text{if $S_t = j$, else $0$}. \end{align*}\] We now substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta_j^S}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\cfrac{(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^S}{\sigma_{j,(l+1)}^2}P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0. \end{align*}\] This, in turn, corresponds to the FOC of the following optimization problem: \[\begin{align*} \arg\min_{\beta_j^S} \sum_{t = m+1}^{T}\left((y_t - (x_t^F)'\beta^F)\cfrac{\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}{\sigma_j} - (x_t^S)'\cfrac{\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}{\sigma_j}\beta_j^S\right)^2. \end{align*}\] We now move on to the next step and take the derivative with respect to \(\beta^F\), obtaining: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \beta^F} = \cfrac{(y_t - (x_t^S)'\beta_{s_t}^S - (x_t^F)'\beta^F)x_t^F}{\sigma_{s_t}^2} \quad \forall s_t \in \{1,...,N\}. 
\end{align*}\] We then substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^F}{\sigma_{s_t,{(l+1)}}^2}P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cfrac{(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)x_t^F}{\sigma_{s_t,(l+1)}^2}P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} = \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(x_t^F)'\beta_{(l+1)}^Fx_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} = \sum_{t = m+1}^{T}(x_t^F)'\beta_{(l+1)}^Fx_t^F\sum_{s_t = 1}^{N}\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} = \sum_{t = m+1}^{T}x_t^F(x_t^F)'\beta_{(l+1)}^F\sum_{s_t = 1}^{N}\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} = \sum_{t = m+1}^{T}x_t^F(x_t^F)'\sum_{s_t = 1}^{N}\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2}\beta_{(l+1)}^F \\& \Leftrightarrow \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2} = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\sum_{s_t = 1}^{N}\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2}\right)\beta_{(l+1)}^F \\& \Leftrightarrow \beta_{(l+1)}^F = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\sum_{s_t = 1}^{N}\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2}\right)^{-1}\left(\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}(y_t - (x_t^S)'\beta_{s_t,(l+1)}^S)x_t^F\cfrac{P_{\lambda_l}(S_t = s_t|\mathcal{Y}_T = \vec{y}_T)}{\sigma_{s_t,(l+1)}^2}\right). \end{align*}\] As a third step, we now take the derivative with respect to \(\sigma_j^{-2}\) and obtain: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \sigma_j^{-2}} = \cfrac{\sigma_j^2}{2} - \cfrac{(y_t - (x_t^S)'\beta_j^S - (x_t^F)'\beta^F)^2}{2} \quad \text{if $S_t = j$ else $0$}. 
\end{align*}\] We now substitute this into (39) and obtain: \[\begin{align*} \displaystyle &\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \sigma_j^{-2}}\Big|_{\alpha = \alpha^{(l+1)}} P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\left(\cfrac{\sigma_{j,(l+1)}^2}{2} - \cfrac{(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)^2}{2} \right)P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = 0 \\& \Leftrightarrow \sum_{t = m+1}^{T}\sigma_{j,(l+1)}^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sigma_{j,(l+1)}^2\sum_{t = m+1}^{T} P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) = \sum_{t = m+1}^{T}(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T) \\& \Leftrightarrow \sigma_{j,(l+1)}^2 = \cfrac{\sum_{t = m+1}^{T}(y_t - (x_t^S)'\beta_{j,(l+1)}^S - (x_t^F)'\beta_{(l+1)}^F)^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sum_{t = m+1}^{T} P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}. \end{align*}\] We can now combine these three results into a single optimization problem, which leads us to the next \(\alpha\). This is necessary because, once again, all three conditions must be satisfied simultaneously and cannot be implemented sequentially, as each of the three variables plays a role in the different conditions. We thus obtain another constrained optimization problem, but this time under two constraints: \[\begin{align*} &\arg\min_{\beta_j^S} \sum_{t = m+1}^{T}\left((y_t - (x_t^F)'\beta^F)\cfrac{\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}{\sigma_j} - (x_t^S)'\cfrac{\sqrt{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}{\sigma_j}\beta_j^S\right)^2 \quad s.t.\\ &\beta^F = \left(\sum_{t = m+1}^{T}x_t^F(x_t^F)'\sum_{j = 1}^{N}\cfrac{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sigma_j^2}\right)^{-1}\left(\sum_{t = m+1}^{T}\sum_{j = 1}^{N}(y_t - (x_t^S)'\beta_j^S)x_t^F\cfrac{P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sigma_j^2}\right) \\ &\sigma_j = \sqrt{\cfrac{\sum_{t = m+1}^{T}(y_t - (x_t^S)'\beta_j^S - (x_t^F)'\beta^F)^2 P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}{\sum_{t = m+1}^{T} P_{\lambda_l}(S_t = j|\mathcal{Y}_T = \vec{y}_T)}}. \end{align*}\]
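The only structurally new ingredient compared to Example 4 is that the regime-specific variances now enter the \(\beta^F\) condition as additional weights. The following R sketch implements just this update under the assumption that the current \(\beta_j^S\), \(\sigma_j^2\) and the smoothed probabilities `p_sm` are given; it is illustrative, not MSARM's implementation.

```r
# Hedged sketch of the beta^F update of Example 5, where P(S_t = j | Y_T) / sigma_j^2
# acts as the weight of observation t in regime j.
update_betaF_example5 <- function(y, XF, XS, betaS, sigma2, p_sm) {
  N <- ncol(p_sm)
  # total weight per observation: sum_j P(S_t = j | Y_T) / sigma_j^2
  w_tot <- rowSums(sweep(p_sm, 2, sigma2, "/"))
  A <- t(XF) %*% (w_tot * XF)          # sum_t x_t^F (x_t^F)' times the total weight
  b <- Reduce(`+`, lapply(1:N, function(j)
    t(XF) %*% ((p_sm[, j] / sigma2[j]) * (y - XS %*% betaS[, j]))))
  solve(A, b)                          # the closed-form expression for beta^F
}
```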
Forecasting with Markov-Switching Models
Next, we would like to turn to the topic of forecasts using Markov-Switching models. We recall that the conditional density is given by \(f_{Y_t|S_t,\mathcal{Y}_{t-1}; \alpha}(y_t|j,\vec{y}_{t-1})\). If we now have \(\vec{y}_t\) and \(s_{t+1}\), we could, for a simple AR(1) model, state: \[\begin{align*} Y_{t+1} = c_{s_{t+1}} + \phi_{s_{t+1}}Y_t + U_{t+1}, \end{align*}\] where we have \(E_{\alpha}(Y_{t+1}|S_{t+1} = j, \mathcal{Y}_t = \vec{y}_t) = c_j + \phi_j y_t\). Furthermore, we can show the following for an m-step ahead forecast: \[\begin{align*} \displaystyle E_{\theta}(Y_{t+m}|\mathcal{Y}_t = \vec{y}_t) &= \int y_{t+m}f_{Y_{t+m}|\mathcal{Y}_t;\theta}(y_{t+m}|\vec{y}_t) \,dy_{t+m} \\ &= \int y_{t+m}\left(\sum_{j = 1}^{N}f_{Y_{t+m},S_{t+m}|\mathcal{Y}_t;\theta}(y_{t+m},j|\vec{y}_t)\right) \,dy_{t+m} \\ &= \int y_{t+m}\left(\sum_{j = 1}^{N}f_{Y_{t+m}|S_{t+m},\mathcal{Y}_t;\alpha}(y_{t+m}|j,\vec{y}_t)P_{\theta}(S_{t+m} = j|\mathcal{Y}_t = \vec{y}_t)\right) \,dy_{t+m} \\ &= \sum_{j = 1}^{N}P_{\theta}(S_{t+m} = j|\mathcal{Y}_t = \vec{y}_t) \int y_{t+m}f_{Y_{t+m}|S_{t+m},\mathcal{Y}_t;\alpha}(y_{t+m}|j,\vec{y}_t) \,dy_{t+m} \\ &= \sum_{j = 1}^{N}P_{\theta}(S_{t+m} = j|\mathcal{Y}_t = \vec{y}_t) E_{\alpha}(Y_{t+m}|S_{t+m} = j,\mathcal{Y}_t = \vec{y}_t). \end{align*}\] One could thus say that the forecast for \(y_{t+m}\) corresponds to a weighted average of the regime-conditional expected values, with the regime probabilities as the weights. To compactify the notation, we collect the \(E_{\alpha}(Y_{t+m}|S_{t+m} = j, \mathcal{Y}_t = \vec{y}_t)\) in a vector \(\mathbf{h_t}\), so that we can write \(E_{\theta}(Y_{t+m}|\mathcal{Y}_t = \vec{y}_t) = \mathbf{h_t'} \hat{\zeta}_{t+m|t}\). This derivation closely follows Hamilton (1994, page 694-695).
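For the simple AR(1) case mentioned above, a one-step-ahead forecast can be written in a few lines of R; the sketch below assumes the filtered probabilities \(\hat{\zeta}_{t|t}\) and the model parameters are given, and the names are illustrative.

```r
# Hedged sketch: one-step-ahead forecast for an MS-AR(1) as a probability-weighted
# average of the regime-conditional forecasts.
forecast_one_step <- function(y_t, c_s, phi_s, Pi, zeta_tt) {
  h_t       <- c_s + phi_s * y_t       # E(Y_{t+1} | S_{t+1} = j, Y_t) for each regime j
  zeta_t1_t <- t(Pi) %*% zeta_tt       # predicted regime probabilities zeta_{t+1|t}
  sum(h_t * zeta_t1_t)                 # h_t' zeta_{t+1|t}
}
```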
Regime Forecasting with Markov-Switching Models
The probability that the Markov-Chain will be in a particular state in the future can be considered as \(E_{\theta}(\zeta_{t+m}|\mathcal{Y}_t = \vec{y}_t)\). Based on (14), this is given by: \[\begin{equation} E_{\theta}(\zeta_{t+m}|\mathcal{Y}_t = \vec{y}_t) = (\Pi')^mE_{\theta}(\zeta_t|\mathcal{Y}_t = \vec{y}_t), \end{equation}\] or alternatively: \[\begin{equation} \hat{\zeta}_{t+m|t} = (\Pi')^m \hat{\zeta}_{t|t}. \end{equation}\]
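In R this amounts to repeatedly applying \(\Pi'\), as in the following small sketch (illustrative names):

```r
# Hedged sketch: m-step-ahead regime probabilities, zeta_{t+m|t} = (Pi')^m zeta_{t|t}.
regime_forecast <- function(Pi, zeta_tt, m) {
  zeta <- zeta_tt
  for (i in seq_len(m)) zeta <- t(Pi) %*% zeta
  zeta
}
```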
Appendix
Optimal Inference of the Regimes and Derivation of the Log-Likelihood
The following derivation closely follows Hamilton (1994, page 693). First of all, it is essential to keep in mind that \((\hat{\zeta}_{t|t-1})_j = P_{\theta}(S_t = j|\mathcal{Y}_{t-1} = \vec{y}_{t-1})\) and \((\eta_t)_j = f_{Y_t|S_t, \mathcal{Y}_{t-1};\alpha}(y_t|j,\vec{y}_{t-1})\). Based on this we can see that: \[\begin{align*} \displaystyle (\hat{\zeta}_{t|t-1} \odot \eta_t)_j &= P_{\theta}(S_t = j|\mathcal{Y}_{t-1} = \vec{y}_{t-1})f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|j,\vec{y}_{t-1}) \\&= f_{Y_t,S_t|\mathcal{Y}_{t-1};\theta}(y_t,j|\vec{y}_{t-1}). \end{align*}\] If we now sum over all potential values of \(S_t\) we get: \[\begin{align*} \sum_{j = 1}^{N}f_{Y_t,S_t|\mathcal{Y}_{t-1};\theta}(y_t,j|\vec{y}_{t-1}) = f_{Y_t|\mathcal{Y}_{t-1};\theta}(y_t|\vec{y}_{t-1}) = \mathbf{1'}(\hat{\zeta}_{t|t-1} \odot \eta_t). \end{align*}\] Additionally it is therefore true that: \[\begin{align*} \cfrac{(\hat{\zeta}_{t|t-1} \odot \eta_t)_j}{\mathbf{1'}(\hat{\zeta}_{t|t-1} \odot \eta_t)} = \cfrac{f_{Y_t,S_t|\mathcal{Y}_{t-1};\theta}(y_t,j|\vec{y}_{t-1})}{f_{Y_t|\mathcal{Y}_{t-1};\theta}(y_t|\vec{y}_{t-1})} = P_{\theta}(S_t = j|\mathcal{Y}_t = \vec{y}_{t}) = (\hat{\zeta}_{t|t})_j. \end{align*}\] Utilizing vectors we can write: \[\begin{align*} \hat{\zeta}_{t|t} = \cfrac{(\hat{\zeta}_{t|t-1} \odot \eta_t)}{\mathbf{1'}(\hat{\zeta}_{t|t-1} \odot \eta_t)}. \end{align*}\] To put it as simply as possible, one could say that we essentially just applied Bayes' rule. Next we want to get from \((\hat{\zeta}_{t|t})_j = P_{\theta}(S_t = j|\mathcal{Y}_t = \vec{y}_{t})\) to \((\hat{\zeta}_{t+1|t})_j = P_{\theta}(S_{t+1} = j|\mathcal{Y}_t = \vec{y}_{t})\). We know from (14) that this is possible via: \[\begin{align*} E_{\theta}(\zeta_{t+1}|\mathcal{Y}_t = \vec{y}_t) = \Pi'\hat{\zeta}_{t|t}. \end{align*}\]
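Putting these two steps together gives the filter recursion. The R sketch below assumes that the conditional densities have already been evaluated in a matrix `eta` with `eta[t, j]` \(= f_{Y_t|S_t,\mathcal{Y}_{t-1};\alpha}(y_t|j,\vec{y}_{t-1})\) and that some initial probability vector `zeta_init` is supplied; it is an illustrative sketch, not MSARM's implementation.

```r
# Hedged sketch of the Hamilton filter: Bayes step followed by the prediction step.
hamilton_filter <- function(eta, Pi, zeta_init) {
  T_len <- nrow(eta); N <- ncol(eta)
  zeta_tt  <- matrix(0, T_len, N)     # filtered probabilities P(S_t = j | Y_t)
  zeta_tt1 <- matrix(0, T_len, N)     # predicted probabilities P(S_t = j | Y_{t-1})
  loglik <- 0
  zeta_pred <- zeta_init
  for (t in 1:T_len) {
    zeta_tt1[t, ] <- zeta_pred
    num   <- zeta_pred * eta[t, ]     # zeta_{t|t-1} (elementwise) eta_t
    denom <- sum(num)                 # f(y_t | Y_{t-1}), one likelihood contribution
    zeta_tt[t, ] <- num / denom       # Bayes rule
    loglik <- loglik + log(denom)
    zeta_pred <- as.vector(t(Pi) %*% zeta_tt[t, ])  # zeta_{t+1|t} = Pi' zeta_{t|t}
  }
  list(zeta_tt = zeta_tt, zeta_tt1 = zeta_tt1, loglik = loglik)
}
```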
Smoothed Inference over the Regimes
The following derivation closely follows Hamilton (1994, page 700-702). First we note that \(S_t\) depends on \(\mathcal{Y}_{t-1}\) through \(S_{t-1}\) and on future observations only through \(S_{t+1}\)! One could say: \[\begin{align*} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_T = \vec{y}_{T}) = P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_t = \vec{y}_{t}). \end{align*}\]
We can show this formally: \[\begin{align*} \displaystyle P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) &= P_{\theta}(S_t = j|S_{t+1} = i,Y_{t+1} = y_{t+1},\mathcal{Y}_{t} = \vec{y}_{t}) \\ &= \cfrac{P_{\theta}(S_t = j,Y_{t+1} = y_{t+1}|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t})}{f_{Y_{t+1}|S_{t+1},\mathcal{Y}_t;\alpha}(y_{t+1}|i,\vec{y}_{t})} \\ &= \cfrac{f_{Y_{t+1}|S_t,S_{t+1},\mathcal{Y}_{t};\alpha}(y_{t+1}|j,i,\vec{y}_{t})}{f_{Y_{t+1}|S_{t+1},\mathcal{Y}_t;\alpha}(y_{t+1}|i,\vec{y}_{t})} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}) \\ &= \cfrac{f_{Y_{t+1}|S_{t+1},\mathcal{Y}_t;\alpha}(y_{t+1}|i,\vec{y}_{t})}{f_{Y_{t+1}|S_{t+1},\mathcal{Y}_t;\alpha}(y_{t+1}|i,\vec{y}_{t})} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}) \\ &= P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}). \end{align*}\] The last two steps are possible because based on (11) the distribution of \(Y_{t+1}\) conditional on \(S_{t+1}\) is independent of \(S_{t}\). Now we can approach the derivation for \(t+2\) in a similar way: \[\begin{align*} \displaystyle P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+2} = \vec{y}_{t+2}) &= P_{\theta}(S_t = j|S_{t+1} = i,Y_{t+2} = y_{t+2},\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ & = \cfrac{P_{\theta}(S_t = j,Y_{t+2} = y_{t+2}|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) }{f_{Y_{t+2}|S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|i,\vec{y}_{t+1})} \\ &= \cfrac{f_{Y_{t+2}|S_t,S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|j,i,\vec{y}_{t+1})}{f_{Y_{t+2}|S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|i,\vec{y}_{t+1})} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ &= \cfrac{f_{Y_{t+2}|S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|i,\vec{y}_{t+1})}{f_{Y_{t+2}|S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|i,\vec{y}_{t+1})} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ &= P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ &= P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}). \end{align*}\] Simplifying the fraction is possible because: \[\begin{align*} \displaystyle f_{Y_{t+2}|S_t,S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2}|j,i,\vec{y}_{t+1}) &= \sum_{k = 1}^{N}f_{Y_{t+2},S_{t+2}|S_t, S_{t+1}, \mathcal{Y}_{t+1};\theta}(y_{t+2},k|j,i,\vec{y}_{t+1}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+2}|S_{t+2},S_t, S_{t+1}, \mathcal{Y}_{t+1};\alpha}(y_{t+2}|k,j,i,\vec{y}_{t+1}) \\ & \hspace{0.5cm} \cdot P_{\theta}(S_{t+2}=k|S_{t+1} = i, S_{t} = j,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+2}|S_{t+2}, S_{t+1}, \mathcal{Y}_{t+1};\alpha}(y_{t+2}|k,i,\vec{y}_{t+1}) \\ & \hspace{0.5cm} \cdot P_{\theta}(S_{t+2}=k|S_{t+1} = i,\mathcal{Y}_{t+1} = \vec{y}_{t+1}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+2},S_{t+2}|S_{t+1},\mathcal{Y}_{t+1};\theta}(y_{t+2},k|i,\vec{y}_{t+1}) \\ &= f_{Y_{t+2}|S_{t+1}, \mathcal{Y}_{t+1};\theta}(y_{t+2}|i,\vec{y}_{t+1}). \end{align*}\] We now show by induction that this approach is generally applicable. The cases shown above form the base of the induction. For the induction step, we choose an arbitrary, but feasible, \(n\) and take as induction hypothesis: \[\begin{equation} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+n} = \vec{y}_{t+n}) = P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_t = \vec{y}_{t}). \end{equation}\] We then have to show that: \[\begin{equation} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+n+1} = \vec{y}_{t+n+1}) = P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_t = \vec{y}_{t}). 
\end{equation}\] To show that we write: \[\begin{align*} P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+n+1} = \vec{y}_{t+n+1}) &= P_{\theta}(S_t =j|S_{t+1}=i,Y_{t+n+1} = y_{t+n+1},\mathcal{Y}_{t+n} = \vec{y}_{t+n}) \\ &= \cfrac{f_{S_t, Y_{t+n+1}|S_{t+1}, \mathcal{Y}_{t+n};\theta}(j,y_{t+n+1}|i,\vec{y}_{t+n})}{f_{Y_{t+n+1}|S_{t+1},\mathcal{Y}_{t+n};\theta}(y_{t+n+1}|i,\vec{y}_{t+n})} \\ &= \cfrac{f_{Y_{t+n+1}|S_t,S_{t+1},\mathcal{Y}_{t+n};\theta}(y_{t+n+1}|j,i,\vec{y}_{t+n})}{f_{Y_{t+n+1}|S_{t+1},\mathcal{Y}_{t+n};\theta}(y_{t+n+1}|i,\vec{y}_{t+n})} \\& \hspace{0.5cm} \cdot P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+n} = \vec{y}_{t+n}) \\ &= P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_{t+n} = \vec{y}_{t+n}) \\ &= P_{\theta}(S_t = j|S_{t+1} = i,\mathcal{Y}_t = \vec{y}_{t}). \end{align*}\] Thereby, the last step is just applying the induction assumption and simplifying the fraction is possible because: \[\begin{align*} f_{Y_{t+n+1}|S_t, S_{t+1}, \mathcal{Y}_{t+n};\theta}(y_{t+n+1}|j,i,\vec{y}_{t+n}) &= \sum_{k = 1}^{N}f_{Y_{t+n+1},S_{t+n+1}|S_t, S_{t+1}, \mathcal{Y}_{t+n};\theta}(y_{t+n+1},k|j,i, \vec{y}_{t+n}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+n+1}|S_{t+n+1}, S_t, S_{t+1},\mathcal{Y}_{t+n};\alpha}(y_{t+n+1}|k,j,i,\vec{y}_{t+n}) \\& \hspace{0.5cm} \cdot P_{\theta}(S_{t+n+1} = k|S_t = j, S_{t+1} = i, \mathcal{Y}_{t+n} = \vec{y}_{t+n}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+n+1}|S_{t+n+1}, S_{t+1},\mathcal{Y}_{t+n};\alpha}(y_{t+n+1}|k,i,\vec{y}_{t+n}) \\& \hspace{0.5cm} \cdot P_{\theta}(S_{t+n+1} = k| S_{t+1} = i, \mathcal{Y}_{t+n} = \vec{y}_{t+n}) \\ &= \sum_{k = 1}^{N}f_{Y_{t+n+1},S_{t+n+1}|S_{t+1},\mathcal{Y}_{t+n};\theta}(y_{t+n+1},k|i,\vec{y}_{t+n}) \\ &= f_{Y_{t+n+1}|S_{t+1}, \mathcal{Y}_{t+n};\theta}(y_{t+n+1}|i,\vec{y}_{t+n}). \end{align*}\] Once this is done it can be seen that: \[\begin{align*} P_{\theta}(S_{t} = j|S_{t+1} = i,\mathcal{Y}_{t+m} = \vec{y}_{t+m}) = P_{\theta}(S_{t} = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}). \end{align*}\] From this follows the original claim: \[\begin{align*} P_{\theta}(S_{t} =j|S_{t+1}=i,\mathcal{Y}_{T} = \vec{y}_{T}) = P_{\theta}(S_{t} = j|S_{t+1} = i, \mathcal{Y}_{t} = \vec{y}_{t}). \end{align*}\] Next we can see that: \[\begin{align*} \displaystyle P_{\theta}(S_{t} =j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}) &= \cfrac{P_{\theta}(S_{t}=j,S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})} \\ &= \cfrac{P_{\theta}(S_{t+1} = i|S_{t}=j,\mathcal{Y}_{t} = \vec{y}_{t})P_{\theta}(S_{t}=j|\mathcal{Y}_{t} = \vec{y}_{t})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})} \\ &= \cfrac{P_{\theta}(S_{t+1} = i|S_{t}=j)P_{\theta}(S_{t}=j|\mathcal{Y}_{t} = \vec{y}_{t})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})} \\ &= \cfrac{\pi_{j,i}P_{\theta}(S_{t}=j|\mathcal{Y}_{t} = \vec{y}_t)}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})}. \end{align*}\] From this it follows that: \[\begin{align*} \displaystyle P_{\theta}(S_{t} = j,S_{t+1} =i|\mathcal{Y}_{T} = \vec{y}_{T}) &= P_{\theta}(S_{t+1} = i|\mathcal{Y}_{T} = \vec{y}_{T})P_{\theta}(S_{t} = j|S_{t+1} = i,\mathcal{Y}_{T} =\vec{y}_{T}) \\ &= P_{\theta}(S_{t+1} = i|\mathcal{Y}_{T} = \vec{y}_{T})P_{\theta}(S_{t} = j|S_{t+1} = i,\mathcal{Y}_{t} = \vec{y}_{t}) \\ &= P_{\theta}(S_{t+1} = i|\mathcal{Y}_{T} = \vec{y}_{T})\cfrac{\pi_{j,i}P_{\theta}(S_{t}=j|\mathcal{Y}_{t} = \vec{y}_{t})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})}. 
\end{align*}\] Therefore, the smoothed inference over \(S_t\) is given by: \[\begin{align*} \displaystyle P_{\theta}(S_{t} = j|\mathcal{Y}_{T} = \vec{y}_{T}) &= \sum_{i = 1}^{N}P_{\theta}(S_t =j,S_{t+1} = i|\mathcal{Y}_{T} = \vec{y}_{T}) \\ &= \sum_{i = 1}^{N}P_{\theta}(S_{t+1} = i|\mathcal{Y}_{T} = \vec{y}_{T})\cfrac{\pi_{j,i}P_{\theta}(S_{t}=j|\mathcal{Y}_{t} = \vec{y}_{t})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})} \\ &= P_{\theta}(S_t = j|\mathcal{Y}_{t} = \vec{y}_{t})\sum_{i = 1}^{N}\cfrac{\pi_{j,i}P_{\theta}(S_{t+1}=i|\mathcal{Y}_{T} = \vec{y}_{T})}{P_{\theta}(S_{t+1} = i|\mathcal{Y}_{t} = \vec{y}_{t})} \\ &= P_{\theta}(S_t = j|\mathcal{Y}_{t} = \vec{y}_{t})(\pi_{j,1} ... \pi_{j,N})\begin{pmatrix} \cfrac{P_{\theta}(S_{t+1} = 1|\mathcal{Y}_{T} = \vec{y}_{T})}{P_{\theta}(S_{t+1} = 1|\mathcal{Y}_{t} = \vec{y}_{t})}\\ ...\\ \cfrac{P_{\theta}(S_{t+1} = N|\mathcal{Y}_{T} = \vec{y}_{T})}{P_{\theta}(S_{t+1} = N|\mathcal{Y}_{t} = \vec{y}_{t})} \end{pmatrix} \\ &= P_{\theta}(S_{t} =j|\mathcal{Y}_{t} = \vec{y}_{t})\Pi_{j,}(\hat{\zeta}_{t+1|T} (\div) \hat{\zeta}_{t+1|t}). \end{align*}\] Here \(\Pi_{j,}\) denotes the jth row of \(\Pi\). For the vector of probabilities one can therefore write: \[\begin{align*} \hat{\zeta}_{t|T} = \hat{\zeta}_{t|t} \odot \Pi(\hat{\zeta}_{t+1|T} (\div) \hat{\zeta}_{t+1|t}). \end{align*}\] This is the formula presented earlier.
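In code, the backward recursion can reuse the filtered and predicted probabilities from the filter sketch above; the following R snippet is again only an illustrative sketch.

```r
# Hedged sketch of the Kim smoother:
# zeta_{t|T} = zeta_{t|t} (elementwise *) Pi %*% (zeta_{t+1|T} (elementwise /) zeta_{t+1|t}).
kim_smoother <- function(zeta_tt, zeta_tt1, Pi) {
  T_len <- nrow(zeta_tt)
  zeta_tT <- zeta_tt                   # at t = T the smoothed and filtered values coincide
  for (t in (T_len - 1):1) {
    ratio <- zeta_tT[t + 1, ] / zeta_tt1[t + 1, ]
    zeta_tT[t, ] <- zeta_tt[t, ] * as.vector(Pi %*% ratio)
  }
  zeta_tT
}
```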
EM Algorithm for Autoregressive Processes with Finite Lag Order
The proofs presented here for the equations (38), (39) and (40) closely follow Hamilton (1990, page 63-67). We begin with (38), then go on to (39) and finish with the proof for (40). The first and essential step for all three proofs is to establish that the following holds: \[\begin{equation} \begin{aligned} f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) = & f_{Y_T|Z_T;\alpha}(y_T|z_T) \cdot P_{\Pi}(S_T = s_T|S_{T-1} = s_{T-1}) \\ & \cdot f_{Y_{T-1}|Z_{T-1};\alpha}(y_{T-1}|z_{T-1}) \cdot P_{\Pi}(S_{T-1} = s_{T-1}|S_{T-2} = s_{T-2}) \\ & \cdot... \\ & \cdot f_{Y_{m+1}|Z_{m+1};\alpha}(y_{m+1}|z_{m+1}) \cdot P_{\Pi}(S_{m+1} = s_{m+1}|S_m = s_m) \\ & \cdot \rho_{s_m,...,s_1}. \end{aligned} \end{equation}\] This can be derived in the following way: \[\begin{align*} \displaystyle &f_{Y_T|Z_T;\alpha}(y_T|z_T) \cdot P_{\Pi}(S_T = s_T|S_{T-1} = s_{T-1}) \\& \cdot f_{Y_{T-1}|Z_{T-1};\alpha}(y_{T-1}|z_{T-1}) \cdot P_{\Pi}(S_{T-1} = s_{T-1}|S_{T-2} = s_{T-2})\\ & \cdot...\\ & \cdot f_{Y_{m+1}|Z_{m+1};\alpha}(y_{m+1}|z_{m+1}) \cdot P_{\Pi}(S_{m+1} = s_{m+1}|S_m = s_m)\\ & \cdot \rho_{s_m,...,s_1} \\ &= f_{Y_T|S_T,...,S_{T-m},Y_{T-1},...,Y_{T-m};\alpha}(y_T|s_T,...,s_{T-m},y_{T-1},...,y_{T-m}) P_{\Pi}(S_T = s_T|S_{T-1} = s_{T-1})\\ & \hspace{0.5cm} \cdot f_{Y_{T-1}|S_{T-1},...,S_{T-1-m},Y_{T-1-1},...,Y_{T-1-m};\alpha}(y_{T-1}|s_{T-1},...,s_{T-1-m},y_{T-1-1},...,y_{T-1-m}) \\& \hspace{0.5cm} \cdot P_{\Pi}(S_{T-1} = s_{T-1}|S_{T-2} = s_{T-2}) \\& \hspace{0.5cm} \cdot \dots \\ & \hspace{0.5cm}\cdot f_{Y_{m+2}|S_{m+2},...,S_2,Y_{m+1},...,Y_2;\alpha}(y_{m+2}|s_{m+2},...,s_2,y_{m+1},...,y_2) P_{\Pi}(S_{m+2} = s_{m+2}|S_{m+1} = s_{m+1})\\ & \hspace{0.5cm}\cdot f_{Y_{m+1}|S_{m+1},...,S_1,Y_{m},...,Y_1;\alpha}(y_{m+1}|s_{m+1},...,s_1,y_{m},...,y_1) P_{\Pi}(S_{m+1} = s_{m+1}|S_{m} = s_{m})\\ & \hspace{0.5cm}\cdot P_{\lambda}(S_m = s_m,...,S_1 = s_1|Y_m = y_m,...,Y_1 = y_1). \end{align*}\] And due to the Markov property and (12) it holds that: \[\begin{align*} \displaystyle & f_{Y_{m+1}|S_{m+1},...,S_1,Y_{m},...,Y_1;\alpha}(y_{m+1}|s_{m+1},...,s_1,y_{m},...,y_1) P_{\Pi}(S_{m+1} = s_{m+1}|S_{m} = s_{m}) \\ &= f_{Y_{m+1}|S_{m+1},...,S_1,Y_{m},...,Y_1;\alpha}(y_{m+1}|s_{m+1},...,s_1,y_{m},...,y_1) P_{\Pi}(S_{m+1} = s_{m+1}|S_{m} = s_{m},...,S_1 = s_1) \\ &= f_{Y_{m+1}|S_{m+1},...,S_1,Y_{m},...,Y_1;\alpha}(y_{m+1}|s_{m+1},...,s_1,y_{m},...,y_1) \\& \hspace{0.5cm} \cdot \ P_{\Pi}(S_{m+1} = s_{m+1}|S_{m} = s_{m},...,S_1 = s_1,Y_m = y_m,...,Y_1 = y_1) \\ &= f_{Y_{m+1},S_{m+1}|S_m,...,S_1,Y_m,...,Y_1;\theta}(y_{m+1},s_{m+1}|s_m,...,s_1,y_m,...,y_1). \end{align*}\] Logically it also holds that: \[\begin{align*} \displaystyle & f_{Y_{m+1},S_{m+1}|S_m,...,S_1,Y_m,...,Y_1;\theta}(y_{m+1},s_{m+1}|s_m,...,s_1,y_m,...,y_1) \cdot P_{\lambda}(S_m = s_m,...,S_1 = s_1|Y_m = y_m,...,Y_1 = y_1) \\& = f_{Y_{m+1},S_{m+1},S_{m},...,S_1|Y_m,...,Y_1;\theta}(y_{m+1},s_{m+1},...,s_1|y_m,...,y_1). \end{align*}\] We assumed in our model formulation in 4.1 that there is a maximal autoregressive lag order \(m\), such that \(Y_t\) depends only on its last \(m\) lags.
Then it holds that: \[\begin{align*} &f_{Y_{m+2}|S_{m+2},...,S_2,Y_{m+1},...,Y_2;\alpha}(y_{m+2}|s_{m+2},...,s_2,y_{m+1},...,y_2) P_{\Pi}(S_{m+2} = s_{m+2}|S_{m+1} = s_{m+1}) \\ &= f_{Y_{m+2}|S_{m+2},...,S_2,S_1,Y_{m+1},...,Y_2,Y_1;\alpha}(y_{m+2}|s_{m+2},...,s_2,s_1,y_{m+1},...,y_2,y_1) P_{\Pi}(S_{m+2} = s_{m+2}|S_{m+1} = s_{m+1}) \\ &= f_{Y_{m+2}|S_{m+2},...,S_2,S_1,Y_{m+1},...,Y_2,Y_1;\alpha}(y_{m+2}|s_{m+2},...,s_2,s_1,y_{m+1},...,y_2,y_1) \\& \hspace{0.5cm} \cdot P_{\Pi}(S_{m+2} = s_{m+2}|S_{m+1} = s_{m+1},S_m = s_m,...,S_1 = s_1,Y_{m+1} = y_{m+1},...,Y_1 = y_1) \\ &= f_{Y_{m+2},S_{m+2}|S_{m+1},S_m,...,S_1,Y_{m+1},Y_m,...,Y_1;\theta}(y_{m+2},s_{m+2}|s_{m+1},s_m,...,s_1,y_{m+1},y_m,...,y_1). \end{align*}\] And thus we can write: \[\begin{align*} \displaystyle &f_{Y_{m+2},S_{m+2}|S_{m+1},S_m,...,S_1,Y_{m+1},Y_m,...,Y_1;\theta}(y_{m+2},s_{m+2}|s_{m+1},s_m,...,s_1, y_{m+1},y_m,...,y_1) \\ &\cdot f_{Y_{m+1},S_{m+1},S_{m},...,S_1|Y_m,...Y_1;\theta}(y_{m+1},s_{m+1},s_{m},...,s_1|y_m,...,y_1) \\ &= f_{Y_{m+2},S_{m+2}|S_{m+1},Y_{m+1},S_m,...,S_1,Y_m,...,Y_1;\theta}(y_{m+2},s_{m+2}|s_{m+1},y_{m+1},s_m,...,s_1,y_m,...,y_1) \\& \hspace{0.5cm} \cdot f_{Y_{m+1},S_{m+1},S_{m},...,S_1|Y_m,...Y_1;\theta}(y_{m+1},s_{m+1},s_{m},...,s_1|y_m,...,y_1) \\ &= f_{Y_{m+2},S_{m+2},Y_{m+1},S_{m+1},S_m,...S_1|Y_{m},...Y_1;\theta}(y_{m+2},s_{m+2},y_{m+1},s_{m+1}, s_m,...,s_1|y_{m},...,y_1). \end{align*}\] We can follow this logic until \(T\) and end up with: \[\begin{align*} \displaystyle & f_{Y_T|S_T,...,S_{T-m},Y_{T-1},...,Y_{T-m};\alpha}(y_T|s_T,...,s_{T-m},y_{T-1},...,y_{T-m})P_{\Pi}(S_T = s_T|S_{T-1} = s_{T-1})\\ &\cdot f_{Y_{T-1}|S_{T-1},...,S_{T-1-m},Y_{T-1-1},...,Y_{T-1-m};\alpha}(y_{T-1}|s_{T-1},...,s_{T-1-m}, y_{T-1-1},...,y_{T-1-m}) \\& \cdot P_{\Pi}(S_{T-1} = s_{T-1}|S_{T-2} = s_{T-2})\\ & \cdot \dots\\ &\cdot f_{Y_{m+2}|S_{m+2},...,S_2,Y_{m+1},...,Y_2;\alpha}(y_{m+2}|s_{m+2},...,s_2, y_{m+1},...,y_2)P_{\Pi}(S_{m+2} = s_{m+2}|S_{m+1} = s_{m+1})\\ &\cdot f_{Y_{m+1}|S_{m+1},...,S_2,Y_{m},...,Y_1;\alpha}(y_{m+1}|s_{m+1},...,s_2,y_{m},...,y_1) P_{\Pi}(S_{m+1} = s_{m+1}|S_m = s_{m})\\ &\cdot P_{\lambda}(S_m = s_m,...,S_1 = s_1|Y_m = y_m,...,Y_1 = y_1) \\ &= f_{Y_T,S_T,Y_{T-1},S_{T-1},...,Y_{m+1},S_{m+1},S_m,...,S_1|Y_m,...,Y_1;\lambda}(y_T,s_T,y_{T-1}, s_{T-1},...,y_{m+1},s_{m+1}, s_m,..., s_1| y_m,..., y_1) \\ &= f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). \end{align*}\]
Now that this first step has been established, we can focus on the derivation of the first equation; here we closely follow Hamilton (1990, page 63-65). We start with (55); it holds that: \[\begin{equation} \begin{aligned} \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)) &= \ln(f_{Y_T|Z_T;\alpha}(y_T|z_T)) \\& \hspace{0.5cm} + \ln(P_{\Pi}(S_T = s_T|S_{T-1} = s_{T-1}))\\ & \hspace{0.5cm}+ \ln(f_{Y_{T-1}|Z_{T-1};\alpha}(y_{T-1}|z_{T-1})) \\& \hspace{0.5cm} + \ln(P_{\Pi}(S_{T-1} = s_{T-1}|S_{T-2} = s_{T-2}))\\ &\hspace{0.5cm} + \dots \\ &\hspace{0.5cm} + \ln(f_{Y_{m+1}|Z_{m+1};\alpha}(y_{m+1}|z_{m+1})) \\&\hspace{0.5cm} + \ln(P_{\Pi}(S_{m+1} = s_{m+1}|S_m = s_m))\\ &\hspace{0.5cm} + \ln(\rho_{s_m,\dots,s_1}). \end{aligned} \end{equation}\] We remember that if we have \(L(x,y)\) and \(\ln(L(x,y)) = l(x,y)\), then it is true that: \[\begin{align*} \cfrac{\partial l(x,y)}{\partial x} = \cfrac{1}{L(x,y)}\cfrac{\partial L(x,y)}{\partial x}, \end{align*}\] and thus \[\begin{align*} \cfrac{\partial L(x,y)}{\partial x} = \cfrac{\partial l(x,y)}{\partial x}L(x,y). \end{align*}\] We apply this now: \[\begin{align*} \displaystyle &f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \pi_{i,j}} \\ &= \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{\partial \pi_{i,j}}, \end{align*}\] where: \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \pi_{i,j}} = \sum_{t = m+1}^{T}\cfrac{\partial \ln(P_{\Pi}(S_t = s_t|S_{t-1} = s_{t-1}))}{\partial \pi_{i,j}}. \end{align*}\] One should note that: \[\begin{align*} \displaystyle \cfrac{\partial \ln(P_{\Pi}(S_t = s_t|S_{t-1} = s_{t-1}))}{\partial \pi_{i,j}} = \begin{cases} \cfrac{1}{\pi_{i,j}}, & \text{if $S_t =j \quad \text{and} \quad S_{t-1} = i$} \\ 0, & \text{otherwise} \end{cases}. \end{align*}\] In the following, we will use the Kronecker delta notation: \[\begin{align*} \delta_{[A]} = \begin{cases} 1, &\text{if A is true}\\ 0, &\text{otherwise} \end{cases}, \end{align*}\] thus: \[\begin{align*} \displaystyle \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{\partial \pi_{i,j}} &= f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)\sum_{t = m+1}^{T}\cfrac{\partial \ln(P_{\Pi}(S_t = s_t|S_{t-1} = s_{t-1}))}{\partial \pi_{i,j}} \\ &= f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)\cfrac{1}{\pi_{i,j}}\sum_{t = m+1}^{T}\delta_{[S_t =j,S_{t-1}= i]}. \end{align*}\] We remember that the following holds: \[\begin{align*} \displaystyle Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1}) = \sum_{\vec{s}_{T}} \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). 
\end{align*}\] Therefore, we can say: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1})}{\partial \pi_{i,j}^{(l+1)}} &= \sum_{\vec{s}_{T}}\cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \pi_{i,j}^{(l+1)}}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_{T}}\cfrac{1}{\pi_{i,j}^{(l+1)}}\sum_{t = m+1}^{T}\delta_{[S_t = j, S_{t-1} = i]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). \end{align*}\] It is now essential to notice that: \[\begin{align*} \displaystyle \sum_{\vec{s}_{T}}\delta_{[S_t = j,S_{t-1} = i]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) &= f_{\mathcal{Y}_{T:(m+1)},S_t, S_{t-1}|\mathcal{Y}_m;\lambda_l}(\vec{y}_{T:(m+1)},j,i|\vec{y}_m) \\ &= P_{\lambda_l}(S_t = j, S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T}) \\& \hspace{0.5cm} \cdot f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m), \end{align*}\] and that therefore: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1})}{\partial \pi_{i,j}^{(l+1)}} &= \sum_{\vec{s}_{T}}\cfrac{1}{\pi_{i,j}^{(l+1)}}\sum_{t = m+1}^{T}\delta_{[S_t = j, S_{t-1} = i]} f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &=\cfrac{1}{\pi_{i,j}^{(l+1)}}\sum_{t = m+1}^{T}\sum_{\vec{s}_{T}}\delta_{[S_t = j, S_{t-1} = i]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_l}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \cfrac{1}{\pi_{i,j}^{(l+1)}}\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j,S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T})f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m). \end{align*}\] Under the constraint \(\sum_{j = 1}^{N}\pi_{i,j} = 1\) we can now form the Lagrangian: \[\begin{align*} \displaystyle Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1}) - \mu_i(\sum_{j = 1}^{N}\pi_{i,j} - 1). \end{align*}\] This leads to the following first-order conditions: \[\begin{align*} \cfrac{\partial Q_{\lambda_{l},\vec{y}_{T}}(\lambda_{l+1})}{\partial \pi_{i,j}^{(l+1)}} = \mu_i, \quad \text{for} \quad j = 1,...,N. \end{align*}\] We insert our result from above: \[\begin{align*} \displaystyle & \cfrac{1}{\pi_{i,j}^{(l+1)}}\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j, S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T})f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m) = \mu_i \\ &\Leftrightarrow \sum_{t = m+1}^{T}P_{\lambda_l}(S_t =j,S_{t-1}=i|\mathcal{Y}_T = \vec{y}_{T}) = \cfrac{\pi_{i,j}^{(l+1)}\mu_i}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)}. \end{align*}\] We now sum over \(j = 1,...,N\), which leads to: \[\begin{align*} \displaystyle &\sum_{j = 1}^{N}\sum_{t = m+1}^{T}P_{\lambda_l}(S_t = j,S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T}) = \sum_{j = 1}^{N}\cfrac{\pi_{i,j}^{(l+1)}\mu_i}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)} \\ &\Leftrightarrow \sum_{t = m+1}^{T}P_{\lambda_l}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T}) = \cfrac{\mu_i}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)}. 
\end{align*}\] If we now insert this in the result from above we get: \[\begin{align*} &\sum_{t = m+1}^{T}P_{\lambda_l}(S_t =j,S_{t-1}=i|\mathcal{Y}_T = \vec{y}_{T}) = \cfrac{\pi_{i,j}^{(l+1)}\mu_i}{f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)} = \pi_{i,j}^{(l+1)}\sum_{t = m+1}^{T}P_{\lambda_l}(S_{t-1} = i|\mathcal{Y}_T = \vec{y}_{T}) \\ &\Leftrightarrow \pi_{i,j}^{(l+1)} = \cfrac{\sum_{t = m+1}^{T}P_{\lambda_l}(S_t =j,S_{t-1}=i|\mathcal{Y}_T = \vec{y}_{T})}{\sum_{t = m+1}^{T}P_{\lambda_l}(S_{t-1} = i|\mathcal{Y}_T =\vec{y}_{T})}, \end{align*}\] this concludes the derivation of (38). With that, we get to the second equation. In the following we closely follow Hamilton (1990, page 65-66). Again, we start with (55), but this time we take the derivative with respect to \(\alpha\): \[\begin{align*} \displaystyle \cfrac{\partial f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)}{\partial \alpha} = f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)\sum_{t = m+1}^{T}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha}(y_t|z_t))}{\partial \alpha}. \end{align*}\] It is important to note here that \(f_{Y_t|Z_t;\alpha}(y_t|z_t)\) depends on \(S_t\) through \(Z_t\), because \(Z_t = (S_t,S_{t-1},...,S_{t-m},Y_{t-1},Y_{t-2},...,Y_{t-m})\), but only through the dates \(t,...,t-m\), thus: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1})}{\partial \alpha_{l+1}} &= \cfrac{\partial}{\partial \alpha_{l+1}}\sum_{\vec{s}_{T}}\ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_{T}}\cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \alpha_{l+1}}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ & \stackrel{\text{(56)}}{=}\sum_{\vec{s}_{T}}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)\sum_{t = m+1}^{T}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha_{l+1}}(y_t|z_t))}{\partial \alpha_{l+1}} \\ &= \sum_{t = m + 1}^{T}\sum_{\vec{s}_T}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha_{l+1}}(y_t|z_t))}{\partial \alpha_{l+1}}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha_{l+1}}(y_t|z_t))}{\partial \alpha_{l+1}} \\& \hspace{0.5cm}\cdot \left(\sum_{s_{T} = 1}^{N}\cdots\sum_{s_{t+1} = 1}^{N}\sum_{s_{t-m-1} = 1}^{N}\cdots\sum_{s_{1} = 1}^{N}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m)\right) \\ &= \sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N}\cfrac{\partial \ln(f_{Y_t|Z_t;\alpha_{l+1}}(y_t|z_t))}{\partial \alpha_{l+1}} \\ & \hspace{0.5cm} \cdot P_{\lambda_l}(S_t = s_t,...,S_{t-m} = s_{t-m}|Y_T = y_T,...,Y_1 = y_1) \\& \hspace{0.5cm} \cdot f_{Y_T,...,Y_{m+1}|Y_m,...,Y_1;\lambda_l}(y_T,...,y_{m+1}|y_m,...,y_1). \end{align*}\] These steps are possible because \(f_{Y_t|Z_t;\alpha}(y_t|z_t)\) depends at most on \(S_t,...,S_{t-m}\). 
This leads us to the following first-order condition: \[\begin{align*} \displaystyle &f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)\sum_{t = m+1}^{T}\sum_{s_t = 1}^{N}\cdots\sum_{s_{t-m} = 1}^{N} \cfrac{\partial \ln(f_{Y_t|Z_t;\alpha_{l+1}}(y_t|z_t))}{\partial \alpha_{l+1}} \\& \cdot P_{\lambda_l}(S_t = s_t,S_{t-1} = s_{t-1},...,S_{t-m} = s_{t-m}|\mathcal{Y}_T = \vec{y}_{T}) = 0, \end{align*}\] which is equivalent to (39). Now we can turn to equation number three; here we closely follow the derivations presented by Hamilton (1990, page 66-67). We start with (56) and take the derivative with respect to \(\rho_{i_m,...,i_1}\): \[\begin{align*} \displaystyle \cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \rho_{i_m,...,i_1}} = \cfrac{1}{\rho_{i_m,...,i_1}}\cdot\delta_{[S_m = i_m,...,S_1 = i_1]}, \end{align*}\] because of this, it holds that: \[\begin{align*} \displaystyle \cfrac{\partial Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1})}{\partial\rho_{i_m,...,i_1}^{(l+1)}} &= \sum_{\vec{s}_T}\cfrac{\partial \ln(f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l+1}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m))}{\partial \rho_{i_m,...,i_1}^{(l+1)}}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) \\ &= \sum_{\vec{s}_{T}}\cfrac{1}{\rho_{i_m,...,i_1}^{(l+1)}}\delta_{[S_m = i_m,...,S_1 = i_1]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m). \end{align*}\] We want to optimize \(Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1})\) under the constraint \(\sum_{j = 1}^{N^m}(\rho_{l+1})_j = 1\), i.e. that the sum of all elements of \(\rho_{l+1}\) shall be 1. Thus we construct the Lagrangian: \[\begin{align*} \displaystyle Q_{\lambda_l,\vec{y}_{T}}(\lambda_{l+1}) - \mu\left(\sum_{j = 1}^{N^m}(\rho_{l+1})_j-1\right). \end{align*}\] This leads to the first-order condition: \[\begin{align*} \displaystyle &\sum_{\vec{s}_{T}} \cfrac{1}{\rho_{i_m,...,i_1}^{(l+1)}}\delta_{[S_m = i_m,...,S_1 = i_1]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) = \mu \\ &\Leftrightarrow \sum_{\vec{s}_T}\delta_{[S_m = i_m,...,S_1 = i_1]}f_{\mathcal{Y}_{T:(m+1)},\mathcal{S}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)},\vec{s}_T|\vec{y}_m) = \rho_{i_m,...,i_1}^{(l+1)}\mu \\ &\Leftrightarrow f_{S_m,...,S_1,Y_T,...,Y_{m+1}|Y_m,...,Y_1;\lambda_l}(i_m,...,i_1,y_T,...,y_{m+1}|y_m,...,y_1) = \rho_{i_m,...,i_1}^{(l+1)}\mu \\ &\Leftrightarrow P_{\lambda_l}(S_m = i_m,...,S_1 = i_1|\mathcal{Y}_T =\vec{y}_{T})f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m) = \mu \rho_{i_m,...,i_1}^{(l+1)}. \end{align*}\] If we now sum over all potential values of \((i_1,...i_m)\) we end up with: \[\begin{align*} \displaystyle &\sum_{i_m = 1}^{N}\cdots\sum_{i_1 = 1}^{N}P_{\lambda_l}(S_m = i_m,...,S_1 = i_1|\mathcal{Y}_T = \vec{y}_{T})f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m) = \sum_{i_m = 1}^{N}\cdots\sum_{i_1 = 1}^{N}\mu \rho_{i_m,...,i_1}^{(l+1)} \\ &\Leftrightarrow f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m) = \mu. 
\end{align*}\] We insert this for \(\mu\) and get: \[\begin{align*} \displaystyle &P_{\lambda_l}(S_m = i_m,...,S_1 = i_1|\mathcal{Y}_T = \vec{y}_{T})f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m) = f_{\mathcal{Y}_{T:(m+1)}|\mathcal{Y}_m;\lambda_{l}}(\vec{y}_{T:(m+1)}|\vec{y}_m)\rho_{i_m,...,i_1}^{(l+1)} \\ &\Leftrightarrow P_{\lambda_l}(S_m = i_m,...,S_1 = i_1|\mathcal{Y}_T = \vec{y}_{T}) = \rho_{i_m,...,i_1}^{(l+1)}. \end{align*}\] This concludes the derivation of the EM algorithm for models with an underlying Markov-Chain and a dependence on a maximum lag order.
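The closed-form updates (38) and (40) translate directly into code. The R sketch below assumes the E-step has produced the smoothed pairwise probabilities \(P_{\lambda_l}(S_t = j, S_{t-1} = i|\mathcal{Y}_T = \vec{y}_T)\) in a (hypothetical) array `p_joint` with entries `p_joint[t, i, j]`; the update for \(\rho\) is then simply the smoothed probability of the corresponding initial regime block and needs no extra computation.

```r
# Hedged sketch of the M-step update (38) for the transition matrix.
update_Pi <- function(p_joint, m) {
  T_len <- dim(p_joint)[1]
  idx <- (m + 1):T_len
  # numerator: sum_t P(S_t = j, S_{t-1} = i | Y_T), an N x N matrix indexed by (i, j)
  num <- apply(p_joint[idx, , , drop = FALSE], c(2, 3), sum)
  # denominator: sum_t P(S_{t-1} = i | Y_T), i.e. the row sums of the numerator
  sweep(num, 1, rowSums(num), "/")
}
```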