Abstract

We propose an easy-to-implement simulated maximum likelihood estimator for dynamic models where no closed-form representation of the likelihood function is available. Our method can handle any simulable model without latent dynamics. Using simulated observations, we nonparametrically estimate the unknown density by kernel methods, and then construct a likelihood function that can be maximized. We prove that this nonparametric simulated maximum likelihood (NPSML) estimator is consistent and asymptotically efficient. The higher-order impact of simulations and kernel smoothing on the resulting estimator is also analyzed; in particular, it is shown that the NPSML does not suffer from the usual curse of dimensionality associated with kernel estimators. A simulation study shows good performance of the method when employed in the estimation of jump-diffusion models.
We thank the seminar participants at Berkeley, BU, Brown, Columbia, LSE, NYU, Rice, and Stanford for many useful comments. We also thank the referees who offered exceptionally thorough and helpful comments. Kyu-Chul Jung provided excellent research assistance. Kristensen gratefully acknowledges the financial support of the National Science Foundation (SES-0961596) and of the Danish Research Foundation (through a grant to CREATES). Department of Economics, Columbia University and CREATES, Aarhus University (e-mail: dk2313@columbia.edu). Department of Economics, Washington University in St. Louis and Federal Reserve Bank of St. Louis (e-mail: yshin@wustl.edu).
1 Introduction
We propose a simulated maximum likelihood estimator for dynamic models based on nonparametric kernel methods. Our method is designed for models where no closed-form representation of the likelihood function is available, and it can handle any simulable model without latent dynamics. For any given parameter value, conditioning on available past information, we draw N i.i.d. simulated observations from the model. We then use these simulated observations to nonparametrically estimate the conditional density, unknown in closed form, by kernel methods. The kernel estimate converges to the true conditional density as N goes to infinity, enabling us to approximate the true density arbitrarily well with a sufficiently large N. We then construct the likelihood and search over the parameter space to obtain a maximum likelihood estimator, the nonparametric simulated maximum likelihood estimator (NPSMLE). NPSML was introduced by Fermanian and Salanié (2004), who obtained theoretical results only for static models. In this paper, we generalize their method to dynamic models, including nonstationary and time-inhomogeneous processes. We give general conditions for the NPSMLE to be consistent and have the same asymptotic distribution as the infeasible maximum likelihood estimator (MLE). For the stationary case, we also analyze the impact of simulations on the bias and variance of the NPSMLE. In particular, we show that the estimator does not suffer from the curse of dimensionality despite the use of kernel smoothers. Finally, we show that the theoretical results remain valid even if only simulations from an approximate model are available.

NPSML can be used for estimating general classes of models, such as structural Markov decision processes and discretely-sampled diffusions. As for Markov decision processes, the transition density of endogenous state variables embodies an optimal policy function of a dynamic programming problem, and hence does not typically have a closed-form representation (Rust, 1994; Doraszelski and Pakes, 2007). However, we can closely approximate the optimal policy function numerically, and simulate observations from the model for NPSML. Similarly, as for the estimation of continuous-time stochastic models with discretely-sampled data, the transition densities are well-defined, but only in a few special cases can we derive closed-form expressions for them. Again, a large class of continuous-time processes, including jump-diffusions, can be approximated with various discretization schemes to a given level of precision, and we can simulate observations from the model which are then used for NPSML. Indeed, we investigate the performance of NPSML when applied to jump-diffusion models, with particular attention to the impact of the number of simulations and the bandwidth. We find that NPSML performs well even for a moderate number of simulations and that it is quite robust to the choice of bandwidth.

For the classes of models that NPSML addresses, there are two categories of existing approaches. The first is based on moment matching, and includes simulated methods of moments (Lee and Ingram, 1991; Duffie and Singleton, 1993; Creel and Kristensen, 2009), indirect inference (Gouriéroux et al., 1993; Smith, 1993; Creel and Kristensen, 2011), and efficient methods of moments (Gallant and Tauchen, 1996). These are all general-purpose methods, but cannot attain asymptotic efficiency.
2 Nonparametric Simulated Maximum Likelihood
Suppose that we have $T$ observations, $\{(y_t, x_t)\}_{t=1}^T$, $y_t \in \mathbb{R}^k$ and $x_t \in \mathcal{X}_t$. The space $\mathcal{X}_t$ can be time-varying. We assume that the data is generated by a fully parametric model:

$$y_t = g_t(x_t, \varepsilon_t; \theta), \quad t = 1, \ldots, T, \qquad (1)$$

where $\theta \in \Theta \subseteq \mathbb{R}^d$ is an unknown parameter vector, and $\varepsilon_t$ is an i.i.d. sequence with known distribution $F_\varepsilon$ and independent of $x_t$. Without loss of generality, assume that $F_\varepsilon$ is known and does not depend on $t$ and $\theta$. Our setting accommodates Markov models where $x_t = y_{t-1}$, such that $\{y_t\}$ is a (possibly time-inhomogeneous) Markov process. In this case (1) is a fully-specified model. However, we allow $x_t$ to contain other (exogenous) variables than lagged $y_t$, in which case (1) is only a partially-specified model. Also, we allow the processes $(y_t, x_t)$ to be nonstationary, for example due to unit-root-type behavior or deterministic time trends. The model is assumed to have an associated conditional density $p_t(y|x; \theta)$. That is,
$$P(y_t \in A \mid x_t = x) = \int_A p_t(y|x; \theta)\, dy, \quad t = 1, \ldots, T,$$
for any Borel set $A \subseteq \mathbb{R}^k$. A natural estimator of $\theta$ is then the maximizer of the conditional log-likelihood:
$$\tilde{\theta} = \arg\max_{\theta \in \Theta} L_T(\theta), \qquad L_T(\theta) = \sum_{t=1}^T \log p_t(y_t|x_t; \theta).$$
If the model (1) is fully specified, i.e. $x_t$ only contains lagged $y_t$, then this is the full likelihood of the model conditional on the starting value. If, on the other hand, $x_t$ contains other variables than lagged $y_t$, $L_T(\theta)$ is a partial likelihood.

Suppose now that $p_t(y|x; \theta)$ does not have a closed-form representation, and thus maximum likelihood estimation of $\theta$ is not feasible. In terms of the model (1), this occurs when either the inverse of $g_t(x_t, \varepsilon_t; \theta)$ w.r.t. $\varepsilon_t$ does not exist, or when the inverse does not have a closed-form expression.[1] Such a situation may arise, for example, when the function $g$ involves a solution to a dynamic programming problem, or when we are dealing with discretely-sampled jump-diffusions. In such cases, although $p_t(y|x; \theta)$ is not available in closed form, we are still able to generate simulated observations from the model: a solution to a dynamic programming problem can be represented numerically, and a jump-diffusion can be approximated by various discretization schemes up to a given level of precision. We here propose a general method to obtain a simulated conditional density, which in turn will be used to obtain a simulated version of the MLE. For any given $1 \le t \le T$, $y_t \in \mathbb{R}^k$, $x_t \in \mathcal{X}_t$, and
[1] If the inverse has a closed-form expression, we have $p_t(y|x; \theta) = p_\varepsilon\big(g_t^{-1}(y; x, \theta)\big)\,\big|\partial g_t^{-1}(y; x, \theta)/\partial y\big|$, and the likelihood is easily evaluated.
$\theta \in \Theta$, we wish to compute a simulated version of $p_t(y_t|x_t; \theta)$. To this end, we first generate $N$ i.i.d. draws from $F_\varepsilon$, $\{\varepsilon_i\}_{i=1}^N$, through a random number generator, and use these to compute

$$Y_{t,i}^\theta = g_t(x_t, \varepsilon_i; \theta), \quad i = 1, \ldots, N.$$
By construction, the $N$ simulated i.i.d. random variables, $\{Y_{t,i}^\theta\}_{i=1}^N$, follow the target distribution: $Y_{t,i}^\theta \sim p_t(\cdot|x_t; \theta)$, $i = 1, \ldots, N$. They can therefore be used to estimate $p_t(y|x; \theta)$ with kernel methods. Define:

$$\hat{p}_t(y_t|x_t; \theta) = \frac{1}{N} \sum_{i=1}^N K_h\big(Y_{t,i}^\theta - y_t\big), \qquad (2)$$
where $K_h(\cdot) = K(\cdot/h)/h^k$, $K : \mathbb{R}^k \mapsto \mathbb{R}$ is a kernel, and $h > 0$ a bandwidth.[2] Under regularity conditions on $p_t$ and $K$, we obtain:

$$\hat{p}_t(y_t|x_t; \theta) = p_t(y_t|x_t; \theta) + O_P\big(1/\sqrt{N h^k}\big) + O_P(h^2), \quad N \to \infty,$$
where the remainder terms are $o_P(1)$ if $h \to 0$ and $N h^k \to \infty$. Once (2) has been used to obtain the simulated conditional density, we can now construct the following simulated MLE of $\theta_0$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \hat{L}_T(\theta), \qquad \hat{L}_T(\theta) = \sum_{t=1}^T \log \hat{p}_t(y_t|x_t; \theta).$$
When searching for $\hat{\theta}$ through numerical optimization, we use the same draws for all values of $\theta$. We may also use the same batch of draws from $F_\varepsilon$, $\{\varepsilon_i\}_{i=1}^N$, across different values of $t$ and $x$. Numerical optimization is facilitated if $\hat{L}_T(\theta)$ is continuous and differentiable in $\theta$. With (2), if $K$ and $\theta \mapsto g_t(x, \varepsilon; \theta)$ are $r \ge 0$ times continuously differentiable, then $\hat{L}_T(\theta)$ has the same property. This follows from the chain rule and the fact that we use the same random draws $\{\varepsilon_i\}_{i=1}^N$ for all values of $\theta$.

Since $\hat{p}_t(y_t|x_t; \theta) \to_P p_t(y_t|x_t; \theta)$, $\hat{L}_T(\theta) \to_P L_T(\theta)$ as $N \to \infty$ for a given $T \ge 1$ under regularity conditions. The main theoretical results of this paper demonstrate that $\hat{\theta}$ inherits the properties of the infeasible MLE, $\tilde{\theta}$, as $T, N \to \infty$, under suitable conditions. The precision of $\hat{\theta}$ relative to $\tilde{\theta}$ clearly depends on the quality of the approximation of $p_t(y|x; \theta)$ by $\hat{p}_t(y|x; \theta)$.

Let us note the following important points concerning the impact of the simulated density. Firstly, because we use i.i.d. draws, the density estimator is not affected by the dependence structure in the observed data. In particular, our estimator works whether the observed data are i.i.d. or nonstationary. Secondly, the simulated density, $\hat{p}_t(y|x; \theta)$, suffers from the usual curse of dimensionality for kernel density estimators, with its variance being of order $1/(N h^k)$. The curse of dimensionality only depends on $k = \dim(y_t)$ here since we do not smooth over $x_t$, and so the dimension of $x_t$ is irrelevant in itself. Still one could be concerned that for high-dimensional

[2] Here and in the following, we will use $K$ to denote a generic kernel.
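To fix ideas, the following is a minimal sketch of the simulated likelihood built from the kernel estimator (2), assuming scalar i.i.d. Gaussian errors, a Gaussian product kernel, and a user-supplied model simulator; the function names (`simulate`, `npsml_loglik`, `npsml_estimate`) are illustrative, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def npsml_loglik(theta, y, x, simulate, eps, h):
    """Simulated log-likelihood based on the kernel estimator (2).

    y        : (T, k) array of observed responses y_t
    x        : (T,) or (T, dx) array of conditioning variables x_t
    simulate : user-supplied simulator; simulate(x_t, eps, theta) returns
               an (N, k) array of draws Y_{t,i} from the model g_t
    eps      : common random numbers drawn once from F_eps and reused for
               every theta, which keeps the objective smooth in theta
    h        : bandwidth
    """
    T, k = y.shape
    ll = 0.0
    for t in range(T):
        Y = simulate(x[t], eps, theta)                    # (N, k) draws
        u = (Y - y[t]) / h
        # Gaussian product kernel: K_h(u) = prod_j phi(u_j) / h^k
        K = np.exp(-0.5 * np.sum(u**2, axis=1)) / ((2*np.pi)**(k/2) * h**k)
        p_hat = K.mean()                                  # equation (2)
        ll += np.log(max(p_hat, 1e-300))                  # crude guard against
        # log(0); the theory instead uses the smooth trimming of Section 3
    return ll

def npsml_estimate(theta0, y, x, simulate, N=500, h=0.1, seed=0):
    eps = np.random.default_rng(seed).standard_normal(N)  # draws from F_eps
    obj = lambda th: -npsml_loglik(th, y, x, simulate, eps, h)
    return minimize(obj, theta0, method="Nelder-Mead").x
```

Holding `eps` fixed across parameter values is one way to implement the common-random-numbers device described above.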
the wavelet estimator of Donoho et al. (1996).
Example: Discretely-Observed Jump-Diffusion. Consider an $\mathbb{R}^k$-dimensional continuous-time stochastic process $\{y_t : t \ge 0\}$ that solves the following stochastic differential equation:

$$dy_t = \mu(t, y_t; \theta)\, dt + \sigma(t, y_t; \theta)\, dW_t + J_t\, dQ_t. \qquad (3)$$
The model contains both continuous and jump components. $W_t \in \mathbb{R}^l$ is a standard Brownian motion, while $Q_t$ is an independent pure jump process with stochastic intensity $\lambda(t, y_t; \theta)$ and jump size 1. The functions $\mu : [0, \infty) \times \mathbb{R}^k \mapsto \mathbb{R}^k$ and $\sigma : [0, \infty) \times \mathbb{R}^k \mapsto \mathbb{R}^{k \times l}$ are the drift and the diffusion term, respectively, while $J_t$ measures the jump sizes and has density $\nu(t, y_t; \theta)$. Such jump-diffusions are widely used in finance to model the dynamics of stock prices, interest rates, exchange rates, and so on (Sundaresan, 2000).

Suppose we have a sample $y_1, \ldots, y_T$ (without loss of generality, we normalize the time interval between observations to 1) and wish to estimate $\theta$ by maximum likelihood. Although under regularity conditions (Lo, 1988) the transition density $p_t(y|x; \theta)$ satisfying $P(y_{t+1} \in A \mid y_t = x) = \int_A p_t(y|x; \theta)\, dy$ is well-defined, it cannot in general be written in closed form, which in turn complicates estimation.[5] However, discretization schemes (Kloeden and Platen, 1992; Bruti-Liberati and Platen, 2007) can be used to simulate observations from the model for any given level of accuracy, enabling NPSML. We revisit this example in Section 4, where we provide a detailed description of implementing NPSML in practice.
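As a concrete illustration of such a discretization scheme, the sketch below simulates end-of-interval draws from a CIR-type jump-diffusion by an Euler scheme. The specific drift, diffusion, intensity, and jump-size choices are assumptions made for the example, not the paper's specification; for use inside the NPSML objective, the underlying random draws should be generated once and held fixed across values of $\theta$.

```python
import numpy as np

def simulate_cir_jump(x0, theta, N=500, M=20, rng=None):
    """Euler discretization of an illustrative CIR-type jump-diffusion,

        dy = kappa*(alpha - y) dt + sigma*sqrt(max(y, 0)) dW + J dQ,

    with constant jump intensity lam and exponential jump sizes of mean muJ.
    Returns N draws of y_{t+1} given y_t = x0, splitting the unit observation
    interval into M Euler steps. theta = (kappa, alpha, sigma, lam, muJ).
    """
    kappa, alpha, sigma, lam, muJ = theta
    rng = rng or np.random.default_rng()
    dt = 1.0 / M
    y = np.full(N, float(x0))
    for _ in range(M):
        dW = np.sqrt(dt) * rng.standard_normal(N)          # Brownian increment
        # with probability lam*dt, a jump of exponential size occurs this step
        jump = (rng.random(N) < lam * dt) * rng.exponential(muJ, N)
        y = y + kappa * (alpha - y) * dt \
              + sigma * np.sqrt(np.maximum(y, 0.0)) * dW + jump
    return y
```

Refining $M$ controls the discretization error, which is the sense in which the model can be simulated "for any given level of accuracy."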
Latent Dynamics. Our method can be modified to handle dynamic latent variables: Suppose $y_t$ is generated from

$$[y_t, w_t] = g(y_{t-1}, w_{t-1}, \varepsilon_t; \theta),$$
where $w_t$ is unobserved/latent and $\varepsilon_t$ is i.i.d. $F_\varepsilon$. The full likelihood function will require computation of conditional densities of the form $p(y_t|y_{t-1}, y_{t-2}, \ldots, y_0; \theta)$, which in general is complicated due to the expanding information set; see e.g. Brownlees et al. (2011). We can, however, construct a simulated version of the following "limited information" likelihood (LIL) given by $L_T(\theta) = \sum_{t=1}^T \log p(y_t|x_t; \theta)$, where $x_t$ is a set of conditioning variables chosen by the econometrician, say, $x_t = (y_{t-1}, \ldots, y_{t-m})$ for some $m \ge 1$. There will be an efficiency loss from estimating $\theta$ using this LIL relative to the full likelihood, but the LIL is a lot easier to implement: First simulate a (long) trajectory $\{Y_t^\theta\}_{t=1}^{\tilde{N}}$ by

$$\big[Y_t^\theta, W_t^\theta\big] = g\big(Y_{t-1}^\theta, W_{t-1}^\theta, \varepsilon_t; \theta\big), \quad t = 1, \ldots, \tilde{N},$$

[5] Schaumburg (2001) and Yu (2007), building on the approach of Aït-Sahalia (2002), use analytic expansions to approximate the transition density for univariate and multivariate jump diffusions, respectively. Their asymptotic result requires that the sampling interval shrink to zero. The simulated MLE of Pedersen (1995a,b) or Brandt and Santa-Clara (2002) needs to be substantially modified before it can be applied to Lévy processes.
where $\{\varepsilon_t\}_{t=1}^{\tilde{N}}$ are i.i.d. draws from $F_\varepsilon$. We can then use these simulations to construct a simulated version of $p(y_t|x_t; \theta)$ by the following kernel estimator of the conditional density,

$$\bar{p}(y|x; \theta) = \frac{\sum_{t=1}^{\tilde{N}} K_h\big(Y_t^\theta - y\big) K_h\big(X_t^\theta - x\big)}{\sum_{t=1}^{\tilde{N}} K_h\big(X_t^\theta - x\big)}, \qquad (4)$$

where $X_t^\theta = \big(Y_{t-1}^\theta, \ldots, Y_{t-m}^\theta\big)$. Similar ideas were utilized in Altissimo and Mele (2009) and Creel and Kristensen (2009).

A disadvantage of the above method is that the convergence of $\bar{p}$ relative to $\hat{p}$ will be slower because (i) the dimension of $(Y_t^\theta, X_t^\theta)$ can potentially be quite large, and (ii) the simulated variables are now dependent. So one will have to choose a larger $\tilde{N}$ for the simulated conditional density in (4) relative to the one in (2). To handle (ii), one will typically have to assume a stationary solution to the dynamic system under consideration, and either have to start the simulation from the stationary distribution, or assume that the simulated process converges towards the stationary distribution at a suitable rate. For the latter to hold, one will need to impose some form of mixing condition on the process, as in Altissimo and Mele (2009) and Creel and Kristensen (2009). Then a large value of $\tilde{N}$ is needed to ensure that the simulated process is sufficiently close to its stationary distribution; that is, one has to allow for a burn-in.

The estimator in (4) may work under nonstationarity as well. Recently, a number of papers have considered kernel estimation of nonstationary Markov processes. The kernel estimator proves to be consistent and asymptotically mixed-normally distributed when the Markov process is recurrent (Karlsen and Tjøstheim, 2001; Bandi and Phillips, 2003). However, the convergence rate will be path-dependent and relatively slow.

In the remainder of this paper we focus on (2). The properties of (4) can be obtained by following the same strategy of proof as the one we employ for (2). The only difference is that, to obtain $\bar{p} \to_P p$ in the sup-norm, one has to take into account the dependence of the simulated values. This can be done along the lines of Creel and Kristensen (2009), where kernel regressions and simulations are combined to compute GMM estimators for dynamic latent variable models.
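A minimal sketch of the conditional kernel estimator (4), computed from one long simulated trajectory for scalar $Y_t$ with a Gaussian product kernel; the burn-in handling follows the discussion above, and all names are illustrative.

```python
import numpy as np

def cond_density_from_path(Y, y, x, m, h, burn_in=1000):
    """Conditional kernel density estimator (4) from one simulated path {Y_t},
    with X_t = (Y_{t-1}, ..., Y_{t-m}); scalar Y_t for simplicity.
    An initial burn-in is dropped so that the retained stretch of the path is
    close to its stationary regime."""
    Y = np.asarray(Y, dtype=float)[burn_in:]
    L = len(Y)
    # lag matrix: row for time t holds (Y_{t-1}, ..., Y_{t-m})
    X = np.column_stack([Y[m - 1 - j: L - 1 - j] for j in range(m)])
    Yt = Y[m:]
    x = np.atleast_1d(x)
    Kx = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1)) \
         / ((2 * np.pi) ** (m / 2) * h ** m)
    Ky = np.exp(-0.5 * ((Yt - y) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    return float((Ky * Kx).sum() / max(Kx.sum(), 1e-300))
```

Smoothing over the $(m+1)$-dimensional pair $(Y_t, X_t)$ rather than only over $y$ is exactly why (4) needs a larger $\tilde{N}$ than (2).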
Discrete Random Variables. Discrete random variables can be accommodated within our framework. Suppose $y_t$ contains both continuous and discrete random variables. For example, $y_t = (y_{1t}, y_{2t}) \in \mathbb{R}^{k+l}$, where $y_{1t} \in \mathbb{R}^k$ is a continuous random variable while $y_{2t} \in \mathcal{Y}_2 \subseteq \mathbb{R}^l$ is a random variable with a (potentially infinite) number of discrete outcomes, $\mathcal{Y}_2 = \{y_{2,1}, y_{2,2}, \ldots\}$. We could then use a mixed kernel to estimate $p_t(y|x)$. For given simulated observations $Y_{t,i}^\theta = \big(Y_{1t,i}^\theta, Y_{2t,i}^\theta\big)$, $i = 1, \ldots, N$:

$$\hat{p}_t(y_1, y_2|x; \theta) = \frac{1}{N} \sum_{i=1}^N K_h\big(Y_{1t,i}^\theta - y_1\big)\, \mathbb{I}\{Y_{2t,i}^\theta = y_2\}, \quad (y_1, y_2) \in \mathbb{R}^{k+l}, \qquad (5)$$
where $\mathbb{I}\{\cdot\}$ is the indicator function and $K : \mathbb{R}^k \mapsto \mathbb{R}$ is the kernel from before. However, the resulting simulated log-likelihood will be discontinuous, and optimization may be difficult. One
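For concreteness, a minimal sketch of the mixed estimator (5) for one scalar continuous and one scalar discrete component (illustrative names):

```python
import numpy as np

def mixed_kernel_density(Y1, Y2, y1, y2, h):
    """Mixed kernel estimator (5): smooth over the continuous coordinate,
    match the discrete coordinate exactly. Y1, Y2: (N,) arrays of simulated
    continuous and discrete components; scalar evaluation point (y1, y2)."""
    Ky = np.exp(-0.5 * ((Y1 - y1) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    match = (Y2 == y2)   # indicator 1{Y2_i = y2}; this exact matching is what
                         # makes the simulated likelihood discontinuous in theta
    return float((Ky * match).mean())
```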
In all three cases, we can write the resulting simulated joint density in equation (7) by choosing $Y_{2t,i}^\theta = K^{(1)}\big(\tilde{Z}_{t,i}, y_{2t}|x_t\big)$, $Y_{2t,i}^\theta = K_b^{(2)}\big(Z_{t,i}, y_{2t}\big)$, and $Y_{2t,i}^\theta = \bar{K}_b^{(2)}\big(Z_{t,i}, y_{2t}\big)$, respectively. Here, $\theta \mapsto Y_{2t,i}^\theta$ is smooth, with a bias that disappears as $b \to 0$ and a variance that is bounded in $b$. Thus, the order of the variance of $\hat{L}_T(\theta)$ is not affected by any added discrete variables, and the curse of dimensionality remains of order $k = \dim(y_{1t})$.
Quasi-Maximum Likelihood Estimation. The use of our approximation method is not limited to actual MLEs. In many situations, one can define a quasi- or pseudo-likelihood which, even though it is not the true likelihood, identifies the parameters of the true model. One obvious example of this is the standard regression model, where the MLE based on Gaussian errors (i.e. the least-squares estimator) proves to be robust to deviations from the normality assumption. Another example is the estimation of (G)ARCH models using quasi-maximum likelihood; see e.g. Lee and Hansen (1994). These are cases where the quasi-likelihood can be written explicitly. If one cannot find explicit expressions for the quasi-likelihood, one can instead employ our estimator, simulating from the quasi-model: Suppose, for example, that data has been generated by the model (1), but the data-generating distribution of the errors is unknown. We could then choose a suitable distribution $F_\varepsilon$, draw $\{\varepsilon_i\}_{i=1}^N$ from $F_\varepsilon$, and then proceed as in Section 2.1. The resulting estimator would no longer be a simulated MLE but rather a simulated QMLE. In this setting, the asymptotic distribution should be adjusted to accommodate the fact that we are not using the true likelihood to estimate the parameters. This obviously extends to the case of misspecified models as in White (1984).

The above procedure is one example of how our simulation method can be applied to non- and semiparametric estimation problems where an infinite-dimensional component of the model is unknown. Another example is the situation where data has been generated by the model (1) with known distribution $F_\varepsilon$, but now $\theta = (\beta, \gamma)$, where $\beta$ and $\gamma$ are finite- and infinite-dimensional parameters, respectively. An application of our method in this setting can be found in Kristensen (2010), where $\gamma$ is a density. Again, our asymptotic results have to be adjusted to allow for $\theta$ to contain infinite-dimensional parameters.
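In code, the simulated QMLE described above differs from the simulated MLE only in the distribution the error draws come from; a sketch, with a standardized Student-t as one illustrative choice of quasi-distribution:

```python
import numpy as np

# The simulated QMLE only changes where the common random numbers come from:
# instead of the (unknown) true F_eps, draw once from a chosen
# quasi-distribution, here a Student-t standardized to unit variance.
rng = np.random.default_rng(0)
df = 5
eps_quasi = rng.standard_t(df, size=1000) / np.sqrt(df / (df - 2))
# ...then proceed exactly as in the NPSML sketch of Section 2, passing
# eps_quasi as the error draws. Inference must use a sandwich-type variance,
# since the quasi-likelihood is not the true likelihood.
```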
3 Asymptotic Properties of NPSMLE
Given the convergence of the simulated conditional density towards the true one, we expect that the NPSMLE $\hat{\theta}$ based on the simulated kernel density estimator will have the same asymptotic properties as the infeasible MLE $\tilde{\theta}$ for a suitably chosen sequence $N = N(T)$ and $h = h(N)$. We give two sets of results: The first establishes that $\hat{\theta}$ is first-order asymptotically equivalent to $\tilde{\theta}$ under general conditions, allowing for nonstationarity. Under additional assumptions, including stationarity, we derive expressions for the leading bias and variance components of $\hat{\theta}$ relative to the actual MLE due to simulations and kernel smoothing, and give results for the higher-order asymptotic properties of $\hat{\theta}$.

We allow for a mixed discrete and continuous distribution of the response variable, and write $y_t = (y_{1t}, y_{2t}) \in \mathcal{Y}_1 \times \mathcal{Y}_2$, where $\mathcal{Y}_1 \subseteq \mathbb{R}^k$ and $\mathcal{Y}_2 = \{y_{2,1}, y_{2,2}, \ldots\} \subseteq \mathbb{R}^l$. Here, $y_{1t}$ has a continuous distribution, while $y_{2t}$ is discrete. The joint distribution can be written as $p_t(y_1, y_2|x; \theta) = p_t(y_2|y_1, x; \theta)\, p_t(y_1|x; \theta)$, where the $p_t(y_{2,i}|y_1, x; \theta)$ are conditional probabilities satisfying $\sum_i p_t(y_{2,i}|y_1, x; \theta) = 1$, while $p_t(y_1|x; \theta)$ is a conditional density w.r.t. the Lebesgue measure. Also, let $p_t(y_{2,i}|x; \theta)$ denote the conditional probabilities of $y_{2t}|x_t = x$. The asymptotics are derived for the kernel estimator given in equation (7), where
$$Y_{1t,i}^\theta := g_{1,t}(x_t, \varepsilon_i; \theta), \qquad (11)$$

$$Y_{2t,i}^\theta := g_{2,t}(y_{2t}, x_t, \varepsilon_i; \theta), \qquad (12)$$

for $i = 1, \ldots, N$ and $t = 1, \ldots, T$, where $\{\varepsilon_i\}_{i=1}^N$ are i.i.d. draws from $F_\varepsilon$, such that equation (6) holds. Recall that $Y_{2t,i}^\theta$ denotes a simulated value of the associated density, and not the outcome of the dependent variable. The condition in equation (6) is met when $Y_{2t,i}^\theta = K^{(1)}\big(\tilde{Z}_{t,i}, y_{2t}|x_t\big)$ with $K^{(1)}$ given in equation (8), while it only holds approximately for $K^{(2)}$ and $\bar{K}^{(2)}$ defined in equations (9) and (10), due to biases induced by the use of kernel smoothing. We handle these two cases in Theorem 3.4, where results for approximate simulations are given.

Note that we here use the same errors to generate the simulations over time. An alternative simulation scheme would be to draw a new batch of errors for each observation $x_t$, $Y_{t,i}^\theta = g_t(x_t, \varepsilon_{t,i}; \theta)$, $i = 1, \ldots, \tilde{N}$, such that the total number of simulations would be $\tilde{N} \times T$: $\{\varepsilon_{i,t}\}_{i=1}^{\tilde{N}}$, $t = 1, \ldots, T$. Under regularity conditions, the NPSMLE based on this simulation scheme would have asymptotic properties similar to the one based on the simulations in equations (11) and (12). However, as demonstrated in Lee (1992), choosing $N = \tilde{N} T$, the variance of the NPSMLE based on equations (11) and (12) will be smaller.[6]

In order for $\hat{\theta}$ to be asymptotically equivalent to $\tilde{\theta}$, we need $\hat{p} \to_P p$ sufficiently fast in some suitable function norm. To establish this, we verify the general conditions for uniform rates of kernel estimators found in Kristensen (2009). These general conditions are satisfied under the following set of regularity conditions regarding the model and its associated conditional density:
A.1 The functions $(x, t, \theta) \mapsto g_{1,t}(x, \varepsilon; \theta)$ and $(x, t, \theta) \mapsto g_{2,t}(y_2, x, \varepsilon; \theta)$ are continuously differentiable for all $y_2$ and $\varepsilon$, such that for some function $\Lambda(\cdot)$ and constants $\alpha_{i,j} \ge 0$, $i, j = 1, 2$,

$$\|g_{1,t}(x, \varepsilon; \theta)\| \le \Lambda(\varepsilon)\big[1 + \|x\|^{\alpha_{1,1}} + t^{\alpha_{1,2}}\big], \qquad \|g_{2,t}(y_2, x, \varepsilon; \theta)\| \le \Lambda(\varepsilon)\big[1 + \|x\|^{\alpha_{2,1}} + t^{\alpha_{2,2}}\big],$$

and $E[\Lambda(\varepsilon)^s] < \infty$ for some $s > 2$. The derivatives of $g_1$ and $g_2$ w.r.t. $(x, t, \theta)$ satisfy the same bounds.
A.2 The conditional density $p_t(y_1, y_2|x; \theta)$ is continuous w.r.t. $\theta \in \Theta$, and $r \ge 2$ times continuously differentiable w.r.t. $y_1$, with the $r$-th derivative being uniformly continuous. There exists

[6] The results of Lee (1992) are for discrete choice models, but we conjecture that his results can be extended to general simulated MLE.
$\int_{\mathbb{R}^k} K(u)\, du = 1$, and for some $r \ge 1$: $\int_{\mathbb{R}^k} K(u)\, u^\alpha\, du = 0$ for $1 \le |\alpha| \le r - 1$, and $\int_{\mathbb{R}^k} K(u)\, \|u\|^r\, du < \infty$.
K.2 The first and second derivatives of $K$ also satisfy K.1.1.
This is a broad class of kernels allowing for unbounded support. For example, the Gaussian kernel satisfies K.1 with $r = 2$. When $r > 2$, $K$ is a so-called higher-order kernel that reduces the bias of $\hat{p}$ and its derivatives, and thereby obtains a faster rate of convergence. The smoothness of $p$, as measured by its number of derivatives, $r$, determines the degree of bias reduction. The additional assumption K.2 is used in conjunction with Assumptions A.3 and A.4 to show that the first and the second derivatives of $\hat{p}$ w.r.t. $\theta$ also converge uniformly.

Next, we impose regularity conditions on the model to ensure that the actual MLE is asymptotically well-behaved. We first introduce the relevant terms driving the asymptotics of the MLE, and normalize the log-likelihood by some factor $\lambda_T \to \infty$:

$$L_T(\theta) = \frac{1}{\lambda_T} \sum_{t=1}^T \log p_t(y_t|x_t; \theta).$$
This normalizing factor $\lambda_T$ is introduced to ensure that $L_T(\theta)$ is well-behaved asymptotically and that certain functions of the data are suitably bounded, c.f. C.1–C.4 below. It is only important for the theoretical derivations, and not relevant for the actual implementation of our estimator, since $\lambda_T$ does not depend on $\theta$. The choice of $\lambda_T$ depends on the dynamics of the model. The standard choice is $\lambda_T = T$, as is, for example, the case when the model is stationary. In order to allow for non-standard behavior of the likelihood due to, for example, stochastic and deterministic trends, we do not impose this restriction though.

We also redefine the simulated version of the likelihood: In order to obtain uniform convergence of $\log \hat{p}_t(y|x; \theta)$, we need to introduce trimming of the approximate log-likelihood, as is standard in the literature on semiparametric estimators. The trimmed and normalized version of the simulated log-likelihood is given as
$$\hat{L}_T(\theta) = \frac{1}{\lambda_T} \sum_{t=1}^T \tau_a\big(\hat{p}_t(y_t|x_t; \theta)\big) \log \hat{p}_t(y_t|x_t; \theta),$$
where $\tau_a(\cdot)$ is a continuously differentiable trimming function satisfying $\tau_a(z) = 1$ if $|z| > a$ and $\tau_a(z) = 0$ if $|z| < a/2$, with a trimming sequence $a = a(N) \to 0$. One could here simply use the indicator function for the trimming, but then $\hat{L}_T(\theta)$ would no longer be differentiable, and differentiability is useful when using numerical optimization algorithms to solve for $\hat{\theta}$.
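One concrete construction of such a trimming function (a sketch; any smooth transition on $[a/2, a]$ with vanishing endpoint derivatives works):

```python
import numpy as np

def tau(z, a):
    """A continuously differentiable trimming function: tau_a(z) = 0 for
    |z| < a/2 and tau_a(z) = 1 for |z| > a, joined by a smoothstep ramp
    whose derivative vanishes at both ends."""
    s = np.clip((np.abs(z) - a / 2) / (a / 2), 0.0, 1.0)  # map [a/2, a] -> [0, 1]
    return s * s * (3.0 - 2.0 * s)                        # cubic smoothstep
```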
Assuming that $L_T(\theta)$ is three times differentiable, c.f. Assumption C.3 below, we can define:

$$S_T(\theta) = \frac{\partial L_T(\theta)}{\partial \theta} = \frac{1}{\lambda_T} \sum_{t=1}^T \frac{\partial \log p_t(y_t|x_t; \theta)}{\partial \theta} \in \mathbb{R}^d,$$

$$H_T(\theta) = \frac{\partial^2 L_T(\theta)}{\partial \theta\, \partial \theta'} = \frac{1}{\lambda_T} \sum_{t=1}^T \frac{\partial^2 \log p_t(y_t|x_t; \theta)}{\partial \theta\, \partial \theta'} \in \mathbb{R}^{d \times d},$$

$$G_{T,i}(\theta) = \frac{\partial^3 L_T(\theta)}{\partial \theta\, \partial \theta'\, \partial \theta_i} = \frac{1}{\lambda_T} \sum_{t=1}^T \frac{\partial^3 \log p_t(y_t|x_t; \theta)}{\partial \theta\, \partial \theta'\, \partial \theta_i} \in \mathbb{R}^{d \times d}.$$
The information is then defined as:

$$i_T(\theta) = \frac{1}{\lambda_T} \sum_{t=1}^T E\left[\frac{\partial \log p_t(y_t|x_t; \theta)}{\partial \theta} \frac{\partial \log p_t(y_t|x_t; \theta)}{\partial \theta'}\right] = -E[H_T(\theta)] \in \mathbb{R}^{d \times d}.$$
We also define the diagonal matrix $I_T(\theta) = \mathrm{diag}\{i_T(\theta)\} \in \mathbb{R}^{d \times d}$, where $\mathrm{diag}\{i_T(\theta)\}$ denotes the diagonal elements of the matrix $i_T(\theta)$, and

$$U_T(\theta) = I_T^{-1/2}(\theta)\, S_T(\theta), \qquad V_T(\theta) = I_T^{-1/2}$$