Estimation of Dynamic Models with Nonparametric Simulated

Maximum Likelihood

Dennis Kristensen and Yongseok Shin

June 2011

Abstract

We propose an easy-to-implement simulated maximum likelihood estimator for dynamic models where no closed-form representation of the likelihood function is available. Our method can handle any simulable model without latent dynamics. Using simulated observations, we nonparametrically estimate the unknown density by kernel methods, and then construct a likelihood function that can be maximized. We prove that this nonparametric simulated maximum likelihood (NPSML) estimator is consistent and asymptotically efficient. The higher-order impact of simulations and kernel smoothing on the resulting estimator is also analyzed; in particular, it is shown that the NPSML does not suffer from the usual curse of dimensionality associated with kernel estimators. A simulation study shows good performance of the method when employed in the estimation of jump-diffusion models.

We thank the seminar participants at Berkeley, BU, Brown, Columbia, LSE, NYU, Rice, and Stanford for many useful comments. We also thank the referees who offered exceptionally thorough and helpful comments. Kyu-Chul Jung provided excellent research assistance. Kristensen gratefully acknowledges the financial support of the National Science Foundation (SES-0961596) and of the Danish Research Foundation (through a grant to CREATES).

Kristensen: Department of Economics, Columbia University and CREATES, Aarhus University (e-mail: dk2313@columbia.edu). Shin: Department of Economics, Washington University in St. Louis and Federal Reserve Bank of St. Louis (e-mail: yshin@wustl.edu).

1 Introduction

We propose a simulated maximum likelihood estimator for dynamic models based on nonparametric kernel methods. Our method is designed for models where no closed-form representation of the likelihood function is available. Our method can handle any simulable model without latent dynamics. For any given parameter value, conditioning on available past information, we draw N i.i.d. simulated observations from the model. We then use these simulated observations to nonparametrically estimate the conditional density, unknown in closed form, by kernel methods. The kernel estimate converges to the true conditional density as N goes to infinity, enabling us to approximate the true density arbitrarily well with a sufficiently large N. We then construct the likelihood and search over the parameter space to obtain a maximum likelihood estimator, the nonparametric simulated maximum likelihood estimator (NPSMLE).

NPSML was introduced by Fermanian and Salanié (2004), who obtained theoretical results only for static models. In this paper, we generalize their method to dynamic models, including nonstationary and time-inhomogeneous processes. We give general conditions for the NPSMLE to be consistent and have the same asymptotic distribution as the infeasible maximum likelihood estimator (MLE). For the stationary case, we also analyze the impact of simulations on the bias and variance of the NPSMLE. In particular, we show that the estimator does not suffer from the curse of dimensionality despite the use of kernel smoothers. Finally, we show that the theoretical results remain valid even if only simulations from an approximate model are available.

NPSML can be used for estimating general classes of models, such as structural Markov decision processes and discretely-sampled diffusions. As for Markov decision processes, the transition density of endogenous state variables embodies an optimal policy function of a dynamic programming problem, and hence does not typically have a closed-form representation (Rust, 1994; Doraszelski and Pakes, 2007). However, we can closely approximate the optimal policy function numerically, and simulate observations from the model for NPSML. Similarly, as for the estimation of continuous-time stochastic models with discretely-sampled data, the transition densities are well-defined, but only in a few special cases can we derive closed-form expressions for them. Again, a large class of continuous-time processes, including jump-diffusions, can be approximated with various discretization schemes to a given level of precision, and we can simulate observations from the model which are then used for NPSML. Indeed, we investigate the performance of NPSML when applied to jump-diffusion models, with particular attention to the impact of the number of simulations and the bandwidth. We find that NPSML performs well even for a moderate number of simulations and that it is quite robust to the choice of bandwidth.

For the classes of models that NPSML addresses, there are two categories of existing approaches. The first is based on moment matching, and includes simulated methods of moments (Lee and Ingram, 1991; Duffie and Singleton, 1993; Creel and Kristensen, 2009), indirect inference (Gouriéroux et al., 1993; Smith, 1993; Creel and Kristensen, 2011), and efficient methods of moments (Gallant and Tauchen, 1996). These are all general-purpose methods, but cannot attain asymptotic

2 Nonparametric Simulated Maximum Likelihood

2.1 Construction of NPSMLE

Suppose that we have $T$ observations, $\{(y_t, x_t)\}_{t=1}^{T}$, with $y_t \in \mathbb{R}^k$ and $x_t \in \mathcal{X}_t$. The space $\mathcal{X}_t$ can be time-varying. We assume that the data is generated by a fully parametric model:

$$y_t = g_t(x_t, \varepsilon_t, \theta), \quad t = 1, \dots, T, \qquad (1)$$

where $\theta \in \Theta \subseteq \mathbb{R}^d$ is an unknown parameter vector, and $\varepsilon_t$ is an i.i.d. sequence with known distribution $F_\varepsilon$, independent of $x_t$. Without loss of generality, assume that $F_\varepsilon$ is known and does not depend on $t$ and $\theta$. Our setting accommodates Markov models where $x_t \equiv y_{t-1}$, such that $\{y_t\}$ is a (possibly time-inhomogeneous) Markov process. In this case (1) is a fully specified model. However, we allow $x_t$ to contain other (exogenous) variables than lagged $y_t$, in which case (1) is only a partially specified model. Also, we allow the processes $(y_t, x_t)$ to be nonstationary, for example due to unit-root-type behavior or deterministic time trends. The model is assumed to have an associated conditional density $p_t(y|x, \theta)$. That is,

$$P(y_t \in A \mid x_t = x) = \int_A p_t(y|x, \theta)\, dy, \quad t = 1, \dots, T,$$

for any Borel set $A \subseteq \mathbb{R}^k$. A natural estimator of $\theta$ is then the maximizer of the conditional log-likelihood:

$$\tilde{\theta} = \arg\max_{\theta \in \Theta} L_T(\theta), \qquad L_T(\theta) = \sum_{t=1}^{T} \log p_t(y_t|x_t, \theta).$$

If the model (1) is fully specified, i.e. $x_t$ only contains lagged $y_t$, then this is the full likelihood of the model conditional on the starting value. If, on the other hand, $x_t$ contains other variables than lagged $y_t$, $L_T(\theta)$ is a partial likelihood.

Suppose now that $p_t(y|x, \theta)$ does not have a closed-form representation, and thus the maximum likelihood estimation of $\theta$ is not feasible. In terms of the model (1), this occurs when either the inverse of $g_t(x_t, \varepsilon_t, \theta)$ w.r.t. $\varepsilon_t$ does not exist, or when the inverse does not have a closed-form expression. (Footnote 1: If the inverse has a closed-form expression, we have $p_t(y|x, \theta) = p_\varepsilon\!\left(g_t^{-1}(y, x, \theta)\right)\left|\partial g_t^{-1}(y, x, \theta)/\partial y\right|$, and the likelihood is easily evaluated.) Such a situation may arise, for example, when the function $g$ involves a solution to a dynamic programming problem, or when we are dealing with discretely-sampled jump-diffusions. In such cases, although $p_t(y|x, \theta)$ is not available in closed form, we are still able to generate simulated observations from the model: A solution to a dynamic programming problem can be represented numerically, and a jump-diffusion can be approximated by various discretization schemes up to a given level of precision.

We here propose a general method to obtain a simulated conditional density, which in turn will be used to obtain a simulated version of the MLE. For any given $1 \le t \le T$, $y_t \in \mathbb{R}^k$, $x_t \in \mathcal{X}_t$, and $\theta \in \Theta$, we wish to compute a simulated version of $p_t(y_t|x_t, \theta)$. To this end, we first generate $N$ i.i.d. draws from $F_\varepsilon$, $\{\varepsilon_i\}_{i=1}^N$, through a random number generator, and use these to compute

$$Y^\theta_{t,i} = g_t(x_t, \varepsilon_i, \theta), \quad i = 1, \dots, N.$$

By construction, the $N$ simulated i.i.d. random variables, $\{Y^\theta_{t,i}\}_{i=1}^N$, follow the target distribution: $Y^\theta_{t,i} \sim p_t(\cdot|x_t, \theta)$, $i = 1, \dots, N$. They can therefore be used to estimate $p_t(y|x, \theta)$ with kernel methods. Define:

$$\hat{p}_t(y_t|x_t, \theta) = \frac{1}{N} \sum_{i=1}^{N} K_h(Y^\theta_{t,i} - y_t), \qquad (2)$$

where $K_h(\cdot) = K(\cdot/h)/h^k$, $K : \mathbb{R}^k \mapsto \mathbb{R}$ is a kernel, and $h > 0$ a bandwidth (here and in the following, we use $K$ to denote a generic kernel). Under regularity conditions on $p_t$ and $K$, we obtain:

$$\hat{p}_t(y_t|x_t, \theta) = p_t(y_t|x_t, \theta) + O_P\!\left(1/\sqrt{N h^k}\right) + O_P(h^2), \quad N \to \infty,$$

where the remainder terms are $o_P(1)$ if $h \to 0$ and $N h^k \to \infty$. Once (2) has been used to obtain the simulated conditional density, we can construct the following simulated MLE of $\theta_0$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \hat{L}_T(\theta), \qquad \hat{L}_T(\theta) = \sum_{t=1}^{T} \log \hat{p}_t(y_t|x_t, \theta).$$
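To fix ideas, the following Python sketch (ours, not the authors' code) implements the construction in (1)-(2) for a generic simulable model: draw the errors once, push them through the model to obtain simulated observations, evaluate the kernel density estimate at each observed data point, and maximize the resulting log-likelihood. The simulator `g`, the error sampler `draw_eps`, and the tuning values `N` and `h` are placeholders chosen for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_product_kernel(u):
    """Standard Gaussian product kernel K(u) on R^k, evaluated row-wise."""
    k = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2.0 * np.pi) ** (k / 2.0)

def npsml_loglik(theta, y, x, g, draw_eps, N=500, h=0.3, seed=0):
    """Simulated log-likelihood: for each t, draw {eps_i} from F_eps, compute
    Y_{t,i} = g(x_t, eps_i, theta), and evaluate the kernel estimate (2) at y_t."""
    rng = np.random.default_rng(seed)      # fixed seed: same draws for every theta
    eps = draw_eps(rng, N)
    k = y.shape[1]
    loglik = 0.0
    for t in range(len(y)):
        Y_sim = g(x[t], eps, theta)        # N simulated observations given x_t
        p_hat = gaussian_product_kernel((Y_sim - y[t]) / h).mean() / h ** k
        loglik += np.log(max(p_hat, 1e-300))   # crude guard against log(0)
    return loglik

def npsmle(y, x, g, draw_eps, theta0):
    """Maximize the simulated log-likelihood over theta (the NPSMLE)."""
    obj = lambda th: -npsml_loglik(th, y, x, g, draw_eps)
    return minimize(obj, theta0, method="Nelder-Mead").x
```

Keeping the random seed fixed across parameter values is what makes the simulated likelihood smooth in theta, as discussed next.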

When searching for $\hat{\theta}$ through numerical optimization, we use the same draws for all values of $\theta$. We may also use the same batch of draws from $F_\varepsilon$, $\{\varepsilon_i\}_{i=1}^N$, across different values of $t$ and $x$. Numerical optimization is facilitated if $\hat{L}_T(\theta)$ is continuous and differentiable in $\theta$. With (2), if $K$ and $\theta \mapsto g_t(x, \varepsilon, \theta)$ are $r \ge 0$ times continuously differentiable, then $\hat{L}_T(\theta)$ has the same property. This follows from the chain rule and the fact that we use the same random draws $\{\varepsilon_i\}_{i=1}^N$ for all values of $\theta$.

Since $\hat{p}_t(y_t|x_t, \theta) \overset{P}{\to} p_t(y_t|x_t, \theta)$, we have $\hat{L}_T(\theta) \overset{P}{\to} L_T(\theta)$ as $N \to \infty$ for a given $T \ge 1$ under regularity conditions. The main theoretical results of this paper demonstrate that $\hat{\theta}$ inherits the properties of the infeasible MLE, $\tilde{\theta}$, as $T, N \to \infty$, under suitable conditions. The precision of $\hat{\theta}$ relative to $\tilde{\theta}$ clearly depends on the quality of the approximation of $p_t(y|x, \theta)$ by $\hat{p}_t(y|x, \theta)$.

Let us note the following important points concerning the impact of the simulated density. Firstly, because we use i.i.d. draws, the density estimator is not affected by the dependence structure in the observed data. In particular, our estimator works whether the observed data are i.i.d. or nonstationary. Secondly, the simulated density, $\hat{p}_t(y|x, \theta)$, suffers from the usual curse of dimensionality for kernel density estimators, with its variance being of order $1/(N h^k)$. The curse of dimensionality only depends on $k \equiv \dim(y_t)$ here since we do not smooth over $x_t$, and so the dimension of $x_t$ is irrelevant in itself. Still one could be concerned that for high-dimensional

the wavelet estimator of Donoho et al. (1996).

Example: Discretely-Observed Jump Diffusion. Consider an $\mathbb{R}^k$-dimensional continuous-time stochastic process $\{y_t : t \ge 0\}$ that solves the following stochastic differential equation:

$$dy_t = \mu(t, y_t, \theta)\, dt + \sigma(t, y_t, \theta)\, dW_t + J_t\, dQ_t. \qquad (3)$$

The model contains both continuous and jump components. $W_t \in \mathbb{R}^l$ is a standard Brownian motion, while $Q_t$ is an independent pure jump process with stochastic intensity $\lambda(t, y_t, \theta)$ and jump size 1. The functions $\mu : [0, \infty) \times \mathbb{R}^k \times \Theta \mapsto \mathbb{R}^k$ and $\sigma : [0, \infty) \times \mathbb{R}^k \times \Theta \mapsto \mathbb{R}^{k \times l}$ are the drift and the diffusion term respectively, while $J_t$ measures the jump sizes and has density $v(t, y_t, \theta)$. Such jump diffusions are widely used in finance to model the dynamics of stock prices, interest rates, exchange rates and so on (Sundaresan, 2000).

Suppose we have a sample $y_1, \dots, y_T$ (without loss of generality, we normalize the time interval between observations to 1) and wish to estimate $\theta$ by maximum likelihood. Although under regularity conditions (Lo, 1988) the transition density $p_t(y|x, \theta)$ satisfying $P(y_{t+1} \in A|y_t = x) = \int_A p_t(y|x, \theta)\, dy$ is well-defined, it cannot in general be written in closed form, which in turn complicates estimation. (Footnote 5: Schaumburg (2001) and Yu (2007), building on the approach of Aït-Sahalia (2002), use analytic expansions to approximate the transition density for univariate and multivariate jump diffusions, respectively. Their asymptotic result requires that the sampling interval shrink to zero. The simulated MLE of Pedersen (1995a,b) or Brandt and Santa-Clara (2002) needs to be substantially modified before it can be applied to Lévy processes.) However, discretization schemes (Kloeden and Platen, 1992; Bruti-Liberati and Platen, 2007) can be used to simulate observations from the model for any given level of accuracy, enabling NPSML. We revisit this example in Section 4, where we provide a detailed description of implementing NPSML in practice.
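As an illustration of how such simulated observations might be generated in practice, here is a minimal Euler-type discretization of (3) with a Bernoulli approximation of the jump component. This is a hedged sketch: the drift `mu`, diffusion `sigma`, intensity `lam`, and jump sampler `draw_jump` are user-supplied placeholders, and the sub-step count is an arbitrary choice rather than a recommendation from the paper.

```python
import numpy as np

def euler_jump_step(y, theta, mu, sigma, lam, draw_jump, rng, dt=1.0, n_sub=20):
    """Advance the state y over one sampling interval (length dt) using n_sub
    Euler sub-steps of the jump-diffusion (3); a jump occurs in a sub-step with
    probability lam(y, theta) * h (Bernoulli approximation of dQ_t)."""
    h = dt / n_sub
    for _ in range(n_sub):
        dW = rng.normal(scale=np.sqrt(h), size=np.shape(y))
        y = y + mu(y, theta) * h + sigma(y, theta) * dW
        if rng.uniform() < lam(y, theta) * h:
            y = y + draw_jump(y, theta, rng)   # add a jump of size J_t
    return y

# Illustrative drift/diffusion for a scalar square-root (CIR-type) process:
# mu = lambda y, th: th[0] * (th[1] - y)
# sigma = lambda y, th: th[2] * np.sqrt(np.maximum(y, 0.0))
```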

2.2 Extensions and Alternative Schemes

Latent Dynamics. Our method can be modified to handle dynamic latent variables. Suppose $y_t$ is generated from

$$[y_t, w_t] = g(y_{t-1}, w_{t-1}, \varepsilon_t, \theta),$$

where $w_t$ is unobserved/latent and $\varepsilon_t$ is i.i.d. $F_\varepsilon$. The full likelihood function requires computation of conditional densities of the form $p(y_t|y_{t-1}, y_{t-2}, \dots, y_0, \theta)$, which in general is complicated due to the expanding information set; see e.g. Brownlees et al. (2011). We can, however, construct a simulated version of the following "limited information" likelihood (LIL), given by $L_T(\theta) = \sum_{t=1}^{T} \log p(y_t|x_t, \theta)$, where $x_t$ is a set of conditioning variables chosen by the econometrician, say $x_t = (y_{t-1}, \dots, y_{t-m})$ for some $m \ge 1$. There will be an efficiency loss from estimating $\theta$ using this LIL relative to the full likelihood, but the LIL is much easier to implement: First simulate a (long) trajectory $\{Y^\theta_t\}_{t=1}^{\tilde{N}}$ by

$$\left[Y^\theta_t, W^\theta_t\right] = g\!\left(Y^\theta_{t-1}, W^\theta_{t-1}, \varepsilon_t, \theta\right), \quad t = 1, \dots, \tilde{N},$$

where f"tg Nt~=1 are i.i.d. draws from F". We can then use these simulations to construct a simulated version of p (ytjxt; ) by the following kernel estimator of the conditional density,

$$\bar{p}(y|x, \theta) = \frac{\sum_{t=1}^{\tilde{N}} K_h(Y^\theta_t - y)\, K_h(X^\theta_t - x)}{\sum_{t=1}^{\tilde{N}} K_h(X^\theta_t - x)}, \qquad (4)$$

where $X^\theta_t = (Y^\theta_{t-1}, \dots, Y^\theta_{t-m})$.

Similar ideas were utilized in Altissimo and Mele (2009) and Creel and Kristensen (2009). A disadvantage of the above method is that the convergence of $\bar{p}$ relative to $\hat{p}$ will be slower because (i) the dimension of $(Y^\theta_t, X^\theta_t)$ can potentially be quite large, and (ii) the simulated variables are now dependent. So one will have to choose a larger $\tilde{N}$ for the simulated conditional density in (4) relative to the one in (2). To handle (ii), one will typically have to assume a stationary solution to the dynamic system under consideration, and either start the simulation from the stationary distribution, or assume that the simulated process converges towards the stationary distribution at a suitable rate. For the latter to hold, one will need to impose some form of mixing condition on the process, as in Altissimo and Mele (2009) and Creel and Kristensen (2009). Then a large value of $\tilde{N}$ is needed to ensure that the simulated process is sufficiently close to its stationary distribution; that is, one has to allow for a burn-in.

The estimator in (4) may work under nonstationarity as well. Recently, a number of papers have considered kernel estimation of nonstationary Markov processes. The kernel estimator proves to be consistent and asymptotically mixed-normally distributed when the Markov process is recurrent (Karlsen and Tjøstheim, 2001; Bandi and Phillips, 2003). However, the convergence rate will be path-dependent and relatively slow.

In the remainder of this paper we focus on (2). The properties of (4) can be obtained by following the same strategy of proof as the one we employ for (2). The only difference is that, to obtain $\bar{p} \overset{P}{\to} p$ in the sup-norm, one has to take into account the dependence of the simulated values. This can be done along the lines of Creel and Kristensen (2009), where kernel regressions and simulations are combined to compute GMM estimators for dynamic latent variable models.
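A minimal sketch of the estimator in (4), under the simplifying assumption of a scalar observable and a user-chosen number of lags m; the long simulated path, the bandwidth, and all function names are illustrative placeholders rather than the paper's implementation.

```python
import numpy as np

def gaussian_kernel_prod(u, h):
    """K_h(u): Gaussian product kernel with bandwidth h, product over last axis."""
    return np.prod(np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h), axis=-1)

def conditional_density_from_path(Y_path, y, x, m, h):
    """Estimate p(y | x) as in (4) from one long simulated path Y_path, using
    X_t = (Y_{t-1}, ..., Y_{t-m}) as the conditioning vector."""
    X_sim = np.column_stack([Y_path[m - j - 1: len(Y_path) - j - 1] for j in range(m)])
    Y_lead = Y_path[m:]
    ky = gaussian_kernel_prod((Y_lead - y)[:, None], h)   # K_h(Y_t - y)
    kx = gaussian_kernel_prod(X_sim - x, h)               # K_h(X_t - x)
    return np.sum(ky * kx) / np.sum(kx)
```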

Discrete Random Variables. Discrete random variables can be accommodated within our framework. Suppose $y_t$ contains both continuous and discrete random variables. For example, $y_t = (y_{1t}, y_{2t}) \in \mathbb{R}^{k+l}$, where $y_{1t} \in \mathbb{R}^k$ is a continuous random variable while $y_{2t} \in \mathcal{Y}_2 \subseteq \mathbb{R}^l$ is a random variable with a (potentially infinite) number of discrete outcomes, $\mathcal{Y}_2 = \{y_{2,1}, y_{2,2}, \dots\}$. We could then use a mixed kernel to estimate $p_t(y|x)$. For given simulated observations $Y^\theta_{t,i} = (Y^\theta_{1t,i}, Y^\theta_{2t,i})$, $i = 1, \dots, N$:

$$\hat{p}_t(y_1, y_2|x, \theta) = \frac{1}{N} \sum_{i=1}^{N} K_h(Y^\theta_{1t,i} - y_1)\, I\{Y^\theta_{2t,i} = y_2\}, \quad (y_{1t}, y_{2t}) \in \mathbb{R}^{k+l}, \qquad (5)$$

where $I\{\cdot\}$ is the indicator function and $K : \mathbb{R}^k \mapsto \mathbb{R}$ is the kernel from before. However, the resulting simulated log-likelihood will be discontinuous and optimization may be difficult. One

In all three cases, we can write the resulting simulated joint density in equation (7) by choosing $Y^\theta_{2t,i} = K^{(1)}(\tilde{Z}_{t,i}, y_{2t}|x_t)$, $Y^\theta_{2t,i} = K_b^{(2)}(Z_{t,i}, y_{2t})$, and $Y^\theta_{2t,i} = \bar{K}_b^{(2)}(Z_{t,i}, y_{2t})$, respectively. Here, $\theta \mapsto Y^\theta_{2t,i}$ is smooth, with a bias that disappears as $b \to 0$ and a variance that is bounded in $b$. Thus, the order of the variance of $\hat{L}_T(\theta)$ is not affected by any added discrete variables, and the curse of dimensionality remains of order $k = \dim(y_{1t})$.
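For reference, a small illustrative implementation of the basic, unsmoothed mixed estimator (5): the continuous block is smoothed with a Gaussian product kernel while the discrete block is matched exactly. The array names are ours, not the paper's.

```python
import numpy as np

def mixed_kernel_density(Y1_sim, Y2_sim, y1, y2, h):
    """p_hat(y1, y2 | x, theta) = (1/N) sum_i K_h(Y1_i - y1) * 1{Y2_i = y2},
    with Y1_sim of shape (N, k) continuous and Y2_sim of shape (N, l) discrete."""
    k = Y1_sim.shape[1]
    Kh = np.exp(-0.5 * np.sum(((Y1_sim - y1) / h) ** 2, axis=1)) \
         / (np.sqrt(2.0 * np.pi) * h) ** k
    match = np.all(Y2_sim == y2, axis=1)     # indicator of the discrete outcome
    return np.mean(Kh * match)
```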

Quasi Maximum Likelihood Estimation. The use of our approximation method is not limited to actual MLEs. In many situations, one can define a quasi- or pseudo-likelihood which, even though it is not the true likelihood, identifies the parameters of the true model. One obvious example of this is the standard regression model, where the MLE based on Gaussian errors (i.e. the least-squares estimator) proves to be robust to deviations from the normality assumption. Another example is estimation of (G)ARCH models using quasi-maximum likelihood, e.g. Lee and Hansen (1994). These are cases where the quasi-likelihood can be written explicitly. If one cannot find explicit expressions for the quasi-likelihood, one can instead employ our estimator, simulating from the quasi-model: Suppose for example that data has been generated by the model (1), but the data-generating distribution of the errors is unknown. We could then choose a suitable distribution $F_\varepsilon$, draw $\{\varepsilon_i\}_{i=1}^N$ from $F_\varepsilon$, and then proceed as in Section 2.1. The resulting estimator would no longer be a simulated MLE but rather a simulated QMLE. In this setting, the asymptotic distribution should be adjusted to accommodate the fact that we are not using the true likelihood to estimate the parameters. This obviously extends to the case of misspecified models as in White (1984).

The above procedure is one example of how our simulation method can be applied to non- and semiparametric estimation problems where an infinite-dimensional component of the model is unknown. Another example is the situation where data has been generated by the model (1) with known distribution $F_\varepsilon$, but now $\theta$ contains both a finite-dimensional and an infinite-dimensional component. An application of our method in this setting can be found in Kristensen (2010), where the infinite-dimensional component is a density. Again, our asymptotic results have to be adjusted to allow for $\theta$ to contain infinite-dimensional parameters.

3 Asymptotic Properties of NPSMLE

Given the convergence of the simulated conditional density towards the true one, we expect that the NPSMLE $\hat{\theta}$ based on the simulated kernel density estimator will have the same asymptotic properties as the infeasible MLE $\tilde{\theta}$ for a suitably chosen sequence $N = N(T)$ and $h = h(N)$. We give two sets of results: The first establishes that $\hat{\theta}$ is first-order asymptotically equivalent to $\tilde{\theta}$ under general conditions, allowing for nonstationarity. Under additional assumptions, including stationarity, we derive expressions for the leading bias and variance components of $\hat{\theta}$ relative to the actual MLE due to simulations and kernel smoothing, and give results for the higher-order asymptotic properties of $\hat{\theta}$.

We allow for a mixed discrete and continuous distribution of the response variable, and write $y_t = (y_{1t}, y_{2t}) \in \mathcal{Y}_1 \times \mathcal{Y}_2$, where $\mathcal{Y}_1 \subseteq \mathbb{R}^k$ and $\mathcal{Y}_2 = \{y_{2,1}, y_{2,2}, \dots\} \subseteq \mathbb{R}^l$. Here, $y_{1t}$ has a continuous distribution, while $y_{2t}$ is discrete. The joint distribution can be written as $p_t(y_1, y_2|x, \theta) = p_t(y_2|y_1, x, \theta)\, p_t(y_1|x, \theta)$, where the $p_t(y_{2,i}|y_1, x, \theta)$ are conditional probabilities satisfying $\sum_i p_t(y_{2,i}|y_1, x, \theta) = 1$, while $p_t(y_1|x, \theta)$ is a conditional density w.r.t. the Lebesgue measure. Also, let $p_t(y_{2,i}|x, \theta)$ denote the conditional probabilities of $y_{2t}|x_t = x$. The asymptotics are derived for the kernel estimator given in equation (7), where

$$Y^\theta_{1t,i} := g_{1,t}(x_t, \varepsilon_i, \theta), \qquad (11)$$

$$Y^\theta_{2t,i} := g_{2,t}(y_{2t}, x_t, \varepsilon_i, \theta), \qquad (12)$$

for $i = 1, \dots, N$ and $t = 1, \dots, T$, where $\{\varepsilon_i\}_{i=1}^N$ are i.i.d. draws from $F_\varepsilon$, such that equation (6) holds. Recall that $Y^\theta_{2t,i}$ denotes a simulated value of the associated density, and not the outcome of the dependent variable. The condition in equation (6) is met when $Y^\theta_{2t,i} = K^{(1)}(\tilde{Z}_{t,i}, y_{2t}|x_t)$ with $K^{(1)}$ given in equation (8), while it only holds approximately for $K^{(2)}$ and $\bar{K}^{(2)}$ defined in equations (9) and (10), due to biases induced by the use of kernel smoothing. We handle these two cases in Theorem 3.4, where results for approximate simulations are given.

Note that we here use the same errors to generate the simulations over time. An alternative simulation scheme would be to draw a new batch of errors for each observation $x_t$, $Y^\theta_{t,i} = g_t(x_t, \varepsilon_{t,i}, \theta)$, $i = 1, \dots, \tilde{N}$, such that the total number of simulations would be $\tilde{N} \times T$, $\{\varepsilon_{i,t}\}_{i=1}^{\tilde{N}}$, $t = 1, \dots, T$. Under regularity conditions, the NPSMLE based on this simulation scheme would have similar asymptotic properties as the one based on the simulations in equations (11) and (12). However, as demonstrated in Lee (1992), choosing $N = \tilde{N} T$, the variance of the NPSMLE based on equations (11) and (12) will be smaller. (Footnote 6: The results of Lee (1992) are for discrete choice models, but we conjecture that his results can be extended to general simulated MLE.)

In order for $\hat{\theta}$ to be asymptotically equivalent to $\tilde{\theta}$, we need $\hat{p} \overset{P}{\to} p$ sufficiently fast in some suitable function norm. To establish this, we verify the general conditions for uniform rates of kernel estimators found in Kristensen (2009). These general conditions are satisfied under the following set of regularity conditions regarding the model and its associated conditional density:

A.1 The functions $(x, t, \theta) \mapsto g_{1,t}(x, \varepsilon, \theta)$ and $(x, t, \theta) \mapsto g_{2,t}(y_2, x, \varepsilon, \theta)$ are continuously differentiable for all $y_2$ and $\varepsilon$, such that for some function $\Lambda(\cdot)$ and constants $\rho_{i,j} \ge 0$, $i, j = 1, 2$,

$$\|g_{1,t}(x, \varepsilon, \theta)\| \le \Lambda(\varepsilon)\left[1 + \|x\|^{\rho_{1,1}} + t^{\rho_{1,2}}\right], \qquad \|g_{2,t}(y_2, x, \varepsilon, \theta)\| \le \Lambda(\varepsilon)\left[1 + \|x\|^{\rho_{2,1}} + t^{\rho_{2,2}}\right],$$

and $E[\Lambda(\varepsilon)^s] < \infty$ for some $s > 2$. The derivatives of $g_1$ and $g_2$ w.r.t. $(x, t, \theta)$ satisfy the same bounds.

A.2 The conditional density $p_t(y_1, y_2|x, \theta)$ is continuous w.r.t. $\theta \in \Theta$, and $r \ge 2$ times continuously differentiable w.r.t. $y_1$ with the $r$-th derivative being uniformly continuous. There exists

$\int_{\mathbb{R}^k} K(u)\, du = 1$ and, for some $r \ge 1$: $\int_{\mathbb{R}^k} K(u)\, u^{\alpha}\, du = 0$ for $1 \le |\alpha| \le r - 1$, and $\int_{\mathbb{R}^k} K(u)\, \|u\|^{r}\, du < \infty$.

K.2 The first and second derivatives of $K$ also satisfy K.1.1.

This is a broad class of kernels allowing for unbounded support. For example, the Gaussian kernel satisfies K.1 with $r = 2$. When $r > 2$, $K$ is a so-called higher-order kernel that reduces the bias of $\hat{p}$ and its derivatives, and thereby obtains a faster rate of convergence. The smoothness of $p$, as measured by its number of derivatives $r$, determines the degree of bias reduction. The additional assumption K.2 is used in conjunction with Assumptions A.3 and A.4 to show that the first and second derivatives of $\hat{p}$ w.r.t. $\theta$ also converge uniformly.
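As one concrete example of a higher-order kernel of the kind allowed by K.1 (our illustration; the paper does not prescribe a particular kernel), the fourth-order Gaussian kernel below integrates to one, has vanishing second moment, and a finite nonzero fourth moment, so it satisfies K.1 with r = 4:

```python
import numpy as np

def gaussian_kernel_4th_order(u):
    """Fourth-order Gaussian kernel K(u) = 0.5 * (3 - u^2) * phi(u) on R.
    Moments: integral K = 1, integral u^2 K = 0, integral u^4 K != 0."""
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u ** 2) * phi

# For y_t in R^k, the product kernel prod_j K(u_j) inherits these moment
# properties coordinate-wise.
```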

Next, we impose regularity conditions on the model to ensure that the actual MLE is asymptotically well-behaved. We first introduce the relevant terms driving the asymptotics of the MLE, normalizing the log-likelihood by some factor $\nu_T \to \infty$:

$$L_T(\theta) = \frac{1}{\nu_T} \sum_{t=1}^{T} \log p_t(y_t|x_t, \theta).$$

This normalizing factor $\nu_T$ is introduced to ensure that $L_T(\theta)$ is well-behaved asymptotically and that certain functions of data are suitably bounded, c.f. C.1-C.4 below. It is only important for the theoretical derivations, and not relevant for the actual implementation of our estimator since $\nu_T$ does not depend on $\theta$. The choice of $\nu_T$ depends on the dynamics of the model. The standard choice is $\nu_T = T$, as is, for example, the case when the model is stationary. In order to allow for non-standard behavior of the likelihood due to, for example, stochastic and deterministic trends, we do not impose this restriction though. We also redefine the simulated version of the likelihood: In order to obtain uniform convergence of $\log \hat{p}_t(y|x, \theta)$, we need to introduce trimming of the approximate log-likelihood, as is standard in the literature on semiparametric estimators. The trimmed and normalized version of the simulated log-likelihood is given as

$$\hat{L}_T(\theta) = \frac{1}{\nu_T} \sum_{t=1}^{T} \tau_a(\hat{p}_t(y_t|x_t, \theta)) \log \hat{p}_t(y_t|x_t, \theta),$$

where $\tau_a(\cdot)$ is a continuously differentiable trimming function satisfying $\tau_a(z) = 1$ if $|z| > a$ and $\tau_a(z) = 0$ if $|z| < a/2$, with a trimming sequence $a = a(N) \to 0$. One could simply use the indicator function for the trimming, but then $\hat{L}_T(\theta)$ would no longer be differentiable, and differentiability is useful when using numerical optimization algorithms to solve for $\hat{\theta}$.
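One concrete, merely illustrative choice of such a trimming function uses a cubic smoothstep on the transition band [a/2, a]; the paper only requires continuous differentiability and the two plateaus, not this particular functional form.

```python
import numpy as np

def trim(z, a):
    """Continuously differentiable trimming: 0 for |z| < a/2, 1 for |z| > a,
    with a cubic smoothstep in between."""
    s = np.clip((np.abs(z) - a / 2.0) / (a / 2.0), 0.0, 1.0)
    return s * s * (3.0 - 2.0 * s)

# Trimmed contribution of observation t to the simulated log-likelihood:
# trim(p_hat_t, a) * np.log(p_hat_t)
```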

Assuming that $L_T(\theta)$ is three times differentiable, c.f. Assumption C.3 below, we can define:

$$S_T(\theta) = \frac{\partial L_T(\theta)}{\partial \theta} = \frac{1}{\nu_T} \sum_{t=1}^{T} \frac{\partial \log p_t(y_t|x_t, \theta)}{\partial \theta} \in \mathbb{R}^d,$$

$$H_T(\theta) = \frac{\partial^2 L_T(\theta)}{\partial \theta \partial \theta'} = \frac{1}{\nu_T} \sum_{t=1}^{T} \frac{\partial^2 \log p_t(y_t|x_t, \theta)}{\partial \theta \partial \theta'} \in \mathbb{R}^{d \times d},$$

$$G_{T,i}(\theta) = \frac{\partial^3 L_T(\theta)}{\partial \theta \partial \theta' \partial \theta_i} = \frac{1}{\nu_T} \sum_{t=1}^{T} \frac{\partial^3 \log p_t(y_t|x_t, \theta)}{\partial \theta \partial \theta' \partial \theta_i} \in \mathbb{R}^{d \times d}.$$

The information is then defined as:

$$i_T(\theta) = \frac{1}{\nu_T} \sum_{t=1}^{T} E\left[\frac{\partial \log p_t(y_t|x_t, \theta)}{\partial \theta} \frac{\partial \log p_t(y_t|x_t, \theta)}{\partial \theta'}\right] = -E[H_T(\theta)] \in \mathbb{R}^{d \times d}.$$

We also define the diagonal matrix $I_T(\theta) = \mathrm{diag}\{i_T(\theta)\} \in \mathbb{R}^{d \times d}$, where $\mathrm{diag}\{i_T(\theta)\}$ denotes the diagonal elements of the matrix $i_T(\theta)$, and

$$U_T(\theta) = I_T^{-1/2}(\theta)\, S_T(\theta), \qquad V_T(\theta) = I_T^{-1/2}(\theta)\, H_T(\theta)\, I_T^{-1/2}(\theta), \qquad W_{T,i}(\theta) = I_T^{-1/2}(\theta)\, G_{T,i}(\theta)\, I_T^{-1/2}(\theta). \qquad (14)$$

With $I_T \equiv I_T(\theta_0)$, we then impose the following conditions on the actual log-likelihood function and the associated MLE, which ensure consistency and a well-defined asymptotic distribution of the actual MLE, $\tilde{\theta}$:

C.1 The parameter space is given by a sequence of local neighborhoods,

$$\Theta_T = \left\{\theta : \|I_T^{1/2}(\theta - \theta_0)\| \le \delta\right\} \subseteq \mathbb{R}^d,$$

for some $\delta > 0$ with $I_T^{-1} = O_P(1)$.

C.2 For any $\delta > 0$, there exists an $\eta > 0$ such that

$$\lim_{T \to \infty} P\left(\sup_{\|I_T^{1/2}(\theta - \theta_0)\| > \delta} \{L_T(\theta_0) - L_T(\theta)\} \ge \eta\right) = 1.$$

C.3 $L_T(\theta)$ is three times continuously differentiable with its derivatives satisfying:

1. $\left(\sqrt{\nu_T}\, U_T(\theta_0), V_T(\theta_0)\right) \overset{d}{\to} (S_\infty, H_\infty)$ with $H_\infty < 0$ a.s.;

2. $\max_{j=1,\dots,d} \sup_{\theta \in \Theta_T} \|W_{j,T}(\theta)\| = O_P(1)$.

C.4 The following bounds hold for some $\gamma, q > 0$:

1. $\sup_{\theta \in \Theta_T} \nu_T^{-q} \sum_{t=1}^{T} |\log p_t(y_t|x_t, \theta)|^{1+\gamma} = O_P(1)$;

2. $\nu_T^{-q} \sum_{t=1}^{T} \|x_t\|^{1+\gamma} = O_P(1)$ and $\nu_T^{-q} \sum_{t=1}^{T} \Lambda^2(\varepsilon_t) = O_P(1)$.

The above conditions C.1-C.4 are generalized versions of the conditions normally required for consistency and asymptotic normality of MLEs in stationary and ergodic models. For general non-ergodic models, simple conditions for C.2-C.4 are not available and they have to be verified

will both be satisfied. However, a large value of $q$ implies that we have to use a larger number of simulations for the NPSMLE to be asymptotically equivalent to the MLE, c.f. B.1 and B.2 below.

As an example of non-standard asymptotics of the MLE, consider a linear error-correction model, $\Delta y_t = \alpha \beta' y_{t-1} + \Omega^{1/2} \varepsilon_t$, where $\varepsilon_t \sim N(0, I_k)$. We can split the parameter vector into short-run parameters, $\theta_1 = (\alpha, \mathrm{vech}(\Omega))$, and long-run parameters, $\theta_2 = \beta$. The MLE $\tilde{\theta}_1$ converges with $\sqrt{T}$-speed towards a normal distribution, while $\tilde{\theta}_2$ is superconsistent, with $T(\tilde{\theta}_2 - \theta_2)$ converging towards a Dickey-Fuller type distribution. In this case, we choose $\sqrt{\nu_T} = \sqrt{T}$, and so $i_T(\theta_0)$, and therefore $I_T$, is not asymptotically constant. As demonstrated in Saikkonen (1995), this model satisfies C.2-C.4. Furthermore, $x_t = y_{t-1}$ satisfies $T^{-2} \sum_{t=1}^{T} \|x_t\|^{1+\gamma} = O_P(1)$, so we can choose $q = 2$. We also refer to Park and Phillips (2001) and Kristensen and Rahbek (2010), where C.2-C.4 are verified for some non-linear, non-stationary models.

We impose the following restrictions on how the bandwidth $h$ and trimming sequence $a$ can converge to zero in conjunction with $N, T \to \infty$:

B. With $q, \gamma > 0$ given in Condition C.4, let $\rho_k = \rho_{0,k} + \rho_{1,k} + \rho_{2,k}$, $k = 1, 2$, where $\rho_{i,1}, \rho_{i,2} \ge 0$, $i = 0, 1, 2$, are given in Assumptions A.1 and A.2, and for some $\varpi > 0$:

1. $|\log a|^q \nu_T^{-1} N^{-(1+\varpi)} \to 0$; $|\log(4a)|^q \nu_T^{-1} \to 0$; $T \nu_T^{-1} a^{-1}\left[N^{\rho_1} + T^{\rho_2}\right] \log(N)/\sqrt{N h^k} \to 0$; and $T \nu_T^{-1} a^{-1}\left[N^{\rho_{0,1}} + T^{\rho_{0,2}}\right] h^r \to 0$.

2. $|\log a|^q \nu_T N^{-(1+\varpi)} \to 0$; $|\log(4a)|^q \nu_T^{-1/2} \to 0$; $T \nu_T^{-1/2} a^{-1}\left[N^{\rho_1} + T^{\rho_2}\right] \log(N)/\sqrt{N h^k} \to 0$; and $T \nu_T^{-1/2} a^{-1}\left[N^{\rho_{0,1}} + T^{\rho_{0,2}}\right] h^r \to 0$.

Condition B.1 is imposed when showing consistency of the NPSMLE, while B.2 will imply that the NPSMLE has the same asymptotic distribution as the MLE. The parameter $\varpi > 0$ can be chosen freely. We observe that large values of $q$ and/or $\rho_1, \rho_2$ imply that $N$ has to diverge at a faster rate relative to $T$. In practice, this means that a larger number of simulations has to be used for a given $T$ to obtain a precise estimate.

The joint requirements imposed on $a$, $h$ and $N$ are fairly complex, and it is not obvious how to choose these nuisance parameters for a given sample size $T$. This is a problem shared by, for example, semiparametric estimators that rely on a preliminary kernel estimator. We refer to Ichimura and Todd (2007) for an in-depth discussion of these matters. Fortunately, our simulation results indicate that standard bandwidth selection rules, together with a bit of undersmoothing, in general deliver satisfactory results.

Our strategy of proof is based on some apparently new results for approximate estimators, c.f. Appendix A. In particular, Theorems A.4 and A.5 establish that the NPSMLE and the MLE will be asymptotically first-order equivalent if $\hat{L}_T(\theta)$ converges uniformly towards $L_T(\theta)$ at a sufficiently fast rate. This makes our proofs considerably less burdensome than those found in other studies of simulation-based estimators, e.g. Fermanian and Salanié (2004) and Altissimo and Mele (2009), since we do not need to analyze the simulated score and Hessian.

Theorem 3.1. Assume that Assumptions A.1, A.2, and K.1 hold. Then the NPSMLE $\hat{\theta}$ based on (7) satisfies:

(i) Under Conditions C.1, C.2, and C.4: $I_T^{1/2}(\hat{\theta} - \theta_0) = o_P(1)$ for any sequences $N \to \infty$ and $h, a \to 0$ satisfying B.1.

(ii) Under Conditions C.1, C.3, and C.4: $\sqrt{\nu_T}\, I_T^{1/2}(\hat{\theta} - \theta_0) \overset{d}{\to} -H_\infty^{-1} S_\infty$ for any sequences $N \to \infty$ and $h, a \to 0$ satisfying B.2.

When the data generating process is stationary and ergodic, the following more primitive conditions can be shown to imply C.1-C.4:

Corollary 3.2. Assume that $(y_t, x_t)$ is stationary and ergodic, and that Assumptions A.1, A.2, K.1, and B.1 hold with $q = 1$, $\nu_T = T$, $\rho_2 = \rho_{0,2} = 0$, and:

(i) $E[\|x_t\|^{1+\gamma}] < \infty$, $|\log p(y|x, \theta)| \le b_1(y|x)$, $\forall \theta \in \Theta$, with $E\!\left[b_1(y_t|x_t)^{1+\gamma}\right] < \infty$ and $\Theta$ compact;

(ii) $E[\log p(y_t|x_t, \theta)] < E[\log p(y_t|x_t, \theta_0)]$, $\forall \theta \ne \theta_0$.

Then $\hat{\theta} \overset{P}{\to} \theta_0$. If furthermore B.2 holds with $q = 1$, $\nu_T = T$, and $\rho_2 = \rho_{0,2} = 0$, together with:

(iii) $i(\theta_0) = E\!\left[\frac{\partial \log p(y_t|x_t, \theta_0)}{\partial \theta} \frac{\partial \log p(y_t|x_t, \theta_0)}{\partial \theta'}\right]$ exists and is nonsingular;

(iv) $\left\|\frac{\partial^2 \log p(y|x, \theta)}{\partial \theta \partial \theta'}\right\| \le b_2(y|x)$ uniformly in a neighborhood of $\theta_0$, with $E[b_2(y_t|x_t)] < \infty$;

then $\sqrt{T}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, i(\theta_0)^{-1})$.

If for simplicity we set $\varpi = 0$ in B.2 and disregard the conditions on the trimming parameter $a$, then roughly speaking the NPSMLE will be first-order equivalent to the exact MLE in the stationary case if $T h^{2r} \to 0$ and $T/(N h^k) \to 0$, reflecting the bias and variance due to kernel smoothing and simulations. The variance requirement seems to indicate that the usual curse of dimensionality inherent in kernel density estimation is present. However, this is caused by the initial error bounds used to establish Corollary 3.2 being overly conservative. In the following we obtain more precise error bounds which show that the curse of dimensionality is significantly less severe. Moreover, these refined error bounds allow us to better gauge which additional biases and variances the NPSMLE suffers from due to simulations and kernel smoothing. These can potentially be used to adjust confidence bands based on the NPSMLE to take into account the additional simulation errors.

Since the higher-order analysis involves the first and second derivatives of $\hat{L}_T(\theta)$, we have to invoke the additional smoothness conditions on $g$ and $p$ stated in Assumptions A.3 and A.4. Under

independent variables, where $Z_1 \sim N(0, i(\theta_0))$ is the variance component of the observed data, while $Z_2 \sim N(0, \mathrm{Var}(\psi_2(\varepsilon_i)))$ is the variance component of the simulations. The variance of $Z_2$ is given by

$$\psi_2(\varepsilon_i) = E\!\left[\frac{\dot{Y}^{\theta_0}_{2t,i}}{p(y_{2t}|x_t)} \,\Big|\, \varepsilon_i\right] - E\!\left[\frac{s(Y^{\theta_0}_{1t,i}, y_{2t}|x_t)\, Y^{\theta_0}_{2t,i}}{p(y_{2t}|x_t)} \,\Big|\, \varepsilon_i\right],$$

where $s(y_1, y_2|x)$ denotes the score at $\theta = \theta_0$. The second-order term also contains a bias component, which all non-linear, simulation-based estimators suffer from,

$$\sqrt{T}\, \nabla^2 S_{T,N}[\hat{p} - p, \hat{p} - p] \simeq \frac{\sqrt{T}}{N h^{k+1}}\, \Delta_2 + O_P\!\left(\sqrt{T}\, h^{2r}\right),$$

while the remainder term is of a lower order,

$$\sqrt{T}\, R_{T,N} = O_P\!\left(\frac{\sqrt{T}}{N h^{k+1}}\right) + O_P\!\left(\sqrt{T}\, h^{3r}\right).$$

The two leading bias terms in the above expressions, $\Delta_1$ and $\Delta_2$, are given by:

$$\Delta_1 = \sum_{|\alpha|=r} \int\!\!\int \left[\frac{\partial^{|\alpha|+1} p(y_t|x_t, \theta)}{\partial \theta\, \partial y_t^{\alpha}} - s(y_t|x_t, \theta)\, \frac{\partial^{|\alpha|} p(y_t|x_t, \theta)}{\partial y_t^{\alpha}}\right] p(x_t)\, dx_t\, dy_t \in \mathbb{R}^d, \qquad (22)$$

$$\Delta_2 = E\!\left[\frac{\dot{Y}^{\theta_0}_{1t,i}\, Y^{\theta_0}_{2t,i}\, \nabla^2 p\!\left(Y^{\theta_0}_{1t,i}, y_{2t}|x_t\right)}{p(y_{2t}|x_t)}\right] \int K(v)\, K^{(1)}(v)\, dv \in \mathbb{R}^d, \qquad (23)$$

where $p(x_t)$ denotes the marginal density of $x_t$. This shows that the overall bias of the estimator due to the use of simulations and kernel smoothing is $i^{-1}(\theta_0)\left[h^r \Delta_1 + \frac{1}{N h^{k+1}} \Delta_2\right]$, while an additional variance term relative to the exact MLE shows up and is given by $T/N \cdot i^{-1}(\theta_0)\, \mathrm{Var}(\psi_2(\varepsilon_i))\, i^{-1}(\theta_0)$. Thus, if $\sqrt{T}\, h^r \to 0$ and $\sqrt{T}/(N h^{k+1}) \to 0$, all bias terms vanish and $\sqrt{T}(\hat{\theta} - \theta_0)$ follows a normal distribution centered around zero. If furthermore $T/N \to 0$, no additional variance will be present and the NPSMLE is first-order equivalent to the true MLE. On the other hand, if either $\sqrt{T}\, h^r$ or $\sqrt{T}/(N h^{k+1})$ does not vanish, a bias term will be present and the asymptotic distribution will not be centered around zero. Also, if $T/N \nrightarrow 0$, there will be an increase in variance due to the presence of $Z_2$. One could potentially reduce (or even remove) some of these bias and variance components by employing the techniques of Kristensen and Salanié (2010), who develop higher-order improvements of simulation-based estimators. We collect the results in the following theorem:

Theorem 3.3. Assume that:

(i) $\{(y_t, x_t)\}$ is stationary and $\beta$-mixing with geometrically decreasing mixing coefficients;

(ii) A.1-A.4 and K.1-K.2 hold;

(iii) (i)-(iv) of Corollary 3.2 hold;

(iv) $x_t$ is bounded and $\inf_{y_1, y_2, x, \theta} p(y_1, y_2|x, \theta) > 0$.

Then, if $\sqrt{T}\, h^r \to c_1 \ge 0$, $\sqrt{T}/(N h^{k+1}) \to c_2 \ge 0$ and $T/N \to c_3 \ge 0$,

$$\sqrt{T}(\hat{\theta} - \theta_0) \overset{d}{\to} N\!\left(c,\; i^{-1}(\theta_0)\left[i(\theta_0) + c_3\, \mathrm{Var}(\psi_2(\varepsilon_i))\right] i^{-1}(\theta_0)\right),$$

where $c = i^{-1}(\theta_0)\{c_1 \Delta_1 + c_2 \Delta_2\}$, with $\Delta_1$ and $\Delta_2$ as in equations (22) and (23).

The requirement (iv) is only imposed to simplify the proofs, which otherwise would get overly long and complicated. We expect that the above result will still hold with the requirement (iv) replaced by additional restrictions on the trimming parameter $a$. For the case where an unbiased estimator of the density is available and a new batch of simulations is used for each observation, Lee (1999) derives results similar to Theorem 3.3.

Estimation of Asymptotic Distribution. To do any finite-sample inference, an estimator of the asymptotic distribution is needed. A general Monte Carlo method would be to simulate a large number of independent, long trajectories from the model and, for each trajectory, compute the corresponding score and Hessian at $\theta = \hat{\theta}$. This would yield an approximation of the limiting distribution, $-H_\infty^{-1} S_\infty$. The computation of the score and Hessian can be done in several ways. If the model satisfies Assumption A.3, the estimators of the score and Hessian given in equation (17) and the proof of Theorem 3.3 are available. In the general case, a simple approach is to use numerical derivatives. Define:

$$\frac{\partial \hat{p}_t(y|x, \theta)}{\partial \theta_k} = \frac{\hat{p}_t(y|x, \theta + \delta e_k) - \hat{p}_t(y|x, \theta - \delta e_k)}{2\delta},$$

where $e_k$ is the $k$th column of the identity matrix and $\delta > 0$ is a step size. We have:

$$\frac{\partial \hat{p}_t(y|x, \theta)}{\partial \theta_k} - \frac{\partial p_t(y|x, \theta)}{\partial \theta_k} = \frac{\hat{p}_t(y|x, \theta + \delta e_k) - p_t(y|x, \theta + \delta e_k)}{2\delta} - \frac{\hat{p}_t(y|x, \theta - \delta e_k) - p_t(y|x, \theta - \delta e_k)}{2\delta} + \left[\frac{p_t(y|x, \theta + \delta e_k) - p_t(y|x, \theta - \delta e_k)}{2\delta} - \frac{\partial p_t(y|x, \theta)}{\partial \theta_k}\right].$$

Now letting $\delta = \delta(N) \to 0$ as $N \to \infty$ at a suitable rate, all three terms are $o_P(1)$. A consistent estimator of the second derivative can be obtained in a similar fashion. These can in turn be used to construct estimators of the information and score.
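A hedged sketch of this numerical-derivative approach (our own illustration, with hypothetical helper names): central differences of the simulated log-density in theta, computed while reusing the same simulation draws, whose outer products averaged over observations give an estimate of the information matrix.

```python
import numpy as np

def numerical_score(log_p_hat, theta, delta=1e-4):
    """Central-difference approximation of d log p_hat / d theta for one observation;
    log_p_hat(theta) must reuse the same simulation draws for every theta."""
    d = len(theta)
    score = np.zeros(d)
    for k in range(d):
        e = np.zeros(d)
        e[k] = delta
        score[k] = (log_p_hat(theta + e) - log_p_hat(theta - e)) / (2.0 * delta)
    return score

def information_estimate(log_p_hat_per_obs, theta, delta=1e-4):
    """Average outer product of the numerical scores over t = 1, ..., T."""
    S = np.array([numerical_score(f, theta, delta) for f in log_p_hat_per_obs])
    return S.T @ S / len(S)
```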

Approximate Simulations. In many cases, the model in (1) is itself intractable, such that one cannot directly simulate from the exact model. Suppose that one, on the other hand, has an approximation of the model at one's disposal. For example, solutions to dynamic programming problems