Idea Transcript
AI4c 9o707 BAYESIAN MODEL CHGICE: ASYMPTOTICS AND EXACT CALCULATIONS Accesion For
by Alan E. Gelfand
NTIS
CRA&I
DlIC
TAB
UIannotuiced
D.K. Dey
Jwstilcx+tion By
TECHNICAL REPORT No. 470 JUNE 15, 1993
..................
D.t ibtlon/ Availability Codes SAvail oii~d/or
Dist Prepared Under
N0001492J1264
Contract
(NR042267)
FOR THE OFFICE OF NAVAL RESEARCH Professor Herbert Solomon, Project Director Reproduction in whole or in part is permitted for any purpose of the United States Government.
Approved for public release; distribution unlimited
DEPARTMENT OF STATISTICS STANFORD
UNIVERSITY
STANFORD, CALIFORNIA
943054065
Special A
Bayesian Model Choice: Asymptotics and Exact Calculations A.E. Gelfand and D.K. Dey*
SUMMARY Model determination is a fundamental data analytic task. Here we consider the problem of choosing amongst a finite (with loss of generality we assume two) set of models. After briefly reviewing classical and Bayesian model choice strategies we present a general predictive density which includes all proposed Bayesian approaches we are aware of. Using Laplace approximations we can conveniently assess and compare asymptotic behavior of these approaches. Concern regarding the accuracy of these approximation for small to moderate sample sizes encourages the use of Monte Carlo techniques to carry out exact calculations. A data set fit with nested non linear models enables comparison between proposals and between exact and asymptotic values.
Key words: Bayes factor, Laplace approximations, Likelihood ratio statistics, Monte Carlo methods.
2
1. INTRODUCTION Model determination is a fundamental data analytic task. It is also a complex matter influenced by intended use for the model as well as perspective regarding the class
of models being entertained. Thus, it is not surprising that no widely accepted strategy for building models has emerged. Here we consider a much less ambitious problem, that of choosing, within a Bayesian modeling framework, amongst a finite specified set of models. Such selection is based upon predictive distributions and we provide a general predictive formulation which includes all proposed solutions we are aware of. By employing Laplace approximations (see, e.g., Lindley 1980, Tierney & Kadane 1986) we can carry out asymptotic calculations to conveniently assess and compare asymptotic behavior of these proposals and develop tieins with classical likelihood ratio based strategies. We shall then describe how simulation based techniques can be employed to perform exact calculations. In this regard, we wish to use the output of recently discussed Markov Chain Monte Carlo methods for Bayesian computation which supply samples essentially from the posterior distribution. We shall also present some comparison between proposals, and between exact and asymptotic values through the fitting of nested non linear models to a data set consisting of 57 points given in Bates and Watts (1988). The literature on Bayesian model choice is considerable by now. It begins with the formal Bayes approach which, in the case of two models, results in the Bayes factor. Subsequent work has proposed modified Bayes factors. A current, reasonably thorough review appears in Gelfand, Dey and Chang (1992) and its attendant discnssion. Our hope in this paper is to achieve some unification of this substantial literature. We note that all of this work presumes that choice is made by reducing each model to a single summary number and then comparing these numbers. Such severe data reduction only permits model comparison in aggregate; observation or caselevel diagnostics enable a clearer comparison of model performance. For work of this sort in the Bayesian context see Geisser 1988, Pettit and Young 1990, Gelfand, Dey & Chang (1992). The outline of the paper is thus the following. In Section 2 we review the behavior of the likelihood ratio statistic and well known adjustments to it. In Section 3 we summarize the Bayesian view of model choice through predictive distributions and the Bayes factor. This motivates a general definition of predictive densities in Section 4 which includes known cases in the literature and yields a variety of alternative Bayes factors. Laplace method asymptotics for general predictive densities are discussed in Section 5 and illustrated for various Bayes factors in Section 6. Concern for the a'cira'y of Lhe Laplace
3
approximations as well as the limitations of these approximations leads us, in Section 7, to seek arbitrarily accurate exact calculations using Monte Carlo methods. Approximate and exact calculations are applied to a nested pair of nonlinear models in Section 8. We conclude with a few summary remarks in Section 9. 2. CLASSICAL APPROACHES In what follows, we shall assume a choice between two parametric models denoted interchangeably by joint density f(y Ii; Mi) or likelihood L(ýi; y, Mi), i = 1, 2 where y is nil and 9. is pil. Since the model choice techniques we consider reduce models to single summary numbers overall selection can be made through such pairwise comparision. Also, in practice models are typically considered in pairs, i.e., model exploration is often evolutionary, modifying a current model to see if improvement ensues. Moreover, classical NeymanPearson theory for testing of models requires pairwise processing. Indeed, a few remarks on classical approaches for model choice seems an appropriate starting point. Informal procedures are generally based upon predictive performance in the form of comparison, in some fashion, of distances between observed values and values predicted under a given model. Occasionally certain optimalities can be ascribed to such procedures. Implementation requires "fitting" the model. Following NeymanPearson theory suppose we create the hypotheses Hi: data y arise from model Mi, i = 1, 2 and set, say H1 as the null hypothesis. If the Mi are completely general there is no optimal test of H1 vs H2 unless both models are fully specified. The formulation of a likelihood ratio test requires an unambiguous specification of a null and alternative hypothesis such as in the nested models case where M1 is the reduced model and M2 is the full model. The likelihood ratio test then takes the form: reject HI if AU < c < 1 where L(0 1 , y, M1 )
L(8 2 ; Y, M ) 2 We assume here and in the sequel a regular case, i.e., the pi remain fixed as n whence, under mild conditions,  2log An is approximately distributed as
2
H1 . With this approximation, inconsistency of the likelihood ratio test arises, that is, lirmP(choose M2 JM1 true) = lim P(An < cIM
Du
fn4 a
1
true)
,
w
under
4
=IimP(2logAn >2log c) =P(X2 namn
27
>  2log c) > 0.
In other words, An tends to be too small; the likelihood ratio test gives too much weight to the full model. As a result numerous authors (see below) have proposed penalizing the likelihood in the form log L(di; y, Mi)  k(n, pi) where k(n, p) > 0, and increasing in n and p. Hence, the full model will be penalized more than the reduced model. We replace log An by the larger quantity log An + k(n, P2 )  k(n, pl). For the
above inconsistency to vanish we need k(n, p2 )  k(n, pl)
a,
as n
4
m. The form
is most common in the literature (though it does not eliminate k(n, p) = mp inconsistency). Values for a in the interval 1 < a < 2.5 appear in e.g. Akaike (1973) and in Bhansali and Downham (1977). Nelder and Wedderburn (1972) suggest r = . Aitkin (1991) suggests
=
log 2.
Choices of k which depend upon
n
and do eliminate
inconsistency include k(n, p) = i log n (Schwarz, 1978), k(n, p) = p fl log (log n), f > 2 (Hannan and Quinn, 1979) and k(n, p) = n log(n + 2p) (Shibata, 1980). 3. THE BAYESIAN FORMULATION The Bayesian model adds a prior specification r(•) to the likelihood specification. Inference is based upon the posterior distribution r(f0y) c L(01, y) •w(0). For Bayesian model choice the two model components may be varied. The case where L is held fixed and i is varied to assess the sensitivity of the posterior to such prior variation is referred to as Bayesian robustness (Berger 1984, 1985). Our intent here is to parallel Section 2, hence to vary L. As such we will assume "noninformative" priors. The formal Bayesian model choice procedure goes as follows. Let wi be the prior probability of Mi, i = 1, 2 and f(yI Mi) the predictive distribution for model Mi, i.e., f(yI Mi) = jf(yI Oi, Mi) " w(i IMi) dOi. f
Tobs
denotes the observed data then we choose the model yielding the larger
wi (YobsI Mi). If wi =
4 we use the Bayes factor (of
M1 with respect to M2 )
(2) f(yobsI M) f(yobs' M2 ) Juffreys (1961), (see also Pettit and Young, 1990) suggests interpretive ranges for BF
the Bayes factor. Foundational arguments (see, e.g., DeGroot, 1970) insist that the only way to compare models is through the "probabilities of these models" and hence, with two models, through the ratio. It is noteworthy that the Bayes factor employs no presumption of nesting and does not require "fitting". In fact, presuming (2) can be calculated, why would one seek alternatives? One criticism is that if ( 9) is improper (as it usually will be under noninformative specification) then f(y) is as well. Hence, we can not interpret the f(yl Mi) as the "probabilities of these models" nor can we interpret the ratio.
Several authors have attempted, primarily in the context of normal data models, to develop a multiplier for BF to overcome this problem (Smith and Spiegelhalter, 1980; Spiegelhalter and Smith, 1982; Pericchi 1984). Pericchi suggests the essence of the problem is that, for a given experiment, the expected increase in information about model parameters varies with the specification of the model and that the multiplier should neutralize this differential. Closely related to this is the fact that even under proper priors with arbitrarily large sample sizes the Bayes factor tends to attach too little weight to the correct model. An illustration is the well known Lindley paradox dating at least to Bartlett (1957). In the nested model case, under usual regularity conditions, we shall show in Section 6 that BF 4 w as n 4m. In other words, regardless of the data, as n grows large, model M1 will be chosen.
The behavior of BF contrasts strikingly with that of An which provides too
much support for M2 ; BF is qualitatively larger than An. This comparison is quantified in an asymptotic sense in Section 6. 4. GENERAL PREDICTIVE DENSITIES The use of predictive distributions in some form has long been recognized as the correct Bayesian approach to model determination. In particular, Box (1980) notes the complementary roles of the posterior and predictive distributions arguing that the posterior is used for "estimation of parameters conditional on the adequacy of the model" while the predictive distribution is used for "criticism of the entertained model in light of the current data". In examining two models, it is clear that the predictive distributions will be comparable while the posteriors will not. Box and others have encouraged a less formal view with regard to Bayesian model choice resulting in alternative predictionist criteria to the Bayes factor. Using cross validation ideas (Stone, 1974; Geisser, 1975) the pseudo Bayes factor (PsBF) arises (Geisser and Eddy, 1979). Aitkin (1991) proposed the posterior Bayes factor (PoBF) while recent work of Berger and Pericchi (1992) introduces the intrinsic Bayes factor (IBF). All
6
are intended to address the aforementioned problems associated with the Bayes factor. The underlying suggestion is that we adopt a broader notion of predictive distributions and densities. In fact, we shall say that a predictive density arises by averaging a density defined over some portion of the sample space (arising from the likelihood) with respect to a distribution on the parameter space (arising from a databased updating of the prior). We assume that the data, y, consists of a set of conditionally independent univariate observations, yj given the parameter 0. However, minor modifications (replacing marginal densities for the yj given
0 with appropriate
conditional densities) permit the handling of more general models. We introduce some notation. Let y , j = 1,...,n be a sequence of independent observations which, under model Mi have density f(yj I i, Mi), i = 1, 2. Let Jn denote the set {1, 2,...,n} and let S be an arbitrary subset of Jn" Define L(Gi; yS, Mi)
n
d.
nfl{f(yj I, Mi)} J
where dj = 1 if jES, = 0 if j 0 S. Finally, let r .(Oi), i = 1, 2 be 1
j=i
.
the prior density for 0. under model Mi with respect to Lebesgue measure. Consider the formal conditional density f(Y I
for
2',S1Mi)) Mi) fL(PyS ,Md)i(ITYS2 )dOi fL(Oi; yS,
Mi) " L(0i; YS2 , Mi)
J=L(Oi; yS
Mi) vi (0 1 1 ) dOi
i(0i)d0i (3)
S1, S2 arbitrary subsets of Jn" The form (3) defines a predictive density which
averages the joint density of
Y1 with respect to the prior for 0. updated by ySS2• We .
take Ys to be a subset of y since for model choice we want a numerical value for (3). Examples of (3) in the literature include: (i)
S1 = Jn' S2 =
(ii)
data. The denominator integral is ignored in this case. S1 = {r}, S2 = Jn  {r} which yields the crossvalidation density f(yrIY(r), Mi)
which yields the standard predictive or marginal density of the
where Y(r) = (Yl' 1Y2 .Yr1'
Yr+l.""Yn), as in Stone (1974) or Geisser (1975).
7
f(yr IY(r), Mi)
evaluated at the observed data is called the conditional predictive n
(iii)
I i) has been The form 11 f(yrly(r), r=1 ~r'M)hsbe proposed as a surrogate for f(y) by Geisser and Eddy (1979). S1 a small subset of J.;, usually two or three elements, S2 = Jn S1 extending
(iv)
the single point deletion in (ii), as in Pefia and Tiao (1992). S1 = Jn, S2 = J. which yields Aitkin's (1991) posterior predictive density. Aitkin
(v)
argues that, unlike f(y) which results from averaging the joint density of y with respect to the prior, one should average with respect to the posterior. S1 = Jn  $2' $2 = 11, 2,...,[pn]} where [.] denotes the greatest integer function.
(vi)
The idea here is that a proportion p of the observations be set aside for prior updating with the remainder to be used for model determination. Such an idea was suggested by Atkinson (1978) and by O'Hagan (1991). S1 = Jn  $2' S2 is a minimal subset (Berger and Pericchi, 1992) i.e., the least
ordinate (CPO), dating to Geisser (1980).
7ri( iyS 2 ) is a proper density. In the regular
number of data points such that
case, the dimension of S2 is fixed regardless of n. Then f(YS1 YS 2 Mi) is proper with, as we will see, with the same asymptotic behavior as f(y; Mi). Notice that, as n . m, (i) and (vi) are qualitatively different from the rest; for (ii) through (v) the cardinality of S2 approaches m as n 4 w. That is, for (ii) through (v), we are averaging against a distribution over the parameter space which, with increasing sample size, places its mass where 9. must be. Hence, it is not surprising that (i) and (vi) exhibit different asymptotic behavior from the rest. The predictive densities (i), (ii), (iv) and (vi) have been employed in the literature to create model selection criteria. Clearly, (i) produces the Bayes factor, BF, given in (2). From (ii), we obtain the pseudoBayes factor, PsBF (Geisser and Eddy, 1979)
•
Fo (i)
n1 f(YrlY(r), MI)
PsBF = r
II f(YrIY(r), M2 ) r we obtain the posterior Bayes factor (Aitkin, 1991), PoBF,
(4)
i
8
PoBF = f(yly, M1 ) f(yIY, M2 )
(5)
Finally, from (vi) we can develop several versions of an intrinsic Bayes factor (IBF) (Berger and Pericchi, 1992). If r is the dimension of the minimal subset and S1 ,I = 1, ,...,(u)
f(ys
indexes the subsets of size r of Jn) then the objects in (vi) are of the form
I ySI, Mi), where Sc denotes the complement of S, relative to Jn. We might
consider the ratio of the averages of such forms, E f(YsCIYS,
I1
1
E f(yScIYS,
M1 )
= BF I M2 )
(f(ys IMI))I I I
E f(Y5s M2 )
7 (7)
or the average of the ratios,
f(yS•]YI,~ MI) I fy
1
1f(ysClys, M2)
f(YS I M2) =
BF. (n)1
(8)
If(YS Ii)
Since (n) can be quite large, Berger and Pericchi suggest instead averaging on the right hand sides of (7) and (8) over a random sample of subsets of size r from Jn" 5. LAPLACE METHOD ASYMPTOTICS We now investigate the asymptotic behavior of (3). Asymptotics are over the sample space through the sampling distribution of estimates of 0i as n 4 w. We employ Laplace method approximations as in Tierney and Kadane (1986) whose form depends upon S1 and ST Let C(S) denote the cardinality of the set S. As n 4® we have three cases
(a)
C(S 1).., C($2) fixed
(b)
C(Sl) fixed, C(S2) 
(c)
C(S 1 ) 0,
C(S 2 ) ,.
Examples (i) and (vi) of section 4 fall into case (a), (ii) and (iii) into case (b); (iv) and (v) into case (c). For case (a) asymptotics apply only to the numerator in (3). For cases (b) and (c) they apply to both numerator and denominator. The basic Laplace approximation is
9
f'e"b(•dG
=
emh(0) (2 ,F)p/2 mP/2IH(,)I
+ O(m)
(9)
where 0 is p%1 with h having unique mode 8 and H(6) is a pip positive definite
matrix such that (H(O)),k = 8^2h(O)/aojDok. For a ratio of integrals with g(O) > 0
Jg(C)emh(O)dO
3
m(h*(e)h(i)) eIH_()I
+O(m)
(10)
where mh*($) = mh(U) + log g(U) with h* having unique mode i?" and H*(O) is pip
positive definite matrix such that (H*(O))k = 02 h*(8)/aOjaOk. We now apply (9) and (10) to (3). In case (a) we use (9) for the numerator of (3) with m = C(SI) and mhi($i) = log L(Oi; yS
2)
1 Let
(11)
M) + log Tr(O).
Mi) + log L(9i,y,
i (S1' S2 ) be the mode of (11). Then for case (a) we have: O(Sl1 1YS 2,
i
L(0i('1,S2); YS 1, Mi)'L( li(S1,S,); YS2, Mi)  ri(ei(S1,S2))
•(27) Pi/2(C(S 1 )) p/21
_H(i(SiS2 ))I i.(f(Ys2; Mi))I.
(12)
(12) has O(n1 ) accuracy. In case (b) we take g(i) = L(9i; ys1. Mi) with m = C(S 2 ) and
mbi(Pi) = log L(Oi; y$
,
(13)
Mi) + log ri(Oi).
Let 'i($2) denote the mode of (13). Applying (10) we have for case (b) f(yS 1YS 2, Mi) 
L( #i(S1,2);Ys1,Mi) L(Oi(S 1 ,S2 ); YS 2 Mi) d L( O(SaTh2a 2 );
(14) has O(n
) accuracy.
Md)
i(Oi(S1,S 2 )) ia(se(Spo)
fIH
1 f(Oi(
1 ,$ 2 ))I'
aiHn (ti(S 2 ))
(1
To handle case (c) assumptions regarding the rates at which the
10
C(Si) •
as
n
'
w
are required.
If, as in examples (iv) and (v), we have
1im nw
C(S1)/C(S2) = k, then the same approximation as in case (b) arises. Suppose the usual regularity conditions hold on the likelihoods, that if
i.
L(O6; y, Mi),
so
is the maximum likelihood estimator of 9. based upon a sample of size n we
have 9.in
i46.,o, for some 10 and n
l(6og L(O;y,Mi)
where
I(8) denotes Fisher's information matrix. Suppose we assume that 0i($,$2) maximizes maximizes (13) with log li(Gi) deleted. (11) with log 7ri(Oi) deleted and that Then, provided C($2) = O(n), we also have
in'
+
(
) where
can be any of
all of these modes differ by Op(n1). 9i(S 1 ,S2 ), BP(S 2 ) or 8i(S1,$2) whence i(S$2), Moreover, if i is continuous fi(.) = Ti(oin) + Op(nl). Finally, H(0i) I(0i), and, in fact,
H(0in) P,
1,n041
asymptotically negligible so
I(9. ).
As for
H*(Oi) 
H*(0i), in case (b),
.n
is
For case
1(i) and also H*(0)
(c), if C(SI) = O(n) then H*(0i) P 21(0i) and also H*(in) replace
L(; ys1, Mi)
21 (8,io).
We can
by • in either H or H* with the same result holding. In fact, substituting
asymptotically equivalent estimators of 0i in (14), still yields O(n1 ) accuracy. 6. ASYMPTOTICS FOR VARIOUS BAYES FACTORS Applying (12) to
M1
and
M2
and using the asymptotics at the end of the
previous section, we obtain BF = L(G1 n;Y, M1 ) 7rl(l' n)D[I HI(^1,n)1
n (P2Pl )/2
L(9'2 ,;y, M2 ) 2092,n) [IH2 (92,nd' whence, with obvious definition of K(01,I, )2,n), p2Pl log BF z log An + = log n + K(0 1 ,n, 0 2,n)
(16)
Expression (15) appears in Kass and Vaidyanathan (1992). Expression (16) precisely reveals the difference in asymptotic behavior between An and F in the nested
11
case; K = 0(1) so the Bayes factor will be consistent but will exhibit Lindley's paradox. Also, apart from K (16) yields Schwarz's (1978) BIC adjustment of An. If we ignore K in choosing a model, we can be misled since K need not be negligible (see section 8). Such concerns are pertinent to the ensuing approximations for the other variants of the Bayes factor and encourage exact calculation as discussed in section 7. The pseudo Bayes factor provides an interesting case for asymptotic approximation. It is built from the cross validation distributions f(yrly(r), Mi) which lend themselves to a variety of approximations based upon (14) and the different but asymptotically equivalent estimators of 0G discussed above. Suppose, for instance, in the numerator of (14) we replace Oi(S
1 ,$2 )
by Pn and in the denominator we replace
ei(S 2 )
by
.,
the MLE of 0. based upon the data with yr removed. We obtain Ir
n
1
lI1 f(y.j IVinM ) 7ri i(U)OIll(H (o.if(YrIY(r)M
I
j#n rf(yr1
In
,r),M.) (r))
H1'( (
$,n
14 .r()1
Since the second and third ratios on the right hand side of (17) tend to 1, we also have rI f(YjVin' Mi) f(yrIY(r), Mi)
j1 j Orf(Y
(18)
i•(r
and HI f(yj[I0i,n, Mi) If f(Yr IY(r),I Mi)d r
I r
(19a) 11rf(y I P.(r)
0j rM
,nMi)
Two obvious simplifying approximations of (19a) are 'I f(yrIy(r), Mi) z 11 f(yrI Oi, r r
Mi) = L(Oi,n; y, M.)
(19b)
and nf1 (YrIY(r)' Mi) z 17 f(yr
in
Mi).
(19c)
Stone (1977) and Geisser and Eddy (1979) discuss these approximations calling (19b) the predictive likelihood and (19c) the quasipredictive likelihood. Let us compare all three
12
approximations. Stone looks at (19b) and (19c), and argues that lina (log II f
.(r , M1 ) log
I1f ( y rI
n4
1,
r
r
',n
M))
Pi
We can readily compare (19a) and (19c). After some manipulation
1. f(yj[ 0in, log 11
rr, I j (.. j#r J i'n
Mi)

log1°
r
I (r)' Mnf(y M
rA,n
= X(2log f(yjI i,n MI)log f(yrIe.(), M1))
r j
=
0. )T
1 in
0 i,n)
in
(20)
where Hi is the Hessian matrix of L(0i; y, Mi) and r(r) lies between 1whr
I,n
0 .(r)
and 0.
)n
By standard argumentation for quadratic forms we may show that, as n 1w, (20)
.
pi/2.
Finally, using the definition (4), the approximation (19a), and the above limiting relationships between (19a), (19b) and (19c) we obtain log PsBF z log A + "2 
(21)
Hence, using the approximation (19a) we obtain the Nelder and Wedderburn (1972) adjustment of An as in Section 2. Stone observes that using the approximation (19c) we obtain the customary AIC adjustment (o = 1) of An (Akaike, 1973). Turning to the PoBF and recalling earlier discussion about the case (c) asymptotics, suppose in (14) we replace 0.(SS2) and .(S2) with 8i9n* Then we obtain SPoBF z An 2(p 2 p 1 )/2 and log PoBF z log An +
2"
log 2.
(22)
Result (22) along with some additional asymptotic calculations are provided in Aitkin (1991). Hence, the correcton of the likelihood associated with the PoBF falls below that of Nelder and Wedderburn which, in turn, falls below the customary ,IC adjustment. Neither the PsBF nor PoBF will suffer the Lindley paradox.
13
We next consider the IBF. From expressions (7) and (8), given (15) or (16) the asymptotic behavior of the IBF depends upon that of the right hand side summations. Though the components of the sums are not necessarily independent, in many cases a "law of large numbers" argument will produce a limiting constant whence, asymptotically the IBF behaves like a multiple of the Bayes factor. 7. EXACT CALCULATIONS USING MONTE CARLO METHODS The accuracy of the previous analytic approximations is unknown in practice. Additionally, these approximations do not produce functional forms since required modes can rarely be obtained as explicit functions of the data. Rather, for a given observed sample, P's of O's would be calculated numerically yielding a numerical value for (3). Therefore, we can not study the behavior of or features of such predictive distributions. A samplingbased approach is attractive in avoiding the above difficulties. Such simulation approaches might be noniterative as in standard Monte Carlo (see e.g. Geweke, 1989) or iterative as for example using the Gibbs sampler or other Markov chain Monte Carlo techniques (see e.g. Gelfand and Smith, 1990; Tierney, 1991). We present no detail regarding these techniques. Rather we just describe how they enable arbitrarily accurate estimates of (3) as a function ofYs1 or of expectations associated with (3). Suppose g( 8) is taken as an importance sampling density for L( i; ys2; Mi)7ri( i).
If •j' j = ,...,Bi is a sample from g and we define wij = L(#ij; YS2 ; Mi) 7(9ij)/g(ýij) then a Monte Carlo integration for (3) is i) 1= ; L(O*i,
f(yS IYS, 1
21
ySl; Mi)  w../E wij. 1' j(23 M1 y
(23)
If a Markov chain Monte Carlo technique has been used, the output is usually taken to be a sample 0,j, j = 1,...,Bi from the posterior vr(jIy). But then we can take the posterior as the importance sampling density in (23) resulting in the approximant E 1
where S
2
1 .L(iJ';Ysc'Mi)

L(ýiJ; Ys1'Mj) J L(uiJ; ySMi)
(24)
J  S2 . The estimator (24) is routine to calculate and simulation consistent.
14
(L(Mij, Y2 Mi
However, its precision depends upon the stability of the weights wi that is, upon how good an importance sampling density 7r•( tIy) is for
7ri(
IYs2 ). In the
case of the PoBF it is perfect and the resulting estimate takes the simple form B2
B1 E L(Ol'j; ySI' M1 )/BI. j=1 3
(I
L(*2j; ySI' M2)/B2)j=l 1
In the context of cross validation we would expect
xi(OiOy)
(25)
to be a good importance
sampling density for each iri(Oi Y(r)) and (24) becomes a harmonic mean
f(yrIY(r), Mi) = B.(E
(26)
1
SJfr(yrI
Mi)
from which the PsBF can be straightforwardly calculated. Consider the special case of f(y, Mi). Here, if r(O) is a proper density, (f(y, Mi)
1
= J
ML(; y, Mi)7rj(#i)
•i() .ri.(Gly) dO1.
Thus, our estimator becomes
O )
f(y, Mi)  ( ' L(;
y, M
1
(27)
J
In (27) r plays the role of an importance sampling density and natural choices to "match" the posterior would be multivariate normal or t densities with mean and covariance computed from the Oi'.'s. If 7ri is proper we could take it as 7 obtaining an estimator of f(y; Mi) proposed by Newton and Raftery (1991). Finally, suppose f(Ys 1Ys 2 ; Mi)
is proper and we seek the expectation of h(ySl)
with respect to this density. Suppose the conditional expectation
al(Oi) = E(h(ys )I Oi,
Mi) with respect to L(i; ySl, Mi) is available explicitly. Then since Eh(YSI2; Mi) SJal(0i)
ri (OilYs2) dOi =
15 "7i(ýiiVs2(.) ri(.I y) dei = f(y Jyss; Mi) "fal(Oi) I dOi 2 L(O,;ysC) Mi)
f al(Oi) iOLy
a Monte Carlo integration for this expectation using (24) takes the form E~~y Y ;Mi) 
1
2
al(Oij)/L(Oij; ys C,Mi)]/ E 1/L(Oi j; y c, Mi)
2
JJ
If al(fi) is unavailable explicitly, but yi
ryS
then E h(yS
1 draw Y9

; Mi)
2
,j
s
" f(YS IYS'
where
8ij, are a sample from
Mi) it suffices to
21
1
111i
L(Oij; yS1, Mi)
M2
j = 1.."'Bi is a sample from f(yS 1 YS2; Mi)
B 1 j h(yl ). To draw yS
1
(30)
ri(I yS2).
Smith and
Gelfand (1992) discuss converting a sample from one posterior to that from another. In the present case, we must convert the 0ij from ri(#.Iy) to Of from 7i(0i1yS 8. A NUMERICAL EXAMPLE We consider a data example where the objective is to choose between nested nonlinear models. We employ both asymptotic results (Section 6) and exact calculation (Section 7) to numerically compare An and the various Bayes factors. The data concerns the steady state adsorption of oxylene as a function of oxygen concentration, inlet oxylene concentration and temperature. The sample of 57 points is presented along with the full model in Bates and Watts (1988, p. 306309). The full model M2 is, in fact, yj = bi~xj, D1M 2 ) + Ej where xj is 3xl and the error E is assumed independent N (0, e 2 ). Here bi(x, 0) = b1 b2 /(b 1 + .22788 b2 ) with bl (x, I) = 01 x 1 exp(0 3 /x 3 ) and b2 (xl 0) = 02 x 2 exp(0 4 /x 3 ). A convenient reduced model, M1 , sets 03 = 04 yielding yj = br(xj, 01M 1 ) + exp(O 3 /x 3 )/(0 1 xI + 2.2788 02 x2 ). Hence, we take a flat prior on
=i =
where br(x, 01MI) = #1 02 Xlx 2
The modeling implicitly assumes that all
log Oi independent of the prior r(a2)
0i > 0.
r,2 on a
A nonlinear regression fitting package (SAS PROC NLIN) was used for M1 and for M2 to obtain the maximum likelihood estimates of all the parameters and thus to compute the likelihood ratio statistic A
P2 )/2} which turns out to be .004.
(;22/;2)n/2exp{p n
2
exP1P2)/}
wichturs
ou
tobe
004
16
The maximum likelihood estimates and their asymptotic covariance were used to obtain, for each model, a multivariate tdistribution which served as an importance sampling density for a noniterative Monte Carlo approach. "Exact" calculations based on 10,000 simulations along with asymptotic approximations axe given in Table 1.
"Exact" Calculation Asymptotic Approximation"
log An
log BF
log PSBF
log PoBF
5.52
4.94
5.13
5.27
3.50
5.02
5.17
Table 1. Monte Carlo estimates and asymptotic approximations for model selection criteria Table 1 indicates that regardless of the criterion, the full model is overwhelmingly selected. The asymptotic approximation for BF is poor suggesting that for such very nonlinear models the sample size 57 is not large enough; K in (16) is not negligible. More precise approximation using (15) is nontrivial in the present case and seems hard to justify given the relative ease with which the simulation approach can be implemented. 9. CONCLUDING REMARKS Our effort here has focused on modeling situations where pi < < n, regular models.
socalled
Though this encompasses a broad range of classical problems, much of
contemporary Bayesian modeling considers hierarchical or structured random effects models. In such cases pi or P2  P1 can tend to w as n , m. Among the disasters which befall us in such nonregular problems are: i) all of the asymptotics presented here break down ii) parameters may not be consistently estimated (a matter which was recognized as early as Neyman and Scott, 1948) iii) under improper priors the posterior need not be proper (a form of Bayesian nonidentifiability). Attention to the prior specification can remedy (ii) and (iii) but not (i). Hence, the model choice criteria discussed here require exact calculation through the approaches of Section 7. In this regard a valuable supplement to these experiment4evel criteria is investigation of model performance at the observation or case level.
REFERENCES Aitkin, M. (1991). Posterior Bayes factors. J.R. Statist. Soc. B, 1, 111142.
17
Atkinson, A.C. (1978). Posterior probabilities for choosing a regression model. Biometrika, 65, 3948. Akaike, H. (1973). Information theory and the extension of the maximum likelihood principle. In: Proc. 2nd Int. Symp. Information Theory (eds. B.N. Petrior and F. Csaki), 267281. Budapest: Akademiai Kiado. Bartlett, M. (1957). 533534.
A comment on D.V. Lindley's statistical paradox, Biometrika 44,
Bates, D.M. and Watts, D.G. (1988). Nonlinear Regression Analysis and Its Applications. John Wiley and Sons. Berger, J. (1984). The robust Bayesian viewpoint. In: Robustness of Bayesian Analysis, J. Kadane, ed., p. 63124, North Holland, Amsterdam. Berger, J. (1985). New York.
Statistical Decision Theory and Bayesian Analysis, SpringerVerlag,
Berger, J.O. and Pericchi, L.R. (1992). The intrinsic Bayes factor. Department of Statistics, Purdue University.
Technical Report,
Bhansali, R.J. and Downham, D.Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika 64, 547551. Box, G. (1980). Sampling and Bayes' inference in scientific modeling and robustness (with discussion). J.R. Statist. Soc. A, 143, 382430. DeGroot, M.H. (1970). Optimal Statistical Decisions. McGrawHill, New York. Geisser, S. (1975). The predictive sample reuse method with application. J. Amer. Statist. Assoc. 70, 320328, 350. Geisser, S. (1980). Discussion to paper of G.E. P. Box. J.R. Statist. Soc. A, 143, 416417. Geisser, S. (1988). The future of statistics in retrospect. In: Bayesian Statistics, 3, (J. Bernardo, et. al., eds.). Oxford University Press, Oxford, 147158. Geisser, S. and Eddy, W. (1979). A predictive approach to model selection. Statist. Assoc., 74, 153160.
J. Amer.
Gelfand, A.E. and Smith, A.F.M. (1990). Samplingbased approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398409. Gelfand, A.E., Dey, D.K. and Chang, H. (1992). Model determination using predictive distributions with implementation via samplingbased methods. In: Bayesian jtalitif,L4 (J. Bernardo, et. al., eds.). Oxford University Press. Oxford, 147167. Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 13171339. Hannan, E.J. and Quinn, B.G. (1979). The determination of the order of an autoregression. J.R. Statist. Soc. B, 41, 190195.
18
Jeffreys, H. (1961). Theory of Probability (3rd Edition). Oxford University Press, London. Kass, R. and Vaidyanathan, S.K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. J.R. Stat. Soc. B, 54, 129144. Nelder, J. and Wederburn, R.W.M. (1972). A, 135, 370384.
Generalized linear models. J.R. Statist. Soc.
Newton, M.A. and Raftery, A.E. (1991). Approximate Bayesian inference by the weighted likelihood bootstrap. Technical Report 199, University of Washington. Neyman, J. and Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 132. O'Hagan, A. (1991). Discussion to paper of M. Aitkin. J.R. Statist. Soc. B, 136. Pefia, D. and Tiao, G.C. (1992). Bayesian robustness functions for linear models. Bayesian Statistics 4 (J. Bernardo, et. al. eds.) Oxford University Press, 365389. Pericchi, L. (1984). An alternative to the standard Bayesian procedure for discrimination between normal linear models. Biometrika, 71, 576586. Pettit, L.I. and Young, K.D.S. (1990). Measuring the effect of observation on Bayes factors. Biometrika, 77, 3, 455466. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461464. Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147164. Smith, A.F.M. and Gelfand, A.E. (1992). Bayesian statistics without samplingresampling perspective. Amer. Statist. 46, 2, 8488.
tears: a
Smith, A.F.M. and Spiegelhalter, D. (1980). Bayes factors and choice criteria for linear models. J.R. Statist. Soc. B, 42, 213220. Spiegelhalter, D. and Smith, A.F.M. (1982). Bayes factors for linear and loglinear models with vague prior information. J.R. Statist. Soc. B, 377387. Stone, M. (1974). Crossvalidatory choice and assessment of statistical predictions. J.R. Statist. Soc. B, 36, 111147. Stone, M. (1977). An asymptotic equivalence of choice of model by crossvalidation and Akaike's criterion. J.R. Statist. Soc. B, 39, 4447. Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc., 81, 8286. Tierney, L. (1991). Markov chains for exploring posterior distributions. Technical Report 560, University of Minnesota
U4C'LUSSIFIED :ECURIY CLASSIFICATION Of THIS PAGE r~o
Dmi aseremo
1
REPORT DOCUMENTATION PAGE 1.REPORT
NUMagER
TITLE (and
4.
BEFORE COPEIGFORM
a. GOVT ArCCESiON No] 3.
&aEIM018o
MECIPIENT'S CATALOG N#uoSER
S. TYPE Of REOR
Bayesian Model Choice:
Asymptotics & Exact
Technical 6.
7.
N002592J1264
PERFORMING ORGANIZATION NAME AND AGGRIESS
10. PROGRAM IELEMEN0T. PROJECT. TASK ARE9A 4 WORK UNIT MNUOZ*S
Department of Statistics Stanford University Stanford, CA 943054065 I1.
NR042267
CONTROLLING OFFICE NAME AND AOORESS
12. REPORT DATE
Office of Naval Research Statistics & Probability Program Code
PERFORMING OAG. REPORT NUMOILM
11. CONTRACT OR GRANT NdUM11E01(s
AUTHIOR(@)
A. E. Gelfand and D.K. Dey 9.
A PIRIoo COVERED
illS~Idll..
1993A
i
1.'mER OF PAGES
I2
i.Cwuaae
14 MONITORING AGENCY NAME I
I i
.jinip
13. ..
DRSit91emsboCmilnCoffece)
SECURITY CLASS. (attis 00000pFI)
15.
Unclassified OIECLASSIPISCATICONI
isa.
OUNGIAOINOS
SCHEDUJLE IS.

DISTRIBUTION STATEMENT (of this Roe.sj
Approved for public release;

1T.
distribution unlimited.
OiSTRIGUTION STATEMENT (of the oeb.,mt @moored in alle.. 20. Itdflled~t fre.
It. SUPPLEMMUTARY NOTES
a.,."o)
THE VIEw, OPINIONS. AND/OR FINDINGS CONTAINED IN THIS REPORT ARE THOSE OF THE AUTHOR,'S) AN4D SHOULD !VCT BE CONSTRUED AS AN OFF;CIAL DEPARTMENT OF THE ARMY POS:T:O:*,:. POLICY. CR DECISION, UNLESS SO DESICNATED 13Y OTHER DOCUM.ENTATION.
is.
KEY WORDS (ConiMmo. an rversee 48do to no..." mi fdowaIu
or'Ieeko .wme.)
Key wards: Bayes hactor, Laplace approdmations, Likelihood ratio statistics, Monte Carlo methods. SO.
AMSTSARACT (CoItm.
an rewers side it 00000oe.
a"d IdSikoifa,' 0104
wwe)
See Reverse Side
DOW 147
0 214 6u 1v1
UNCLASSIFIEDobi $11CUPR1TV C16AUIICATION OF TWOS PAC
(
De ý'i¶ce~
Bayesian Model Choice: Asymptotics and Exact Calculations A.E. Gelfand and D.K. Dey*
SUMMARY Model determination is a fundamental data analytic task. Here we consider the problem of choosing amongst a finite (with loss of generality we assume two) set of models. After briefly reviewing classical and Bayesian model choice strategies we present a general predictive density which includes all proposed Bayesian approaches we are aware of. Using Laplace approximations we can conveniently assess and compare asymptotic behavior of these approaches. Concern regarding the accuracy of these approximation for small to moderate sample sizes encourages the use of Monte Carlo techniques to carry out exact calculations. A data set fit with nested non linear models enables comparison between proposals and between exact and asymptotic values.