
For any statistical analysis, model selection is necessary and required, and in many selection problems the Bayes factor is one of the basic building blocks. For the one-sided hypothesis testing problem, we extend the agreement between frequentist and Bayesian evidence to the generalized p-value, and study the agreement between the generalized p-value and the posterior probability of the null hypothesis. For the point-null hypothesis testing problem, the Bayesian evidence under the traditional Bayesian testing method, namely the Bayes factor or the posterior probability of the point null hypothesis, is analyzed; it conflicts with the classical frequentist evidence given by the p-value, a phenomenon known as the Lindley paradox. Many statisticians have worked on this problem from both the frequentist and the Bayesian perspective. In this paper I focus on the Bayesian approach to model selection, starting from Bayes factors and working through the Lindley paradox, with brief remarks on the partial and fractional Bayes factors; my aim is to consider the paradox in a simple way. In addition, detailed derivations of BIC and AIC are given in Section 4. The guiding principle for selecting the optimal model is to balance two aspects: maximizing the likelihood function and minimizing the number of unknown parameters in the model. A larger likelihood value means a better fit, but fitting accuracy alone cannot be the criterion, since it leads to more and more unknown parameters and an increasingly complex model that overfits. A good model should therefore balance fitting accuracy against the number of unknown parameters.

Many statisticians are naturally drawn to the question of model selection [

$$P(M_k \mid \mathrm{Data}) \propto \pi_k \int_{\Omega_k} f_k(\mathrm{Data}\mid\theta_k)\, g_k(\theta_k)\, d\theta_k, \qquad P(M_k \mid \mathrm{Data}) = \frac{\pi_k \int_{\Omega_k} f_k(\mathrm{Data}\mid\theta_k)\, g_k(\theta_k)\, d\theta_k}{\sum_{j=1}^{K} \pi_j \int_{\Omega_j} f_j(\mathrm{Data}\mid\theta_j)\, g_j(\theta_j)\, d\theta_j}$$

In a Bayesian analysis, the prior $\pi_k$ on each model and the prior $g_k(\theta_k)$ on the parameters of model $k$ are proper and subjective. The Bayesian solution is to compute the posterior probability $P(M_k \mid \mathrm{Data})$ for each model and, for model selection, to choose the model that maximizes $P(M_k \mid \mathrm{Data})$.
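As a small illustration of this rule, the following sketch normalizes $\pi_k m_k$ across the candidate models and picks the model with the largest posterior probability; the marginal likelihoods and equal prior model probabilities are purely hypothetical numbers.

```python
import numpy as np

# Hypothetical marginal likelihoods m_k = ∫ f_k(Data|θ_k) g_k(θ_k) dθ_k
# and prior model probabilities π_k for three candidate models.
marginal_lik = np.array([2.3e-5, 7.1e-5, 1.4e-5])
prior_model = np.array([1 / 3, 1 / 3, 1 / 3])

# Posterior model probabilities: P(M_k | Data) ∝ π_k m_k, normalized over models.
unnormalized = prior_model * marginal_lik
posterior = unnormalized / unnormalized.sum()

best = np.argmax(posterior)
print(posterior, "-> select model", best + 1)
```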

However, the Bayes factor has its own limitation: by itself it can only quantify how strongly a hypothesized model is supported against a null model [

In Section 2, we give a simple and general explanation of the Bayes factor. Then, in Section 3, we discuss Lindley's paradox. Section 4 contains the main theoretical part, the derivations of AIC and BIC, together with a simple example of their use.

Before anything else, we first construct one of the most important quantities in Bayesian methods, the Bayes factor [

Suppose we have data $D$, a parameter $\theta$ with a prior distribution, and two different models $M_1$ and $M_2$. By the conditional probability rule, we have:

$$P(M_1 \mid D) = \frac{P(D \mid M_1)\, P(M_1)}{P(D)}$$

Recall that for the prior odds we have $P(M_1) = 1 - P(M_2)$. The term $P(D \mid M_1)$ is the marginal likelihood, $P(D \mid M_1) = \int P(D \mid \theta, M_1)\, P(\theta \mid M_1)\, d\theta$, where $P(\theta \mid M_1)$ is the prior on $\theta$ under $M_1$. Then, by Bayes' rule,

$$P(M_1 \mid D) = \frac{P(D \mid M_1)\, P(M_1)}{P(D \mid M_1)\, P(M_1) + P(D \mid M_2)\, P(M_2)}$$

$$\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}$$

where $P(D \mid M_1)/P(D \mid M_2)$ is defined as the Bayes factor; note that it is also the ratio of the marginal likelihoods. We denote the Bayes factor as:

$$B_{1,2}(y) = \frac{P(y \mid M_1)}{P(y \mid M_2)}$$

The Bayesian method suits many model testing problems because, in contrast to p-values, it can quantify decisive evidence in favour of the null model [

$$B_{2,1} = \frac{\int P_2(D \mid \theta_2)\, \pi_2(\theta_2)\, d\theta_2}{\int P_1(D \mid \theta_1)\, \pi_1(\theta_1)\, d\theta_1}$$

When this ratio is large, the evidence from the data favours $H_2$ against $H_1$. In this way the Bayes factor avoids many limitations of p-value testing, and Bayes factors for testing statistical models have been applied in many areas of research [
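To make the definition concrete, here is a minimal numerical sketch. It assumes normal data with known variance, a point-null model $M_1: N(0,1)$ and an alternative $M_2: N(\theta,1)$ with a hypothetical prior $\theta \sim N(0, 2^2)$; the marginal likelihood of $M_2$ is obtained by quadrature, which is feasible here only because the parameter is one-dimensional.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical data: n observations from N(theta, 1).
rng = np.random.default_rng(0)
data = rng.normal(loc=0.8, scale=1.0, size=20)

def log_lik(theta):
    # log-likelihood of the data under N(theta, 1)
    return stats.norm.logpdf(data, loc=theta, scale=1.0).sum()

# M1: theta fixed at 0, so the "marginal likelihood" is just the likelihood at 0.
m1 = np.exp(log_lik(0.0))

# M2: theta unknown with prior N(0, sigma^2 = 4); marginal likelihood by quadrature.
prior = stats.norm(loc=0.0, scale=2.0)
integrand = lambda t: np.exp(log_lik(t)) * prior.pdf(t)
m2, _ = quad(integrand, -10, 10)

print("Bayes factor B_{2,1} =", m2 / m1)
```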

Lindley's paradox shows how a p-value (or the number of standard deviations) used as frequentist evidence can conflict with the Bayesian answer [

When we are faced with improper priors (priors that do not integrate to one) in the null hypothesis and in model selection, problems arise. Such priors can be acceptable for other purposes, but here they cause difficulties. Consider testing the hypotheses:

$$H_0: \theta \in \Theta_0 \quad \text{vs} \quad H_1: \theta \in \Theta_1$$

Writing the marginal density of $\theta$, we can use the following mixture model:

$$p(\theta) = p(\theta \mid H_0)\, p(H_0) + p(\theta \mid H_1)\, p(H_1)$$

Assuming that $p(\theta \mid H_0)$ and $p(\theta \mid H_1)$ are proper density functions, the posterior is given by:

$$p(H_0 \mid D) = \frac{p(H_0)\, p(D \mid H_0)}{p(H_0)\, p(D \mid H_0) + p(H_1)\, p(D \mid H_1)} = \frac{p(H_0)\int_{\Theta_0} p(D \mid \theta)\, p(\theta \mid H_0)\, d\theta}{p(H_0)\int_{\Theta_0} p(D \mid \theta)\, p(\theta \mid H_0)\, d\theta + p(H_1)\int_{\Theta_1} p(D \mid \theta)\, p(\theta \mid H_1)\, d\theta}$$

Now suppose instead that we use improper priors, with $p(\theta \mid H_0) \propto z_0$ and $p(\theta \mid H_1) \propto z_1$. Then:

$$p(H_0 \mid D) = \frac{p(H_0)\, z_0 \int_{\Theta_0} p(D \mid \theta)\, d\theta}{p(H_0)\, z_0 \int_{\Theta_0} p(D \mid \theta)\, d\theta + p(H_1)\, z_1 \int_{\Theta_1} p(D \mid \theta)\, d\theta} = \frac{p(H_0)\, z_0 s_0}{p(H_0)\, z_0 s_0 + p(H_1)\, z_1 s_1}$$

where, for model $i$, $s_i = \int_{\Theta_i} p(D \mid \theta)\, d\theta$ is the integrated (marginal) likelihood. Assume further that $p(H_0) = p(H_1) = \tfrac{1}{2}$.

Then an equation can be obtained:

$$p(H_0 \mid D) = \frac{z_0 s_0}{z_0 s_0 + z_1 s_1} = \frac{s_0}{s_0 + (z_1/z_0)\, s_1}$$

Thus, by choosing the arbitrary constants $z_0, z_1$ differently, we can change the posterior arbitrarily. Proper but very diffuse priors cause a similar problem, because the marginal probability of the data under a complex model with a diffuse prior becomes very small, so the comparison automatically favours the simpler, more sharply specified model. This is known as the Lindley paradox.
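The dependence on the arbitrary normalizing constants can be seen numerically. In the sketch below the integrated likelihoods $s_0, s_1$ are hypothetical fixed numbers; only the unidentified ratio $z_1/z_0$ is varied, and the posterior probability of $H_0$ can be pushed anywhere in $(0,1)$.

```python
import numpy as np

# Hypothetical integrated likelihoods s_i = ∫_{Θ_i} p(D|θ) dθ for the two hypotheses.
s0, s1 = 0.012, 0.030

# With improper priors p(θ|H_i) ∝ z_i, only the ratio z1/z0 enters the posterior,
# and that ratio is arbitrary, so the posterior moves with it.
for ratio in [0.01, 0.1, 1.0, 10.0, 100.0]:
    post_h0 = s0 / (s0 + ratio * s1)
    print(f"z1/z0 = {ratio:7.2f}  ->  p(H0 | D) = {post_h0:.3f}")
```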

Many authors [ have studied the following point-null testing problem:

$$H_0: \theta = 0 \quad \text{vs} \quad H_1: \theta \neq 0$$

in the normal model $N(x \mid \theta, 1)$. The prior probability of the null hypothesis is $p(H_0) = \rho_0$.

Let $\pi(\theta) = N(\theta \mid 0, \sigma^2)$ ($\sigma > 0$) be the prior distribution for the unknown parameter $\theta$ in the model.

The Bayes factor is given by:

$$B = \frac{N(x \mid 0, 1)}{\int N(x \mid \theta, 1)\, \pi(\theta)\, d\theta}$$

In order to consider the paradox, we can formalise it and compare the two following normal models:

$$M_0 = \left\{ N(x \mid 0, 1) = (2\pi)^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2} x^2\right) \right\}$$

$$M_1 = \left\{ N(x \mid \theta, 1) = (2\pi)^{-\frac{1}{2}} \exp\left\{-\tfrac{1}{2} (x - \theta)^2\right\} \right\}$$

Consider a physical system in which a quantity X can be measured. We use $\sigma$ to define both priors: the prior probability of the null hypothesis is $\rho_0$, and we allow $\rho_0$ to depend on $\sigma$.

The Bayes factor, representing the odds in favour of the null hypothesis $H_0$, is:

$$B_0 = \frac{N(x \mid 0, 1)}{\int N(x \mid \theta, 1)\, N(\theta \mid 0, \sigma^2)\, d\theta} = \frac{e^{-\frac{1}{2}x^2}}{e^{-\frac{1}{2}\frac{x^2}{\sigma^2+1}}}\,\sqrt{\sigma^2+1}$$

In this case, the prior probabilities $p(H_0)$ and $p(H_1) = 1 - p(H_0)$ of the two hypotheses are specified. Given the observation $x$, Bayes' theorem gives:

$$p(H_m \mid x)\, p(x) = p(x \mid H_m)\, p(H_m)$$

for $m = 0, 1$, where $p(H_m)$ is the prior probability, $p(x \mid H_m)$ is the conditional distribution of the data under $H_m$, and $p(x) = p(x \mid H_0)\, p(H_0) + p(x \mid H_1)\, p(H_1)$ is the overall (marginal) distribution. The posterior probability of hypothesis $H_m$ is $p(H_m \mid x)$. By Bayes' theorem, $p(H_0 \mid x)$ is given by:

$$p(H_0 \mid x) = \frac{p(x \mid H_0)\, p(H_0)}{p(x)} = \frac{p(x \mid H_0)\, p(H_0)}{p(x \mid H_0)\, p(H_0) + p(x \mid H_1)\, p(H_1)} = \left[1 + \frac{p(x \mid H_1)\, p(H_1)}{p(x \mid H_0)\, p(H_0)}\right]^{-1} = \left[1 + \frac{1 - p(H_0)}{p(H_0)}\, \frac{p(x \mid H_1)}{p(x \mid H_0)}\right]^{-1}$$

Then, writing $\pi_m(\theta) \equiv p(\theta \mid H_m)$ for the prior under each hypothesis, we spread the prior probability under the alternative as a normal distribution with standard deviation $\sigma$, so:

$$\pi_1(\theta) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{\theta^2}{2\sigma^2}\right\}$$

Evaluating the conditional probabilities:

$$p(x \mid H_m) = \int \pi_m(\theta)\, p(x \mid \theta)\, d\theta$$

Evaluating $p(x \mid H_0)$ and $p(x \mid H_1)$ in this way, we obtain overall:

$$p(H_0 \mid x) = \left[1 + \frac{1-\rho_0}{\rho_0}\,\frac{e^{-\frac{1}{2}x^2/(\sigma^2+1)}}{e^{-\frac{1}{2}x^2}}\,\frac{1}{\sqrt{\sigma^2+1}}\right]^{-1} = \left[1 + \frac{1-\rho_0}{\rho_0}\,\frac{1}{B_0}\right]^{-1}$$

With this expression in hand, we can discuss the prior $\rho_0(\sigma)$. Our approach is to measure how far from zero the alternative is assumed to be. By the asymptotic behaviour of Bayesian inference, if the model is incorrectly specified, the posterior accumulates on the model that is closest to the true model in Kullback-Leibler divergence [

$$\int_{\Theta} D_{KL}\big(N(x \mid \theta, 1)\,\|\,N(x \mid 0, 1)\big)\,\pi(\theta)\,d\theta = \int \tfrac{1}{2}\theta^2\,\pi(\theta)\,d\theta = \tfrac{1}{2}\sigma^2$$

The model prior represents the loss associated with a probability statement, as determined by the self-information loss function. So the prior weight on the alternative model is:

$$1 - \rho_0(\sigma) \propto e^{\frac{1}{2}\sigma^2}$$

The prior weight of the null hypothesis is $\rho_0(\sigma) \propto 1$, so we get:

$$\rho_0(\sigma) = \frac{1}{1 + \exp\left\{\frac{1}{2}\sigma^2\right\}}$$

In the regime of large $\sigma$, $p(H_0 \mid x, \sigma)$ goes to zero because $p(H_0) = \rho_0(\sigma) \to 0$. With this prior the method is consistent, and we do not advocate choosing a large $\sigma$.
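The following sketch evaluates the formulas above for a hypothetical observation $x = 2$ (two-sided p-value about 0.046). With a fixed prior weight $\rho_0 = \tfrac{1}{2}$ the posterior probability of $H_0$ grows towards one as $\sigma$ increases (the Lindley paradox), whereas with the $\sigma$-dependent prior $\rho_0(\sigma)$ above it does not.

```python
import numpy as np

x = 2.0  # hypothetical observation; two-sided p-value ≈ 0.046, "significant" at 5%

def bayes_factor_b0(x, sigma):
    # B0 = N(x|0,1) / N(x|0, sigma^2 + 1), from the closed form above
    return np.sqrt(sigma**2 + 1) * np.exp(-0.5 * x**2 * sigma**2 / (sigma**2 + 1))

for sigma in [1.0, 2.0, 5.0, 10.0, 20.0]:
    b0 = bayes_factor_b0(x, sigma)
    # Fixed prior weight rho0 = 1/2: the Lindley paradox, p(H0|x) grows with sigma.
    post_fixed = 1.0 / (1.0 + 1.0 / b0)
    # sigma-dependent prior rho0(sigma) = 1 / (1 + exp(sigma^2 / 2)).
    odds_alt = np.exp(0.5 * sigma**2)          # (1 - rho0(sigma)) / rho0(sigma)
    post_adaptive = 1.0 / (1.0 + odds_alt / b0)
    print(f"sigma={sigma:5.1f}  p(H0|x) fixed prior={post_fixed:.3f}  "
          f"adaptive prior={post_adaptive:.3f}")
```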

| $y$ | observed data $y_1, \cdots, y_n$ |
|---|---|
| $M_i$ | candidate model |
| $\theta_i$ | vector of parameters in the model |
| $g(\theta_i)$ | the prior density of the parameters $\theta_i$ |
| $P(y \mid M_i)$ | marginal likelihood |
| $f(y \mid \theta_i)$ | the density of the data given $\theta_i$ |
| $L(\theta_i \mid y)$ | the likelihood of $y$ given the model $M_i$ |

In this section we are going to talk about the basic idea [

As shown in Section 2, $B_{1,2}(y) = P(y \mid M_1)/P(y \mid M_2)$ is the Bayes factor for two models; we now consider several models $M_i$, $i \in \{1, \cdots, n\}$, with

$$P(y \mid M_i) = \int f(y \mid \theta_i)\, g_i(\theta_i)\, d\theta_i = \int \exp\left(\log\big(f(y \mid \theta_i)\, g_i(\theta_i)\big)\right) d\theta_i$$

where $\theta_i$ is the vector of parameters in model $M_i$, $f(y \mid \theta_i) = L(\theta_i \mid y)$ is the likelihood function, and $g_i(\theta_i)$ is the p.d.f. of the prior distribution of the parameters $\theta_i$.

Denoting $\tilde{\theta}_i$ as the posterior mode and letting $Q(\theta_i) = \log\big(f(y \mid \theta_i)\, g_i(\theta_i)\big)$, a Taylor expansion about $\tilde{\theta}_i$ gives
$$Q(\theta_i) \approx \log\big(f(y \mid \tilde{\theta}_i)\, g_i(\tilde{\theta}_i)\big) + (\theta_i - \tilde{\theta}_i)^{T}\,\nabla_{\theta_i} Q\big|_{\tilde{\theta}_i} + \tfrac{1}{2}(\theta_i - \tilde{\theta}_i)^{T} H_{\theta_i}(\theta_i - \tilde{\theta}_i).$$

where $H_{\theta_i}$ is a $|\theta_i| \times |\theta_i|$ matrix with entries $H_{mn} = \frac{\partial^2 Q}{\partial\theta_m \partial\theta_n}\big|_{\tilde{\theta}_i}$, and $|\theta_i| = d_i = \dim(\theta_i)$. Since $Q$ attains its maximum at $\tilde{\theta}_i$, the Hessian matrix $H_{\theta_i}$ is negative definite. Let us denote $\tilde{H}_{\theta_i} = -H_{\theta_i}$, and approximate $P(y \mid M_i)$:

$$P(y \mid M_i) \approx \int \exp\left\{Q\big|_{\tilde{\theta}_i} + (\theta_i - \tilde{\theta}_i)^{T}\,\nabla_{\theta_i} Q\big|_{\tilde{\theta}_i} + \tfrac{1}{2}(\theta_i - \tilde{\theta}_i)^{T} H_{\theta_i}(\theta_i - \tilde{\theta}_i)\right\} d\theta_i$$

Then, using the normalization of the multivariate normal distribution,

$$\because \int \frac{1}{(2\pi)^{\frac{d_i}{2}}\, |\tilde{H}_{\theta_i}^{-1}|^{\frac{1}{2}}} \exp\left(-\tfrac{1}{2}(\theta_i - \tilde{\theta}_i)^{T} \tilde{H}_{\theta_i}(\theta_i - \tilde{\theta}_i)\right) d\theta_i = 1$$

$$\Rightarrow \int \exp\left(-\tfrac{1}{2}(\theta_i - \tilde{\theta}_i)^{T} \tilde{H}_{\theta_i}(\theta_i - \tilde{\theta}_i)\right) d\theta_i = (2\pi)^{\frac{d_i}{2}}\, |\tilde{H}_{\theta_i}^{-1}|^{\frac{1}{2}}$$

$$\Rightarrow P(y \mid M_i) \approx f(y \mid \tilde{\theta}_i)\, g_i(\tilde{\theta}_i)\, (2\pi)^{\frac{d_i}{2}}\, |\tilde{H}_{\theta_i}^{-1}|^{\frac{1}{2}}$$

$$\Rightarrow \log P(y \mid M_i) \approx \log f(y \mid \tilde{\theta}_i) + \log g_i(\tilde{\theta}_i) + \frac{d_i}{2}\log(2\pi) + \frac{1}{2}\log|\tilde{H}_{\theta_i}^{-1}|$$

Furthermore, let us bring in the weak law of large numbers. For the given data $y$, $f(y \mid \theta_i)$ is the likelihood $L(\theta_i \mid y)$, and $L$ attains its maximum at the maximum likelihood estimate $\theta_i = \hat{\theta}_i$.

We set
$$g_i(\theta_i) = \begin{cases} 1, & \theta_i \in \big[\tilde{\theta}_i - \tfrac{1}{2},\, \tilde{\theta}_i + \tfrac{1}{2}\big] \\ 0, & \text{otherwise} \end{cases}$$
a flat prior around the mode, so that the posterior mode coincides with the MLE ($\tilde{\theta}_i = \hat{\theta}_i$); then each element of the matrix $\tilde{H}_{\theta_i}$ can be expressed as:

$$\tilde{H}_{mn} = -\frac{\partial^2 \log L(\theta_i \mid y)}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i = \hat{\theta}_i}$$

Then, relating $\tilde{H}_{\theta_i}$ to the Fisher information matrix,

$$\tilde{H}_{mn} = -\frac{\partial^2 \log\left(\prod_{j=1}^{n} L(\theta_i \mid y_j)\right)}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i} = -\frac{\partial^2 \sum_{j=1}^{n}\log L(\theta_i \mid y_j)}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i} = -\frac{\partial^2 \left(\frac{1}{n}\sum_{j=1}^{n} n\log L(\theta_i \mid y_j)\right)}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i}$$

In this case, since the data $y_1, \cdots, y_n$ are i.i.d. and $n$ is large, we can apply the weak law of large numbers: taking the random variables $X_j = n\log L(\theta_i \mid y_j)$, we have $\frac{1}{n}\sum_{j=1}^{n} n\log L(\theta_i \mid y_j) \xrightarrow{P} E\big(n\log L(\theta_i \mid y_j)\big)$. Moreover, for the Fisher information matrix:

$$\tilde{H}_{mn} \approx -\frac{\partial^2 E\big[n\log L(\theta_i \mid y_j)\big]}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i} = -n\frac{\partial^2 E\big[\log L(\theta_i \mid y_j)\big]}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i} = -n\frac{\partial^2 E\big[\log L(\theta_i \mid y_1)\big]}{\partial\theta_m \partial\theta_n}\bigg|_{\theta_i=\hat{\theta}_i} = n I_{mn}$$

$$\Rightarrow |\tilde{H}_{\theta_i}| = n^{|\theta_i|}\, |I_{\theta_i}|$$

where $I_{\theta_i}$ is the Fisher information matrix for a single data point $y_1$. Substituting back, we finally get, for BIC:

$$2\log P(y \mid M_i) \approx 2\log L(\hat{\theta}_i \mid y) + 2\log g_i(\tilde{\theta}_i) + |\theta_i|\log(2\pi) - |\theta_i|\log n - \log|I_{\theta_i}|$$
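As a quick sanity check of this approximation, the sketch below uses a conjugate normal model with hypothetical settings ($n = 50$ observations, prior variance 4), computes the exact marginal likelihood by quadrature, and compares it with the Laplace-type expression above and with the usual BIC that keeps only the $O(\log n)$ terms.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical data: y_1, ..., y_n iid N(theta, 1), with prior theta ~ N(0, v).
rng = np.random.default_rng(1)
n, v = 50, 4.0
y = rng.normal(loc=0.5, scale=1.0, size=n)

log_lik = lambda t: stats.norm.logpdf(y, loc=t, scale=1.0).sum()
prior = stats.norm(loc=0.0, scale=np.sqrt(v))

# Exact marginal likelihood P(y | M) by numerical integration over theta.
marg, _ = quad(lambda t: np.exp(log_lik(t)) * prior.pdf(t), -10, 10)
exact = 2 * np.log(marg)

# Laplace-type approximation from the derivation above:
# 2 log L(theta_hat) + 2 log g(theta_hat) + d log(2*pi) - d log n - log |I|,
# evaluating the prior at theta_hat since the posterior mode is very close to the MLE,
# with theta_hat = y_bar, d = 1 and per-observation Fisher information I = 1.
theta_hat, d, fisher = y.mean(), 1, 1.0
approx = (2 * log_lik(theta_hat) + 2 * prior.logpdf(theta_hat)
          + d * np.log(2 * np.pi) - d * np.log(n) - np.log(fisher))

# The usual BIC keeps only the O(log n) terms: 2 log L(theta_hat) - d log n.
bic = 2 * log_lik(theta_hat) - d * np.log(n)

print(f"exact 2 log P(y|M) = {exact:.2f}, Laplace approx = {approx:.2f}, BIC = {bic:.2f}")
```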

| $M_j = \{P(y \mid \theta_j): \theta_j \in \Theta_j\}$ | different models (each a set of densities) |
|---|---|
| $K(x, y)$ | the Kullback-Leibler distance between $x$ and $y$ |
| $l_j(\theta_j)$ | the log-likelihood function for model $j$ |
| $\hat{P}_j(y) = P(y \mid \hat{\theta}_j)$ | an estimate of $P$ based on model $j$ |
| $d_j$ | the dimension of $\Theta_j$ |
| $Y_j$ | the data drawn from density $P$ |
| $\hat{\theta}_j$ | the MLE of model $j$ |
| $s(y \mid \theta_j) = \partial \log P(y \mid \theta_j)/\partial \theta_j$ | the Jacobian matrix of $\log P(y \mid \theta_j)$ |

We can measure the quality of p ^ j ( y ) (as an estimate of p) by the Kullback-Leibler distance [

$$K(p, \hat{p}_j) = \int p(y)\log\left(\frac{p(y)}{\hat{p}_j(y)}\right) dy = \int p(y)\log p(y)\, dy - \int p(y)\log \hat{p}_j(y)\, dy$$

So, we want to minimize K ( p , p ^ j ) over j, which is the same as maximizing

$$K_j = \int p(y)\,\log p(y \mid \hat{\theta}_j)\, dy$$

To approximate $K_j$, we can use a Monte Carlo estimate:

$$\bar{K}_j = \frac{1}{n}\sum_{i=1}^{n}\log p(Y_i \mid \hat{\theta}_j) = \frac{l_j(\hat{\theta}_j)}{n}$$

However, this estimate is biased, because the data are used twice: first to obtain the MLE and then to estimate the integral by the Monte Carlo method. The bias is approximately $d_j/n$; that is, we should prove [

$$\bar{K}_j - \frac{d_j}{n} \approx K_j$$

Choose $\theta_{j0}$ such that $\int p(y)\log p(y \mid \theta_{j0})\,dy = \max_{\theta_j \in \Theta_j}\int p(y)\log p(y \mid \theta_j)\,dy$ (the parameter value closest to the truth in Kullback-Leibler distance), and let

$$s(y, \theta_j) = \frac{\partial \log p(y \mid \theta_j)}{\partial \theta_j}, \qquad H(y, \theta_j) = \frac{\partial^2 \log p(y \mid \theta_j)}{\partial \theta_j^2}$$

so that $s(y, \theta_j)$ is the Jacobian (score vector) of $\log p(y \mid \theta_j)$, and $H(y, \theta_j)$ is the Hessian matrix of $\log p(y \mid \theta_j)$.

$$\Rightarrow K_j \approx \int p(y)\left(\log p(y \mid \theta_{j0}) + (\hat{\theta}_j - \theta_{j0})^T s(y, \theta_{j0}) + \tfrac{1}{2}(\hat{\theta}_j - \theta_{j0})^T H(y, \theta_{j0})(\hat{\theta}_j - \theta_{j0})\right) dy = K_0 - \frac{1}{2n} Z_n^T J Z_n$$

where

$$K_0 = \int p(y)\log p(y \mid \theta_{j0})\, dy, \qquad Z_n = \sqrt{n}\,(\hat{\theta}_j - \theta_{j0}), \qquad J = -E\big[H(Y, \theta_{j0})\big].$$

$$\Rightarrow \bar{K}_j \approx \frac{1}{n}\sum_{i=1}^{n}\left( l(Y_i, \theta_{j0}) + (\hat{\theta}_j - \theta_{j0})^T s(Y_i, \theta_{j0}) + \tfrac{1}{2}(\hat{\theta}_j - \theta_{j0})^T H(Y_i, \theta_{j0})(\hat{\theta}_j - \theta_{j0})\right) = K_0 + A_n + (\hat{\theta}_j - \theta_{j0})^T S_n - \frac{1}{2n} Z_n^T J_n Z_n = K_0 + A_n + \frac{Z_n^T S_n}{\sqrt{n}} - \frac{1}{2n} Z_n^T J Z_n$$

where

$$A_n = \frac{1}{n}\sum_{i=1}^{n}\big(l(Y_i, \theta_{j0}) - K_0\big), \qquad S_n = \frac{1}{n}\sum_{i=1}^{n} s(Y_i, \theta_{j0})$$

and

$$J_n = -\frac{1}{n}\sum_{i=1}^{n} H(Y_i, \theta_{j0}) \xrightarrow{\;P\;} J$$

$$\bar{K}_j - K_j \approx A_n + \frac{Z_n^T \sqrt{n}\, S_n}{n} \approx A_n + \frac{Z_n^T J Z_n}{n}$$

From the knowledge of asymptotic distribution, we have three claims [

Claim 4.1 $Z_n = \sqrt{n}\,(\hat{\theta}_j - \theta_{j0}) \to N(0, J^{-1} V J^{-1})$, where $V = \mathrm{Var}\big(s(Y, \theta_{j0})\big) = J$ by the information identity.

Claim 4.2 $\sqrt{n}\, S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} s(Y_i, \theta_{j0}) \to N(0, V)$

Claim 4.3 Let $\epsilon$ be a random vector with mean $\mu$ and covariance $\Sigma$, and let $Q = \epsilon^T A \epsilon$. Then,

$$E(Q) = \mathrm{trace}(A\Sigma) + \mu^T A \mu$$

So, with these claims above,

$$\Rightarrow E(\bar{K}_j - K_j) \approx E(A_n) + E\left(\frac{Z_n^T J Z_n}{n}\right) = 0 + \frac{E(Z_n^T J Z_n)}{n} = \frac{\mathrm{trace}(J\, J^{-1} V J^{-1}) + 0^T J\, 0}{n} = \frac{\mathrm{trace}(V J^{-1})}{n} = \frac{\mathrm{trace}(I)}{n} = \frac{d_j}{n}$$

$$\Rightarrow \hat{K}_j = \frac{l_j(\hat{\theta}_j)}{n} - \frac{d_j}{n} = \bar{K}_j - \frac{d_j}{n}$$

So, we define

$$AIC(j) = 2n\hat{K}_j = 2\, l_j(\hat{\theta}_j) - 2 d_j$$
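The bias argument can be checked by simulation. The sketch below assumes a hypothetical true density $p = N(0.3, 1)$ and the one-parameter model $\{N(\theta, 1)\}$, so $d_j = 1$; it repeatedly draws samples, compares the in-sample estimate $\bar{K}_j$ with the exact $K_j$ (which has a closed form in this normal setting), and shows that the average gap is close to $d_j/n$.

```python
import numpy as np

# Simulation sketch: true density p = N(mu0, 1) (hypothetical), model j = {N(theta, 1)}.
# We check that the in-sample estimate K_bar overestimates K_j by about d_j / n.
rng = np.random.default_rng(2)
mu0, n, d_j, reps = 0.3, 30, 1, 20000
const = -0.5 * np.log(2 * np.pi)

gaps = []
for _ in range(reps):
    y = rng.normal(mu0, 1.0, size=n)
    theta_hat = y.mean()                                  # MLE
    k_bar = const - 0.5 * np.mean((y - theta_hat) ** 2)   # in-sample average log-likelihood
    # K_j = E_p[log N(Y | theta_hat, 1)] has a closed form under the normal truth:
    k_j = const - 0.5 * (1.0 + (mu0 - theta_hat) ** 2)
    gaps.append(k_bar - k_j)

print("average K_bar - K_j:", np.mean(gaps), "   d_j / n:", d_j / n)
```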

Let us consider again the example from Section 3. Take data $Y_1, \cdots, Y_n \sim N(\theta, 1)$ and compare the two models $M_0: N(0,1)$ and $M_1: N(\theta, 1)$. With the same hypotheses as in Section 3.2, we test:

$$H_0: \theta = 0 \quad \text{vs} \quad H_1: \theta \neq 0$$

Standardizing $\bar{Y}$ under $H_0$ gives

$$Z = \frac{\bar{Y} - 0}{\sqrt{\mathrm{Var}(\bar{Y})}} = \sqrt{n}\,\bar{Y}$$

To control the Type I error at $\alpha = 0.05$, by the Z table we reject $H_0$ if $|Z| > z_{\alpha/2} \approx 1.96$ (we take $|Z| > 2$), which implies that we reject $H_0$ if $|\bar{Y}| > \frac{2}{\sqrt{n}}$.

Case 1: BIC

As shown in Section 4.1, $2\log P(y \mid M_i) \approx 2\log L(\hat{\theta}_i \mid y) + 2\log g_i(\tilde{\theta}_i) + |\theta_i|\log(2\pi) - |\theta_i|\log n - \log|I_{\theta_i}|$. However, for comparing the two models we can drop the terms that do not grow with $n$ and take $BIC = \log L(\hat{\theta}_i \mid y) - \frac{|\theta_i|}{2}\log n$. Thus,

For H 0 ,

$$BIC = \log L(0) - \frac{0}{2}\log n = -\frac{n\bar{Y}^2}{2} - \frac{nS^2}{2}$$

and H 1 ,

$$BIC = \log L(\hat{\theta}) - \frac{1}{2}\log n = -\frac{nS^2}{2} - \frac{1}{2}\log n$$

where $S^2 = \frac{1}{n}\sum_i (Y_i - \bar{Y})^2$. If BIC is to prefer $M_1$ as the better model, we need $-\frac{n\bar{Y}^2}{2} - \frac{nS^2}{2} < -\frac{nS^2}{2} - \frac{1}{2}\log n$, in other words $|\bar{Y}| > \sqrt{\frac{\log n}{n}}$. BIC is thus an estimate of a function of the posterior probability of a model under the Bayesian setup.

Case 2: AIC

From Section 4.2, $AIC = 2\, l_j(\hat{\theta}_j) - 2 d_j$. With $S^2 = \frac{1}{n}\sum_i (Y_i - \bar{Y})^2$ as above, the log-likelihood (up to an additive constant) is $l(\theta) = -\frac{n(\bar{Y} - \theta)^2}{2} - \frac{nS^2}{2}$. For comparing the two models we can drop the factor 2 and work with $l(\hat{\theta}) - d$. Thus,

For H 0 ,

$$AIC = l(0) - 0 = -\frac{n\bar{Y}^2}{2} - \frac{nS^2}{2}$$

and H 1 ,

$$AIC = l(\hat{\theta}) - 1 = -\frac{nS^2}{2} - 1$$

If AIC is to prefer $M_1$ as the better model at this point, we need $-\frac{n\bar{Y}^2}{2} - \frac{nS^2}{2} < -\frac{nS^2}{2} - 1$, which implies $|\bar{Y}| > \sqrt{\frac{2}{n}}$. AIC thus estimates a constant plus the relative Kullback-Leibler distance between the fitted likelihood and the unknown true density.
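The three thresholds on $|\bar{Y}|$ derived above can be compared directly; the short sketch below tabulates them for a few sample sizes and shows that the BIC threshold $\sqrt{\log n / n}$ shrinks more slowly than the fixed-level test's $2/\sqrt{n}$, which is exactly the Lindley-paradox gap in this example.

```python
import numpy as np

# Thresholds on |Y_bar| above which each criterion prefers M1 (theta != 0),
# using the formulas derived above.
for n in [10, 100, 1000, 10000, 100000]:
    z_test = 2 / np.sqrt(n)            # reject H0 at (roughly) the 5% level
    aic = np.sqrt(2 / n)               # AIC prefers M1
    bic = np.sqrt(np.log(n) / n)       # BIC prefers M1
    print(f"n={n:6d}  z-test: {z_test:.4f}  AIC: {aic:.4f}  BIC: {bic:.4f}")
```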

The questions of how to choose the best model, and indeed what a best model is, are hard to answer; the controversy has existed for a long time and will no doubt continue. In this paper we have discussed the Bayes factor in hypothesis testing. Bayes factors are clearly being used more and more widely in many fields of statistical research, and alongside them the standard criteria AIC and BIC are natural choices for model selection. However, we should also remember that every method has its own limitations, such as the sensitivity to the prior exposed by Lindley's paradox. Even though both frequentist and Bayesian statisticians have proposed new ideas, these can still be hard to implement or to understand in general, and from a statistical point of view a method also needs to be general enough to apply broadly. For Lindley's paradox, for example, the partial Bayes factor avoids the sensitivity to the prior by using a minimal training sample from the data to construct a prior and then applying it to the rest of the data. The partial Bayes factor does reduce the influence of prior sensitivity to some extent, but finding the minimal training sample can itself be a hard problem. Similarly for the fractional Bayes factor: even though it improves on how the training data are chosen for the partial Bayes factor, it still has many limitations that we need to consider.

The author declares no conflicts of interest regarding the publication of this paper.

Nie, X.T. (2020) Bayes Factor with Lindley Paradox and Two Standard Methods in Model Selection. Open Journal of Statistics, 10, 74-86. https://doi.org/10.4236/ojs.2020.101006