Objective Bayesian inference with proper scoring rules

  • Original Paper
  • Published in TEST

Abstract

Standard Bayesian analyses can be difficult to perform when the full likelihood, and consequently the full posterior distribution, is too complex or even impossible to specify, or when robustness with respect to the data or to model misspecification is required. In these situations, we suggest resorting to a posterior distribution for the parameter of interest based on proper scoring rules. Scoring rules are loss functions designed to measure the quality of a probability distribution for a random variable, given its observed value. Important examples are the Tsallis score and the Hyvärinen score, which allow us to deal with model misspecifications or with complex models. The full and the composite likelihoods are also special instances of scoring rules. The aim of this paper is twofold. Firstly, we discuss the use of scoring rules in the Bayes formula in order to compute a posterior distribution, named the SR-posterior distribution, and we derive its asymptotic normality. Secondly, we propose a procedure for building default priors for the unknown parameter of interest that can be used to update the information provided by the scoring rule in the SR-posterior distribution. In particular, a reference prior is obtained by maximizing the average \(\alpha \)-divergence from the SR-posterior distribution. For \(0 \le |\alpha |<1\), the result is a Jeffreys-type prior that is proportional to the square root of the determinant of the Godambe information matrix associated with the scoring rule. Some examples are discussed.
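
To fix ideas, the following minimal sketch (not the authors' code: the model, the flat prior and the Tsallis index \(\gamma =1.3\) are illustrative assumptions) computes an SR-posterior \(\pi _{SR}(\theta |x)\propto \pi (\theta )\exp \{-S(\theta )\}\) on a grid, using the Tsallis score for a normal location model. For a location model the Godambe information does not depend on \(\theta \), so the flat prior coincides with the Jeffreys-type prior proportional to \(|G(\theta )|^{1/2}\); the curvature adjustment \(\theta ^*=\tilde{\theta }+C(\theta -\tilde{\theta })\) used in the proofs below is omitted here.

import numpy as np

# Minimal sketch: SR-posterior from the Tsallis score for a N(theta, 1) model.
# One common parametrization of the Tsallis score of a density q is
#   s(x; q) = (gamma - 1) * int q(y)^gamma dy - gamma * q(x)^(gamma - 1),  gamma > 1.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
x[:5] += 8.0                      # a few gross outliers; the Tsallis score downweights them

gamma = 1.3

def total_score(theta):
    q_x = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
    int_q_gamma = (2 * np.pi) ** ((1 - gamma) / 2) / np.sqrt(gamma)  # int N(y; theta, 1)^gamma dy
    return np.sum((gamma - 1) * int_q_gamma - gamma * q_x ** (gamma - 1))

grid = np.linspace(-2.0, 2.0, 2001)
log_post = np.array([-total_score(t) for t in grid])   # flat prior, so log pi_SR = -S + const
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)                           # normalize on the grid

print("SR-posterior mode:", grid[np.argmax(post)])     # close to 0 despite the outliers
print("sample mean:", x.mean())                        # pulled toward the outliers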

References

  • Barndorff-Nielsen OE (1976) Plausibility inference. J R Stat Soc B 38:103–131

  • Basu A, Harris IR, Hjort NL, Jones MC (1998) Robust and efficient estimation by minimising a density power divergence. Biometrika 85:549–559

  • Berger JO (2006) The case for objective Bayesian analysis. Bayesian Anal 1:1–17

  • Berger JO, Bernardo JM (1992) On the development of reference priors (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics. Oxford University Press, Oxford

  • Berger JO, Bernardo JM, Sun D (2009) The formal definition of reference priors. Ann Stat 37:905–938

  • Berger JO, Bernardo JM, Sun D (2015) Overall objective priors. Bayesian Anal 10:189–221

  • Bernardo JM (1979) Reference posterior distributions for Bayesian inference (with discussion). J R Stat Soc B 41:113–147

  • Bernardo JM (2005) Reference analysis. In: Dey DK, Rao CR (eds) Handbook of Bayesian statistics, vol 25. Elsevier, North-Holland, pp 17–90

  • Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78:1–3

  • Chandler RE, Bate S (2007) Inference for clustered data using the independence loglikelihood. Biometrika 94:167–183

  • Chang I, Mukerjee R (2008) Bayesian and frequentist confidence intervals arising from empirical-type likelihoods. Biometrika 95:139–147

  • Corcuera JM, Giummolè F (1998) A characterization of monotone and regular divergences. Ann Inst Stat Math 50:433–450

  • Datta GS, Mukerjee R (2004) Probability matching priors: higher order asymptotics. Lecture Notes in Statistics 178. Springer

  • Dawid AP (1986) Probability forecasting. In: Kotz S, Johnson NL, Read CB (eds) Encyclopedia of statistical sciences. Wiley, New York, pp 210–218

  • Dawid AP (2007) The geometry of proper scoring rules. Ann Inst Stat Math 59:77–93

  • Dawid AP, Lauritzen SL (2005) The geometry of decision theory. In: Proceedings of the second international symposium on information geometry and its applications, pp 22–28. University of Tokyo

  • Dawid AP, Musio M (2014) Theory and applications of proper scoring rules. Metron 72:169–183

  • Dawid AP, Musio M (2015) Bayesian model selection based on proper scoring rules (with discussion). Bayesian Anal 10:479–521

  • Dawid AP, Musio M, Ventura L (2016) Minimum scoring rule inference. Scand J Stat 43:123–138

  • Farcomeni A, Ventura L (2012) An overview of robust methods in medical research. Stat Methods Med Res 21:111–133

  • Godambe VP (1960) An optimum property of regular maximum likelihood estimation. Ann Math Stat 31:1208–1211

  • Good IJ (1952) Rational decisions. J R Stat Soc B 14:107–114

  • Ghosh M (2011) Objective priors: an introduction for frequentists. Stat Sci 26:187–202

  • Ghosh M, Basu A (2013) Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression. Electron J Stat 7:2420–2456

  • Ghosh M, Basu A (2016) Robust Bayes estimation using the density power divergence. Ann Inst Stat Math 68:413–437

  • Ghosh M, Mergel V, Liu R (2011) A general divergence criterion for prior selection. Ann Inst Stat Math 63:43–58

  • Ghosh JK, Mukerjee R (1991) Characterization of priors under which Bayesian and frequentist Bartlett corrections are equivalent in the multi-parameter case. J Mult Anal 38:385–393

  • Greco L, Racugno W, Ventura L (2008) Robust likelihood functions in Bayesian inference. J Stat Plan Inf 138:1258–1270

  • Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York

  • Heritier S, Cantoni E, Copt S, Victoria-Feser MP (2009) Robust methods in biostatistics. Wiley, Chichester

  • Huber PJ, Ronchetti EM (2009) Robust statistics. Wiley, New York

  • Hyvärinen A (2005) Estimation of non-normalized statistical models by score matching. J Mach Learn Res 6:695–709

  • Hyvärinen A (2007) Some extensions of score matching. Comput Stat Data Anal 51:2499–2512

  • Jeffreys H (1961) Theory of probability. Oxford University Press, New York

  • Lazar NA (2003) Bayes empirical likelihood. Biometrika 90:319–326

  • Leisen F, Villa C, Walker SG (2017) On a global objective prior from score rules. arXiv:1706.00599v1

  • Lin L (2006) Quasi Bayesian likelihood. Stat Methodol 3:444–455

  • Liu R, Chakrabarti A, Samanta T, Ghosh JK, Ghosh M (2014) On divergence measures leading to Jeffreys and other reference priors. Bayesian Anal 9:331–370

  • Machete R (2013) Contrasting probabilistic scoring rules. J Stat Plan Inf 143:1781–1790

  • Mameli V, Musio M, Ventura L (2018) Bootstrap adjustments of signed scoring rule root statistics. Commun Stat Simul Comput 47:1204–1215

  • Mameli V, Ventura L (2015) Higher-order asymptotics for scoring rules. J Stat Plan Inf 165:13–26

  • Mardia KV, Jupp PE (2000) Directional statistics. Wiley, London

  • Mardia KV, Kent JT, Laha AK (2016) Score matching estimators for directional distributions. arXiv:1604.08470v1

  • Musio M, Mameli V, Ruli E, Ventura L (2017) Bayesian inference for directional data through ABC and homogeneous proper scoring rules. In: Proceedings of the 61st ISI World Statistics Congress, to appear

  • Pace L, Salvan A, Sartori N (2011) Adjusting composite likelihood ratio statistics. Stat Sin 21:129–148

  • Parry M, Dawid AP, Lauritzen SL (2012) Proper local scoring rules. Ann Stat 40:561–592

  • Pauli F, Racugno W, Ventura L (2011) Bayesian composite marginal likelihoods. Stat Sin 21:149–164

  • Riani M, Atkinson AC, Perrotta D (2014) A parametric framework for the comparison of methods of very robust regression. Stat Sci 29:128–143

  • Ribatet M, Cooley D, Davison AC (2012) Bayesian inference from composite likelihoods, with an application to spatial extremes. Stat Sin 22:813–845

  • Ruli E, Sartori N, Ventura L (2016) Approximate Bayesian Computation with composite score functions. Stat Comput 26:679–692

  • Schennach S (2005) Bayesian exponentially tilted empirical likelihood. Biometrika 92:31–46

  • Smith EL, Stephenson AG (2009) An extended Gaussian max-stable process model for spatial extremes. J Stat Plan Inf 139:1266–1275

  • Tsallis C (1988) Possible generalization of Boltzmann–Gibbs statistics. J Stat Phys 52:479–487

  • Varin C, Reid N, Firth D (2011) An overview of composite likelihood methods. Stat Sin 21:5–42

  • Ventura L, Cabras S, Racugno W (2010) Default prior distributions from quasi- and quasi-profile likelihoods. J Stat Plan Inf 140:2937–2942

  • Ventura L, Racugno W (2016) Pseudo-likelihoods for Bayesian inference. Topics on methodological and applied statistical inference, series studies in theoretical and applied statistics. Springer, Berlin, pp 205–220

  • Walker SG (2016) Bayesian information in an experiment and the Fisher information distance. Stat Probab Lett 112:5–9

  • Yang Y, He X (2012) Bayesian empirical likelihood for quantile regression. Ann Stat 40:1102–1131

Author information

Corresponding author

Correspondence to E. Ruli.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5845 KB)

Appendix

1.1 Proof of Theorem 3.1

Proof

The posterior (5) can equivalently be written as

$$\begin{aligned} \pi _{SR} (\theta |x)=\frac{\pi (\theta ) \exp {\{-S (\theta ^*)+S(\tilde{\theta })\}}}{\int _\Theta \pi (\theta ) \exp {\{-S (\theta ^*)+S(\tilde{\theta })\}} {\hbox {d}}\theta }, \end{aligned}$$

with \(\theta ^*=\tilde{\theta }+C(\theta -\tilde{\theta })\) and C fixed such that \(C^TK(\theta )C=G(\theta )\).
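
The existence of such a C is all the proof requires; for concreteness, one way to construct it numerically is through Cholesky factors, as in the following illustrative sketch (K and G below are arbitrary symmetric positive definite stand-ins, not estimated quantities).

import numpy as np

# If K = L_K L_K^T and G = L_G L_G^T, then C = L_K^{-T} L_G^T satisfies C^T K C = G.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)); K = A @ A.T + 3.0 * np.eye(3)   # stand-in for K(theta)
B = rng.normal(size=(3, 3)); G = B @ B.T + 3.0 * np.eye(3)   # stand-in for G(theta)

L_K = np.linalg.cholesky(K)
L_G = np.linalg.cholesky(G)
C = np.linalg.solve(L_K.T, L_G.T)     # solves L_K^T C = L_G^T, i.e. C = L_K^{-T} L_G^T

print(np.allclose(C.T @ K @ C, G))    # True: C^T K C = G as required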

Let \(w=(w^1,\ldots ,w^d)^T=n^{1/2}(\theta -\tilde{\theta })\). Then, \(\theta =\tilde{\theta }+n^{-1/2}w\) and \(\theta ^*=\tilde{\theta }+n^{-1/2}Cw\). A posterior for w is

$$\begin{aligned} \pi _{SR} (w|x)=\frac{b(w,x)}{\int b(w,x) {\hbox {d}}w}, \end{aligned}$$

with

$$\begin{aligned} b(w,x)= \pi (\tilde{\theta }+n^{-1/2}w) \exp {\{-S (\tilde{\theta }+n^{-1/2}Cw)+S(\tilde{\theta })\}}. \end{aligned}$$

Now

$$\begin{aligned} \pi (\tilde{\theta }+n^{-1/2}w)= \tilde{\pi }\left( 1+n^{-1/2}R_1(w)+\frac{1}{2}n^{-1}R_2(w)\right) +O_p(n^{-3/2}), \end{aligned}$$

with

$$\begin{aligned} R_1(w)=\frac{\tilde{\pi }_i}{\tilde{\pi }}w^i \quad \quad \text{ and } \quad \quad R_2(w)=\frac{\tilde{\pi }_{ij}}{\tilde{\pi }}w^{ij}, \end{aligned}$$

where \(w^{ij\ldots }=w^iw^j\ldots \) is a product of components of w. Moreover,

$$\begin{aligned} -S (\tilde{\theta }+n^{-1/2}Cw)+S(\tilde{\theta })= & {} -n^{-1}\left( \frac{1}{2}(Cw)^{ij}\tilde{S}_{ij} +\frac{1}{6}n^{-1/2}(Cw)^{ijk}\tilde{S}_{ijk}\right. \nonumber \\&\left. +\,\frac{1}{24}n^{-1}(Cw)^{ijkh}\tilde{S}_{ijkh}\right) +O_p(n^{-3/2}) \nonumber \\= & {} -\,\frac{1}{2}n^{-1}w^T\left( C^T\frac{\partial ^2\tilde{S}}{\partial \theta \partial \theta ^T}C\right) w-\frac{1}{6}n^{-1/2}R_3(w)\nonumber \\&-\,\frac{1}{24}n^{-1}R_4(w) +O_p(n^{-3/2})\nonumber \\= & {} -\,\frac{1}{2}w^T\tilde{H} w-\frac{1}{6}n^{-1/2}R_3(w)-\frac{1}{24}n^{-1}R_4(w)\nonumber \\&+\,O_p(n^{-3/2}), \end{aligned}$$
(15)

where

$$\begin{aligned} R_3(w)=n^{-1}(Cw)^{ijk}\tilde{S}_{ijk} \quad \quad \text{ and } \quad \quad R_4(w)=n^{-1}(Cw)^{ijkh}\tilde{S}_{ijkh}. \end{aligned}$$
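
For clarity, the matrix \(\tilde{H}\) implicitly defined by the second equality in (15) is

$$\begin{aligned} \tilde{H}=C^T\left( \frac{1}{n}\frac{\partial ^2\tilde{S}}{\partial \theta \partial \theta ^T}\right) C, \end{aligned}$$

so that, under the standard assumption \(n^{-1}\partial ^2\tilde{S}/\partial \theta \partial \theta ^T \rightarrow K(\theta )\) in probability, the constraint \(C^TK(\theta )C=G(\theta )\) gives \(\tilde{H}\rightarrow G(\theta )\); this fact is used again in the proof of Theorem 4.1 below.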

The numerator in the Bayes formula can thus be written as

$$\begin{aligned} b(w,x)= & {} \tilde{\pi }\left( 1+n^{-1/2}R_1(w) +\frac{1}{2}n^{-1} R_2(w)\right) \nonumber \\&\exp {\left\{ -\frac{1}{2}w^T\tilde{H} w \right\} }\left( 1{-}\frac{1}{6}n^{-1/2}R_3(w){-}\frac{1}{24}n^{-1}R_4(w) {+}\frac{1}{72}n^{-1}R_3(w)^2\right) \nonumber \\&+O_p(n^{-3/2})= \tilde{\pi }\exp {\left\{ -\frac{1}{2}w^T\tilde{H} w \right\} }\left[ 1+n^{-1/2}\left( R_1(w)-\frac{1}{6}R_3(w)\right) \right. \nonumber \\&\left. +\,n^{-1}\left( \frac{1}{2}R_2(w) -\frac{1}{6}R_1(w)R_3(w)-\frac{1}{24}R_4(w)+ \frac{1}{72}R_3(w)^2 \right) \right] \nonumber \\&+\,O_p(n^{-3/2}). \end{aligned}$$
(16)

The denominator can be approximated using the moments of the d-variate normal distribution \(N_d(0,\tilde{H}^{-1})\). We have that

$$\begin{aligned} \int b(w,x) dw= & {} \tilde{\pi }(2\pi )^{d/2}|\tilde{H}|^{-1/2} \left[ 1+n^{-1}\left( \frac{1}{2}E(R_2(W)) -\frac{1}{6}E(R_1(W)R_3(W))\right. \right. \nonumber \\&\left. \left. -\frac{1}{24}E(R_4(W))+ \frac{1}{72}E(R_3(W)^2) \right) \right] + o(n^{-1}), \end{aligned}$$
(17)

since \(E(R_1(W))\) and \(E(R_3(W))\) involve only odd moments of W and are thus equal to zero. Now,

$$\begin{aligned} E(R_2(W))= & {} \frac{\tilde{\pi }_{ij}}{\tilde{\pi }} \tilde{h}^{ij} , \quad \quad E(R_1(W)R_3(W))= 3\frac{\tilde{\pi }_{i}}{\tilde{\pi }} \frac{\tilde{S}_{jkh}}{n}c_{jr}c_{ks}c_{ht}\tilde{h}^{ir}\tilde{h}^{st} ,\\ E(R_4(W))= & {} 3 \frac{\tilde{S}_{ijkh}}{n}c_{ir}c_{js}c_{kt}c_{hu}\tilde{h}^{rs}\tilde{h}^{tu} \end{aligned}$$

and

$$\begin{aligned} E(R_3(W)^2)= & {} \frac{\tilde{S}_{ijk}\tilde{S}_{rst}}{n^2}(9c_{ia}c_{jb}c_{kc}c_{rd}c_{se}c_{tf}\tilde{h}^{ab}\tilde{h}^{cd}\tilde{h}^{ef}\\&+6c_{ia}c_{rb}c_{jc}c_{sd}c_{ke}c_{tf}\tilde{h}^{ab}\tilde{h}^{cd}\tilde{h}^{ef}). \end{aligned}$$
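
These expectations follow from the even moments of \(W\sim N_d(0,\tilde{H}^{-1})\) (Isserlis' theorem):

$$\begin{aligned} E(W^iW^j)=\tilde{h}^{ij}, \quad \quad E(W^iW^jW^kW^h)=\tilde{h}^{ij}\tilde{h}^{kh}+\tilde{h}^{ik}\tilde{h}^{jh}+\tilde{h}^{ih}\tilde{h}^{jk}, \end{aligned}$$

while a sixth-order moment is a sum over the 15 pairings of the six indices. By the symmetry of \(\tilde{S}_{ijkh}\), the three pairings in \(E(R_4(W))\) contribute equally, giving the factor 3; in \(E(R_3(W)^2)\), the 15 pairings split into 9 with one pair inside each triple of indices and 6 that pair the two triples index by index, giving the factors 9 and 6.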

Putting together expressions (16) and (17), we finally obtain the result with

$$\begin{aligned} A_1(w)=R_1(w)-\frac{1}{6}R_3(w) \end{aligned}$$

and

$$\begin{aligned} A_2(w)= & {} \frac{1}{2}\left( R_2(w)-E(R_2(W))\right) -\frac{1}{6}\left( R_1(w)R_3(w)-E(R_1(W)R_3(W))\right) \\&-\,\frac{1}{24}\left( R_4(w)-E(R_4(W))\right) +\frac{1}{72}\left( R_3(w)^2-E(R_3(W)^2)\right) . \end{aligned}$$

\(\square \)

1.2 Proof of Theorem 4.1

Proof

The proof follows the same steps as in Liu et al. (2014), Section 3, generalized to the use of the SR-posterior (5) in place of the classical posterior distribution.

Let \(\pi (\theta )\) be a prior distribution for \(\theta \) and \(p(x|\theta )\) be the conditional distribution of X given \(\theta \). Thus, the marginal distribution of X can be written as \(p(x)=p(x|\theta )\pi (\theta )/\pi (\theta |x)\), with \(\pi (\theta |x)\) the conditional distribution of \(\theta \) given x. The functional (8) associated with an \(\alpha \)-divergence between the prior \(\pi (\theta )\) and the SR-posterior (5) can be written as

$$\begin{aligned} T(\pi )= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1- \int _{\mathcal {X}} \left( \int _{\Theta } \pi (\theta )^\alpha \pi _{SR}(\theta |x)^{1-\alpha } d\theta \right) p(x) {\hbox {d}}x \right] \nonumber \\= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1- \int _{\Theta } \left( \int _{\mathcal {X}} \pi (\theta )^\alpha \pi _{SR}(\theta |x)^{1-\alpha } \pi (\theta |x)^{-1}p(x|\theta ){\hbox {d}}x \right) \pi (\theta ){\hbox {d}}\theta \right] \nonumber \\= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1-\int _{\Theta } \pi (\theta )^{\alpha +1}E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right] {\hbox {d}}\theta \right] , \end{aligned}$$
(18)

where \(E_\theta (\cdot )\) denotes expectation with respect to the conditional distribution of X given \(\theta \). For \(\alpha =0\) or 1, we need to interpret \(T(\pi )\) as its limiting value, when it exists.

Now, we apply the shrinkage argument to evaluate the conditional expectation in (18), \(E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right] \); see Datta and Mukerjee (2004) for details on the shrinkage method.

Let us first consider the case of \(0<|\alpha |<1\).

The first step of the shrinkage method involves fixing a suitable prior distribution \(\bar{\pi }(\theta )\) and calculating the expected value of \(\pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}\) with respect to the corresponding posterior \(\bar{\pi }(\theta |x)\), i.e. \(E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}|x\right] \). Notice that here we use the classical posterior based on the prior \(\bar{\pi }(\theta )\), i.e. the conditional distribution of \(\theta \) given x, \(\bar{\pi }(\theta |x)\). Using the asymptotic expansion (7) for \(\pi _{SR}(\theta |x)\) up to first order, and similar expansions for \(\bar{\pi }(\theta |x)\) and \(\pi (\theta |x)\), we obtain

$$\begin{aligned} \pi _{SR}(\theta |X)^{1-\alpha }=\left( \frac{n^d|\tilde{H}|}{(2\pi )^d}\right) ^{(1-\alpha )/2} \exp {\left\{ -\frac{n(1-\alpha )}{2}(\theta -\tilde{\theta })^T\tilde{H}(\theta -\tilde{\theta })\right\} }+O_p(n^{-1/2}) \end{aligned}$$

and

$$\begin{aligned} \pi (\theta |X)^{-1}\bar{\pi }(\theta |X)=1 +O_p(n^{-1/2}). \end{aligned}$$

Since \(\tilde{H}\) tends to G as \(n\rightarrow \infty \), we have that

$$\begin{aligned} E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}|x\right]= & {} \left( \frac{n^d|\tilde{H}|}{(2\pi )^d}\right) ^{(1-\alpha )/2}\\&\int _{\Theta } \exp {\left\{ -\frac{n(1-\alpha )}{2}(\theta -\tilde{\theta })^T\tilde{H}(\theta -\tilde{\theta })\right\} } {\hbox {d}}\theta \\&+\,O(n^{-1/2})\\= & {} \left( \frac{n^d|G(\theta )|}{(2\pi )^d}\right) ^{-\alpha /2}(1-\alpha )^{-d/2}+O(n^{-1/2}). \end{aligned}$$
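
The integral above is a standard Gaussian integral: for \(\alpha <1\) the exponent is negative definite, so that, up to a negligible boundary contribution,

$$\begin{aligned} \int _{\Theta } \exp {\left\{ -\frac{n(1-\alpha )}{2}(\theta -\tilde{\theta })^T\tilde{H}(\theta -\tilde{\theta })\right\} } {\hbox {d}}\theta = \left( \frac{2\pi }{n(1-\alpha )}\right) ^{d/2}|\tilde{H}|^{-1/2}; \end{aligned}$$

multiplying by the factor \((n^d|\tilde{H}|/(2\pi )^d)^{(1-\alpha )/2}\) and replacing \(\tilde{H}\) with its limit \(G(\theta )\) yields the last expression.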

The second step of the shrinkage argument requires integrating again with respect to the distribution of X given \(\theta \) and with respect to the prior \(\bar{\pi }(\theta )\), thus obtaining

$$\begin{aligned}&\int _\Theta E_\theta \left[ E^{\bar{\pi }}\left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}|X\right] \right] \bar{\pi }(\theta ) {\hbox {d}}\theta \\&\quad =\int _\Theta \left( \frac{n^d|G(\theta )|}{(2\pi )^d}\right) ^{-\alpha /2}(1-\alpha )^{-d/2} \, \bar{\pi }(\theta ) \, {\hbox {d}}\theta +O(n^{-1/2}). \end{aligned}$$

Finally, by letting the prior \(\bar{\pi }(\theta )\) shrink to a point mass at \(\theta \), we obtain

$$\begin{aligned} E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right]= & {} \left( \frac{n}{2\pi }\right) ^{-d\alpha /2}|G(\theta )|^{-\alpha /2}(1-\alpha )^{-d/2}+O(n^{-1/2}).\nonumber \\ \end{aligned}$$
(19)

For \(0<|\alpha |<1\), by substituting (19) in (18) we can see that the selection of a prior \(\pi (\theta )\) corresponds to the minimization with respect to \(\pi (\theta )\) of the functional

$$\begin{aligned} \frac{1}{\alpha (1-\alpha )}\int _\Theta \pi (\theta )^{\alpha +1}|G(\theta )|^{-\alpha /2}d\theta . \end{aligned}$$
(20)

Notice that the preceding expression can be interpreted as an increasing transformation of the \((-\alpha )\)-divergence between a density that is proportional to \(|G(\theta )|^{1/2}\) and the prior \(\pi (\theta )\). Thus, it is minimized if and only if the two densities coincide.

This result cannot be extended to the case of \(\alpha \ge 1\), as is evident from the right-hand side of (19). For \(\alpha <-1\), (20) turns out to be equivalent to a decreasing transformation of the \((-\alpha )\)-divergence between a density that is proportional to \(|G(\theta )|^{1/2}\) and the prior \(\pi (\theta )\). Thus, a maximizer for the expected \(\alpha \)-divergence in this case does not exist.
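
To make these monotonicity claims explicit, write \(q(\theta )=|G(\theta )|^{1/2}/c\), with \(c=\int _\Theta |G(\theta )|^{1/2}{\hbox {d}}\theta \) assumed finite. Since \(\int _\Theta q(\theta )^{-\alpha }\pi (\theta )^{1+\alpha }{\hbox {d}}\theta =1+\alpha (1+\alpha )D_{-\alpha }(q,\pi )\), where \(D_{-\alpha }(q,\pi )=\{1-\int _\Theta q(\theta )^{-\alpha }\pi (\theta )^{1+\alpha }{\hbox {d}}\theta \}/\{(-\alpha )(1+\alpha )\}\) is the \((-\alpha )\)-divergence in the form used in (18), the functional (20) equals

$$\begin{aligned} \frac{c^{-\alpha }}{\alpha (1-\alpha )}+c^{-\alpha }\,\frac{1+\alpha }{1-\alpha }\,D_{-\alpha }(q,\pi ). \end{aligned}$$

The coefficient \(c^{-\alpha }(1+\alpha )/(1-\alpha )\) is positive for \(0<|\alpha |<1\), so that (20) is minimized exactly when \(D_{-\alpha }(q,\pi )=0\), i.e. when \(\pi (\theta )\propto |G(\theta )|^{1/2}\); it is negative for \(\alpha <-1\), which gives the decreasing transformation mentioned above.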

For \(\alpha \rightarrow 0\), following the same steps as for \(0<|\alpha |<1\), it can be easily shown that maximization of the average Kullback–Leibler divergence is asymptotically equivalent to minimization of

$$\begin{aligned} \int _\Theta \pi (\theta ) \log \left( \frac{\pi (\theta )}{|G(\theta )|^{1/2}}\right) {\hbox {d}}\theta , \end{aligned}$$

which is attained by choosing again a Jeffreys-type prior proportional to \(|G(\theta )|^{1/2}\). Indeed, the preceding expression equals, up to an additive constant, the Kullback–Leibler divergence between the prior \(\pi (\theta )\) and a density proportional to \(|G(\theta )|^{1/2}\).
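
Explicitly, with \(q(\theta )=|G(\theta )|^{1/2}/c\) as above,

$$\begin{aligned} \int _\Theta \pi (\theta ) \log \left( \frac{\pi (\theta )}{|G(\theta )|^{1/2}}\right) {\hbox {d}}\theta = \int _\Theta \pi (\theta ) \log \left( \frac{\pi (\theta )}{q(\theta )}\right) {\hbox {d}}\theta - \log c, \end{aligned}$$

so the minimization is equivalent to minimizing the Kullback–Leibler divergence of \(\pi \) from q, which vanishes only at \(\pi =q\).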

The case \(\alpha =-1\) corresponds to the Chi-square divergence. In this case, the proof requires higher-order terms in the expansion of the SR-posterior distribution, as well as of both conditional distributions of \(\theta \) given x computed under the priors \(\pi (\theta )\) and \(\bar{\pi }(\theta )\). Thus, we use (7) up to order \(O_p(n^{-3/2})\). In particular, let \(w=n^{1/2}(\theta -\tilde{\theta })\) and \(w_1=n^{1/2}(\theta -\hat{\theta })\), where \(\tilde{\theta }\) is the minimum scoring rule estimator and \(\hat{\theta }\) the maximum likelihood estimator for \(\theta \). The different posterior distributions for w and \(w_1\) given x can be written as

$$\begin{aligned} \pi _{SR} (w|x)= & {} \phi _d(w;\tilde{H}^{-1}) \left[ 1+n^{-1/2}A_1(w) + n^{-1}A_2(w)\right] +O_p(n^{-3/2}),\\ \pi (w_1|x)= & {} \phi _d(w_1;(\hat{J}^\ell )^{-1}) \left[ 1+n^{-1/2}A^\ell _1(w_1) + n^{-1}A^\ell _2(w_1)\right] +O_p(n^{-3/2}),\\ \bar{\pi }(w_1|x)= & {} \phi _d(w_1;(\hat{J}^\ell )^{-1}) \left[ 1+n^{-1/2}\bar{A}^\ell _1(w_1) + n^{-1}\bar{A}^\ell _2(w_1)\right] +O_p(n^{-3/2}), \end{aligned}$$

where \(A_1\) and \(A_2\) are defined as in Theorem 3.1, \(J^\ell (\theta )=(-\partial ^2\ell /\partial \theta \partial \theta ^T)/n\) is the observed information and \(A^\ell _1\), \(\bar{A}^\ell _1\), \(A^\ell _2\), \(\bar{A}^\ell _2\) are obtained by calculating \(A_1\) and \(A_2\) with the logarithmic score as scoring rule and \(\pi \) and \(\bar{\pi }\) as priors; for instance,

$$\begin{aligned} A^\ell _1(w_1)= \frac{\hat{\pi }_i}{\hat{\pi }}w_1^i+\frac{1}{6}\frac{\hat{\ell }_{ijk}}{n}w_1^iw_1^jw_1^k, \quad \text{ and }\quad \bar{A}^\ell _1(w_1)= \frac{\hat{\bar{\pi }}_i}{\hat{\bar{\pi }}}w_1^i+\frac{1}{6}\frac{\hat{\ell }_{ijk}}{n}w_1^iw_1^jw_1^k, \end{aligned}$$

with \(\ell \) the log-likelihood and \(\ell _{ijk}\) its third-order derivatives. As usual, a tilde or a hat over a quantity indicates that it is evaluated at \(\tilde{\theta }\) or \(\hat{\theta }\), respectively.

Following the shrinkage argument, we first need to evaluate

$$\begin{aligned} E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^2 \pi (\theta |x)^{-1}|x\right]= & {} n^{d/2} \int \pi _{SR}(w|x)^2 \pi (w|x)^{-1} \bar{\pi }(w|x) {\hbox {d}}w. \end{aligned}$$

To do this, we use the fact that

$$\begin{aligned} \pi _{SR} (w|x)^2= & {} (2\pi )^{-d/2} |\tilde{H}|^{1/2} 2^{-d/2} \phi _d(w;\tilde{H}^{-1}/2) \\&\left[ 1+n^{-1/2}2A_1(w) + n^{-1}(2A_2(w)+A_1(w)^2)\right] +O_p(n^{-3/2}) \end{aligned}$$

and

$$\begin{aligned} \pi (w_1|x)^{-1} \bar{\pi }(w_1|x)= & {} 1+n^{-1/2}(\bar{A}^\ell _1(w_1)-A^\ell _1(w_1)) \\&+\, n^{-1}(A^\ell _1(w_1)^2+\bar{A}^\ell _2(w_1)-A^\ell _2(w_1)-\bar{A}^\ell _1(w_1) A^\ell _1(w_1)) \\&+\,O_p(n^{-3/2}) \end{aligned}$$

and we evaluate the last expression at \(w_1=w+n^{1/2}(\tilde{\theta }-\hat{\theta })\).
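
The first identity is simply the square of the d-variate normal density:

$$\begin{aligned} \phi _d(w;\tilde{H}^{-1})^2=\frac{|\tilde{H}|}{(2\pi )^{d}}\exp {\{-w^T\tilde{H}w\}} =(2\pi )^{-d/2} |\tilde{H}|^{1/2}\, 2^{-d/2}\, \phi _d(w;\tilde{H}^{-1}/2). \end{aligned}$$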

After tedious calculations, we obtain

$$\begin{aligned}&E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^2 \pi (\theta |x)^{-1}|x\right] \\&\quad = (2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |\tilde{H}|^{1/2} \left\{ 1 + \frac{1}{n^{1/2}} \left( \frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} -\frac{\hat{\pi }_{i}}{\hat{\pi }} \right) n^{1/2}(\tilde{\theta }-\hat{\theta })^i \right. \\&\quad \left. +\, \frac{1}{2n} \left[ \left( -\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} +\frac{\hat{\pi }_{ij}}{\hat{\pi }} \right) \hat{\varvec{j}}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( -\frac{\tilde{\pi }_{ij}}{\tilde{\pi }}+\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\tilde{\pi }_{j}}{\tilde{\pi }} +2\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} -2\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\hat{\pi }_{j}}{\hat{\pi }} +\frac{\hat{\pi }_{i}}{\hat{\pi }}\frac{\hat{\pi }_{j}}{\hat{\pi }} -\frac{\hat{\pi }_{i}}{\hat{\pi }}\frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} +\frac{1}{2}\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} -\frac{1}{2}\frac{\hat{\pi }_{ij}}{\hat{\pi }} \right) \tilde{h}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( \frac{\tilde{\pi }_{i}}{\tilde{\pi }} -\frac{1}{2}\frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} +\frac{1}{2}\frac{\hat{\pi }_{i}}{\hat{\pi }}\right) \frac{\tilde{S}_{khl}}{n} c_{kj}c_{hs}c_{lt}\tilde{h}^{st}\tilde{h}^{ij} +\left( -\frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} +\frac{\hat{\pi }_{i}}{\hat{\pi }}\right) \frac{\tilde{\ell }_{jhk}}{n} \hat{\varvec{j}}^{hk}\hat{\varvec{j}}^{ij} \right. \right. \\&\quad \left. \left. +\, \left( -\frac{\hat{\pi }_{ij}}{\hat{\pi }}+\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} +2\frac{\hat{\pi }_{i}}{\hat{\pi }} \frac{\hat{\pi }_{j}}{\hat{\pi }} -2\frac{\hat{\pi }_{i}}{\hat{\pi }} \frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} \right) n(\tilde{\theta }-\hat{\theta })^{ij} \right] + R \right\} + O(n^{-1}), \end{aligned}$$

where \(\hat{\varvec{j}}^{ij}\) are the components of \((\hat{J}^\ell )^{-1}\) and \(R=R(\tilde{\theta },\hat{\theta })\) is a function that involves neither the prior nor its derivatives.

Now, let \(E_\theta (S_{ijk}/n)=B^S_{ijk}(\theta )+o(n^{-1/2})\), \(E_\theta (\ell _{ijk}/n)=B^\ell _{ijk}(\theta )+o(n^{-1/2})\) and recall that \(E_\theta (\tilde{H})=G(\theta )+o(n^{-1/2})\) and \(E_\theta (\hat{J}^\ell )=I(\theta )+o(n^{-1/2})\), where I is the expected information matrix. Moreover, note that \(E_\theta [(\tilde{\theta }-\hat{\theta })^i ]=o(n^{-1/2})\) and \(E_\theta [n(\tilde{\theta }-\hat{\theta })^{ij}]=g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}+o(n^{-1/2})\), where \(g_{ij}\) and \(g^{ij}\) are the components of G and \(G^{-1}\), respectively, \({\varvec{i}}^{ij}\) are the components of \(I^{-1}\) and \(\sigma ^{ij}=nCov_\theta (\tilde{\theta },\hat{\theta })^{ij}\). Furthermore, let \(a^S_j=B^S_{khl}c_{kj}c_{hs}c_{lt}g^{st}\), \(a^\ell _j=B^\ell _{jhk}{\varvec{i}}^{hk}\). By integrating again the above expression with respect to the distribution of X given \(\theta \), we get

$$\begin{aligned}&E_\theta \left[ E^{\bar{\pi }}\left[ \pi _{SR}(\theta |X)^2 \pi (\theta |X)^{-1}|X\right] \right] \\&\quad = (2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |G|^{1/2} \left\{ 1 + \frac{1}{2n} \left[ \left( -\frac{\bar{\pi }_{ij}}{\bar{\pi }} +\frac{\pi _{ij}}{\pi } \right) {\varvec{i}}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( -\frac{3}{2}\frac{\pi _{ij}}{\pi }+\frac{\pi _{i}}{\pi }\frac{\bar{\pi }_{j}}{\bar{\pi }} +\frac{1}{2}\frac{\bar{\pi }_{ij}}{\bar{\pi }} \right) g^{ij} \right. \right. \\&\quad \left. \left. +\,\left( \frac{3}{2}\frac{\pi _{i}}{\pi } -\frac{1}{2}\frac{\bar{\pi }_{i}}{\bar{\pi }} \right) a^S_j g^{ij} -\left( \frac{\bar{\pi }_{i}}{\bar{\pi }} -\frac{\pi _{i}}{\pi }\right) a^\ell _j {\varvec{i}}^{ij} \right. \right. \\&\quad \left. \left. +\, \left( -\frac{\pi _{ij}}{\pi }+\frac{{\bar{\pi }}_{ij}}{{\bar{\pi }}} +2\frac{{\pi }_{i}}{{\pi }} \frac{{\pi }_{j}}{{\pi }} -2\frac{{\pi }_{i}}{{\pi }} \frac{{\bar{\pi }}_{j}}{{\bar{\pi }}} \right) (g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}) \right] + F \right\} + O(n^{-1}), \end{aligned}$$

where \(F=F(\theta )\) is a function that involves neither the prior nor its derivatives.
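
The moment approximation \(E_\theta [n(\tilde{\theta }-\hat{\theta })^{ij}]=g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}+o(n^{-1/2})\) invoked in this step can be checked componentwise: since both estimators are first-order unbiased,

$$\begin{aligned} E_\theta \left[ n(\tilde{\theta }-\hat{\theta })^{ij}\right] = nCov_\theta (\tilde{\theta },\tilde{\theta })^{ij}+nCov_\theta (\hat{\theta },\hat{\theta })^{ij} -nCov_\theta (\tilde{\theta },\hat{\theta })^{ij}-nCov_\theta (\hat{\theta },\tilde{\theta })^{ij}+o(n^{-1/2}), \end{aligned}$$

which tends to \(g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}\) by the first-order asymptotic variances of the minimum scoring rule estimator and of the maximum likelihood estimator, assuming the limiting cross-covariance \(\sigma ^{ij}\) is symmetric.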

Integrating again with respect to the prior \(\bar{\pi }(\theta )\) and letting it shrink to a point mass at \(\theta \), we finally obtain the expected value needed in (18) for \(\alpha =-1\):

$$\begin{aligned}&E_\theta \left[ \pi _{SR}(\theta |X)^2 \pi (\theta |X)^{-1}\right] \nonumber \\&\quad =(2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |G|^{1/2}\left\{ 1+\frac{1}{4n}\left[ \left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) \frac{\pi _{ij}}{\pi }\right. \right. \nonumber \\&\quad \left. \left. +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j}(g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})+2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) \frac{\pi _i}{\pi } \right. \right. \nonumber \\&\quad \left. \left. +\,2g^{ij}\frac{\pi _{i}}{\pi }\frac{\pi _{j}}{\pi } +M \right] \right\} +O(n^{-1}), \end{aligned}$$
(21)

where \(M=M(\theta )\) is a function that involves neither the prior nor its derivatives.

Substituting (21) in (18) with \(\alpha =-1\), we find that maximization of the average Chi-square divergence is equivalent to maximization of

$$\begin{aligned}&\int _\Theta |G|^{1/2}\left[ \phantom {\frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j}}\left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) \frac{\pi _{ij}}{\pi } +2g^{ij}\frac{\pi _{i}}{\pi }\frac{\pi _{j}}{\pi } \right. \\&\quad +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j} \Big (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij} \Big )\right. \\&\quad \left. \left. +\,2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) \frac{\pi _i}{\pi } \right] {\hbox {d}}\theta . \end{aligned}$$

Let \(y=y(\theta )=(y_1(\theta ),\dots ,y_{d}(\theta ))^T\), with \(y_i(\theta )=\pi _i(\theta )/\pi (\theta )\), \(i=1,\ldots ,d\), and let \(y'\) be the matrix of partial derivatives of y with respect to \(\theta \), i.e. \(y'=(y_{ij})_{ij}\), with \(y_{ij}=\partial y_i/\partial \theta ^j\), \(i,j=1,\ldots ,d\). Then, \(\pi _{ij}/\pi =y_{ij}+y_iy_j\) and the quantity to be maximized can be rewritten as

$$\begin{aligned}&\int _\Theta |G|^{1/2} \left[ \phantom {\frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j}}\left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) y_{ij}+\left( -g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) y_iy_j\right. \\&\quad +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j}(g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij} ) \right. \\&\quad \left. \left. +\,2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) y_i \right] {\hbox {d}}\theta . \end{aligned}$$

Let us denote the integrand function by \(U(\theta ,y,y')\). The solution to the maximization problem is found by solving the system of Euler–Lagrange equations:

$$\begin{aligned} \frac{\partial U}{\partial y_i}-\sum _{j=1}^d\frac{\partial }{\partial \theta ^j}\left( \frac{\partial U}{\partial y_{ij}} \right) =0, \quad i=1,\ldots ,{d}. \end{aligned}$$

After some calculations, we obtain the solution to the variational problem as

$$\begin{aligned} y(\theta )=\frac{1}{4}\left[ 6a^S G^{-1} + 4a^\ell I^{-1}+ |G|^{-1}\frac{\partial |G|}{\partial \theta }(5G^{-1}-4\Sigma ) +2G \nabla \cdot (5G^{-1}-4\Sigma ) \right] \Gamma , \end{aligned}$$

where \(a^S=(a^S_1,\ldots ,a^S_{d})^T\), \(a^\ell =(a^\ell _1,\ldots ,a^\ell _{d})^T\), \(\Sigma =(\sigma ^{ij})_{ij}\), \(\Gamma =(G^{-1}-4I^{-1}+4\Sigma )^{-1}\), \(\nabla \cdot G^{-1}= (\partial g^{1j}/\partial \theta ^j, \ldots , \partial g^{{d} j}/\partial \theta ^j)^T\) and \(\nabla \cdot \Sigma = (\partial \sigma ^{1j}/\partial \theta ^j, \ldots , \partial \sigma ^{{d} j}/\partial \theta ^j)^T\). \(\square \)

Cite this article

Giummolè, F., Mameli, V., Ruli, E. et al. Objective Bayesian inference with proper scoring rules. TEST 28, 728–755 (2019). https://doi.org/10.1007/s11749-018-0597-z
