Objective Bayesian inference with proper scoring rules

Giummolè, F.; Mameli, V.; Ruli, E.; Ventura, L.

doi:10.1007/s11749-018-0597-z

Objective Bayesian inference with proper scoring rules

Original Paper
Published: 12 July 2018

Volume 28, pages 728–755, (2019)
Cite this article

TEST Aims and scope Submit manuscript

F. Giummolè¹,
V. Mameli¹,
E. Ruli ORCID: orcid.org/0000-0001-8572-2149² &
…
L. Ventura²

575 Accesses
10 Citations
1 Altmetric
Explore all metrics

Abstract

Standard Bayesian analyses can be difficult to perform when the full likelihood, and consequently the full posterior distribution, is too complex or even impossible to specify or if robustness with respect to data or to model misspecifications is required. In these situations, we suggest to resort to a posterior distribution for the parameter of interest based on proper scoring rules. Scoring rules are loss functions designed to measure the quality of a probability distribution for a random variable, given its observed value. Important examples are the Tsallis score and the Hyvärinen score, which allow us to deal with model misspecifications or with complex models. Also the full and the composite likelihoods are both special instances of scoring rules. The aim of this paper is twofold. Firstly, we discuss the use of scoring rules in the Bayes formula in order to compute a posterior distribution, named SR-posterior distribution, and we derive its asymptotic normality. Secondly, we propose a procedure for building default priors for the unknown parameter of interest that can be used to update the information provided by the scoring rule in the SR-posterior distribution. In particular, a reference prior is obtained by maximizing the average $\alpha $-divergence from the SR-posterior distribution. For $0 \le |\alpha |<1$, the result is a Jeffreys-type prior that is proportional to the square root of the determinant of the Godambe information matrix associated with the scoring rule. Some examples are discussed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Bayesian Variable Selection for Linear Models Using I-Priors

Detecting and diagnosing prior and likelihood sensitivity with power-scaling

Article Open access 31 December 2023

Noa Kallioinen, Topi Paananen, … Aki Vehtari

Pseudo-Likelihoods for Bayesian Inference

References

Barndorff-Nielsen OE (1976) Plausibility inference. J R Stat Soc B 38:103–131
MathSciNet MATH Google Scholar
Basu A, Harris IR, Hjort NL, Jones MC (1998) Robust and efficient estimation by minimising a density power divergence. Biometrika 85:549–559
Article MathSciNet MATH Google Scholar
Berger JO (2006) The case for objective Bayesian analysis. Bayesian Anal 1:1–17
Article MathSciNet MATH Google Scholar
Berger JO, Bernardo JM (1992) On the development of reference priors (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian statistics. Oxford University Press, Oxford
Google Scholar
Berger JO, Bernardo JM, Sun D (2009) The formal definition of reference priors. J Am Stat Assoc 107:636–648
Article MATH Google Scholar
Berger JO, Bernardo JM, Sun D (2012) Overall objective priors. Bayesian Anal 10:189–221
Article MathSciNet MATH Google Scholar
Bernardo JM (1979) Reference posterior distributions for Bayesian inference (with discussion). J R Stat Soc B 41:113–147
MATH Google Scholar
Bernardo JM (2005) Reference analysis. In: Dey DK, Rao CR (eds) Handbook of Bayesian statistics, vol 25. Elsevier, North-Holland, pp 17–90
Google Scholar
Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78:1–3
Article Google Scholar
Chandler RE, Bate S (2007) Inference for clustered data using the independence loglikelihood. Biometrika 94:167–183
Article MathSciNet MATH Google Scholar
Chang I, Mukerjee R (2008) Bayesian and frequentist confidence intervals arising from empirical-type likelihoods. Biometrika 95:139–147
Article MathSciNet MATH Google Scholar
Corcuera JM, Giummolè F (1998) A characterization of monotone and regular divergences. Ann Inst Stat Math 50:433–450
Article MathSciNet MATH Google Scholar
Datta GS, Mukerjee R (2004) Probability matching priors: higher order asymptotics. Lecture Notes in Statistics 178. Springer
Dawid AP (1986) Probability forecasting. In: Kotz S, Johnson NL, Read CB (eds) Encyclopedia of statistical sciences. Wiley, New York, pp 210–218
Google Scholar
Dawid AP (2007) The geometry of proper scoring rules. Ann Inst Stat Math 59:77–93
Article MathSciNet MATH Google Scholar
Dawid AP, Lauritzen SL (2005) The geometry of decision theory. In: Proceedings of the second international symposium on information geometry and its applications, pp 22–28. University of Tokyo
Dawid AP, Musio M (2014) Theory and applications of proper scoring rules. Metron 72:169–183
Article MathSciNet MATH Google Scholar
Dawid AP, Musio M (2015) Bayesian model selection based on proper scoring rules (with discussion). Bayesian Anal 10:479–521
Article MathSciNet MATH Google Scholar
Dawid AP, Musio M, Ventura L (2016) Minimum scoring rule inference. Scand J Stat 43:123–138
Article MathSciNet MATH Google Scholar
Farcomeni A, Ventura L (2012) An overview of robust methods in medical research. Stat Methods Med Res 21:111–133
Article MathSciNet Google Scholar
Godambe VP (1960) An optimum property of regular maximum likelihood estimation. Ann Math Stat 31:1208–1211
Article MathSciNet MATH Google Scholar
Good IJ (1952) Rational decisions. J R Stat Soc B 14:107–114
MathSciNet Google Scholar
Ghosh M (2011) Objective priors: an introduction for frequentists. Stat Sci 26:187–202
Article MathSciNet MATH Google Scholar
Ghosh M, Basu A (2013) Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression. Electron J Stat 7:2420–2456
Article MathSciNet MATH Google Scholar
Ghosh M, Basu A (2016) Robust Bayes estimation using the density power divergence. Ann Inst Stat Math 68:413–437
Article MathSciNet MATH Google Scholar
Ghosh M, Mergel V, Liu R (2011) A general divergence criterion for prior selection Ann Inst Stat Math 43–58
Ghosh JK, Mukerjee R (1991) Characterization of priors under which Bayesian and frequentist Bartlett corrections are equivalent in the multi-parameter case. J Mult Anal 38:385–393
Article MATH Google Scholar
Greco L, Racugno W, Ventura L (2008) Robust likelihood functions in Bayesian inference. J Stat Plan Inf 138:1258–1270
Article MathSciNet MATH Google Scholar
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
MATH Google Scholar
Heritier S, Cantoni E, Copt S, Victoria-Feser MP (2009) Robust methods in biostatistics. Wiley, Chichester
Book MATH Google Scholar
Huber PJ, Ronchetti EM (2009) Robust statistics. Wiley, New York
Book MATH Google Scholar
Hyvärinen A (2005) Estimation of non-normalized statistical models by score matching. J Mach Learn Res 6:695–709
MathSciNet MATH Google Scholar
Hyvärinen A (2007) Some extensions of score matching. Comput Stat Data Anal 51:2499–2512
Article MathSciNet MATH Google Scholar
Jeffreys H (1961) Theory of probability. Oxford University Press, New York
MATH Google Scholar
Lazar NA (2003) Bayes empirical likelihood. Biometrika 90:319–326
Article MathSciNet MATH Google Scholar
Leisen F, Villa C, Walker SG (2017) On a global objective prior from score rules. arXiv: 1706.00599v1
Lin L (2006) Quasi Bayesian likelihood. Stat Methodol 3:444–455
Article MathSciNet MATH Google Scholar
Liu R, Chakrabarti A, Samanta T, Ghosh JK, Ghosh M (2014) On divergence measures leading to Jeffreys and other reference priors. Bayesian Anal 9:331–370
Article MathSciNet MATH Google Scholar
Machete R (2013) Contrasting probabilistic scoring rules. J Stat Plan Inf 143:1781–1790
Article MathSciNet MATH Google Scholar
Mameli V, Musio M, Ventura L (2018) Bootstrap adjustments of signed scoring rule root statistics. Comm Statist - Simul Comput 47:1204–1215
Mameli V, Ventura L (2015) Higher-order asymptotics for scoring rules. J Stat Plan Inf 165:13–26
Article MathSciNet MATH Google Scholar
Mardia KV, Jupp PE (2000) Directional statistics. Wiley, London
MATH Google Scholar
Mardia KV, Kent JT, Laha AK (2016) Score matching estimators for directional distributions. ArXiv:1604.08470v1
Musio M, Mameli V, Ruli E, Ventura L (2017) Bayesian Inference for directional data through ABC and homogeneous proper scoring rules. Proceeding of 61$st$ ISI World Statistics Congress, to appear
Pace L, Salvan A, Sartori N (2011) Adjusting composite likelihood ratio statistics. Stat Sin 21:129–148
MathSciNet MATH Google Scholar
Parry M, Dawid AP, Lauritzen SL (2012) Proper local scoring rules. Ann Stat 40:561–592
Article MathSciNet MATH Google Scholar
Pauli F, Racugno W, Ventura L (2011) Bayesian composite marginal likelihoods. Stat Sin 21:149–164
MathSciNet MATH Google Scholar
Riani M, Atkinson AC, Perrotta D (2014) A parametric framework for the comparison of methods of very robust regression. Stat Sci 29:128–143
Article MathSciNet MATH Google Scholar
Ribatet M, Cooley D, Davison AC (2012) Bayesian inference from composite likelihoods, with an application to spatial extremes. Stat Sin 22:813–845
MathSciNet MATH Google Scholar
Ruli E, Sartori N, Ventura L (2016) Approximate Bayesian Computation with composite score functions. Stat Comput 26:679–692
Article MathSciNet MATH Google Scholar
Schennach S (2005) Bayesian exponentially tilted empirical likelihood. Ann Stat 35:634–672
Article MathSciNet MATH Google Scholar
Smith EL, Stephenson AG (2009) An extended Gaussian max-stable process model for spatial extremes. J Stat Plan Inf 139:1266–1275
Article MathSciNet MATH Google Scholar
Tsallis C (1988) Possible generalization of Boltzmann–Gibbs statistics. J Stat Phys 52:479–487
Article MathSciNet MATH Google Scholar
Varin C, Reid N, Firth D (2011) An overview of composite likelihood methods. Stat Sin 21:5–42
MathSciNet MATH Google Scholar
Ventura L, Cabras S, Racugno W (2010) Default prior distributions from quasi- and quasi-profile likelihoods. J Stat Plan Inf 140:2937–2942
Article MathSciNet MATH Google Scholar
Ventura L, Racugno W (2016) Pseudo-likelihoods for Bayesian inference. Topics on methodological and applied statistical inference, series studies in theoretical and applied statistics. Springer, Berlin, pp 205–220
Google Scholar
Walker SG (2016) Bayesian information in an experiment and the Fisher information distance. Stat Probab Lett 112:5–9
Article MathSciNet MATH Google Scholar
Yang Y, He X (2012) Bayesian empirical likelihood for quantile regression. Ann Stat 40:1102–1131
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
F. Giummolè & V. Mameli
Department of Statistical Sciences, University of Padova, Padua, Italy
E. Ruli & L. Ventura

Authors

F. Giummolè
View author publications
You can also search for this author in PubMed Google Scholar
V. Mameli
View author publications
You can also search for this author in PubMed Google Scholar
E. Ruli
View author publications
You can also search for this author in PubMed Google Scholar
L. Ventura
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to E. Ruli.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5845 KB)

Appendix

1.1 Proof of Theorem 3.1

Proof

The posterior (5) can equivalently be written as

$$\begin{aligned} \pi _{SR} (\theta |x)=\frac{\pi (\theta ) \exp {\{-S (\theta ^*)+S(\tilde{\theta })\}}}{\int _\Theta \pi (\theta ) \exp {\{-S (\theta ^*)+S(\tilde{\theta })\}} {\hbox {d}}\theta }, \end{aligned}$$

with $\theta ^*=\tilde{\theta }+C(\theta -\tilde{\theta })$ and C fixed such that $C^TK(\theta )C=G(\theta )$.

Let $w=(w^1,\ldots ,w^d)^T=n^{1/2}(\theta -\tilde{\theta })$. Then, $\theta =\tilde{\theta }+n^{-1/2}w$ and $\theta ^*=\tilde{\theta }+n^{-1/2}Cw$. A posterior for w is

$$\begin{aligned} \pi _{SR} (w|x)=\frac{b(w,x)}{\int b(w,x) {\hbox {d}}w}, \end{aligned}$$

with

$$\begin{aligned} b(w,x)= \pi (\tilde{\theta }+n^{-1/2}w) \exp {\{-S (\tilde{\theta }+n^{-1/2}Cw)+S(\tilde{\theta })\}}. \end{aligned}$$

Now

$$\begin{aligned} \pi (\tilde{\theta }+n^{-1/2}w)= \tilde{\pi }\left( 1+n^{-1/2}R_1(w)+\frac{1}{2}n^{-1}R_2(w)\right) +O_p(n^{-3/2}), \end{aligned}$$

with

$$\begin{aligned} R_1(w)=\frac{\tilde{\pi }_i}{\tilde{\pi }}w^i \quad \quad \text{ and } \quad \quad R_2(w)=\frac{\tilde{\pi }_{ij}}{\tilde{\pi }}w^{ij}, \end{aligned}$$

where $w^{ij\ldots }=w^iw^j\ldots $ is a product of components of w. Moreover,

$$\begin{aligned} -S (\tilde{\theta }+n^{-1/2}Cw)+S(\tilde{\theta })= & {} -n^{-1}\left( \frac{1}{2}(Cw)^{ij}\tilde{S}_{ij} +\frac{1}{6}n^{-1/2}(Cw)^{ijk}\tilde{S}_{ijk}\right. \nonumber \\&\left. +\,\frac{1}{24}n^{-1}(Cw)^{ijkh}\tilde{S}_{ijkh}\right) +O_p(n^{-3/2}) \nonumber \\= & {} -\,\frac{1}{2}n^{-1}w^T\left( C^T\frac{\partial ^2\tilde{S}}{\partial \theta \partial \theta ^T}C\right) w-\frac{1}{6}n^{-1/2}R_3(w)\nonumber \\&-\,\frac{1}{24}n^{-1}R_4(w) +O_p(n^{-3/2})\nonumber \\= & {} -\,\frac{1}{2}w^T\tilde{H} w-\frac{1}{6}n^{-1/2}R_3(w)-\frac{1}{24}n^{-1}R_4(w)\nonumber \\&+\,O_p(n^{-3/2}), \end{aligned}$$

(15)

where

$$\begin{aligned} R_3(w)=n^{-1}(Cw)^{ijk}\tilde{S}_{ijk} \quad \quad \text{ and } \quad \quad R_4(w)=n^{-1}(Cw)^{ijkh}\tilde{S}_{ijkh}. \end{aligned}$$

The numerator in Bayes formula can thus be written as

$$\begin{aligned} b(w,x)= & {} \tilde{\pi }\left( 1+n^{-1/2}R_1(w) +\frac{1}{2}n^{-1} R_2(w)\right) \nonumber \\&\exp {\left\{ -\frac{1}{2}w^T\tilde{H} w \right\} }\left( 1{-}\frac{1}{6}n^{-1/2}R_3(w){-}\frac{1}{24}n^{-1}R_4(w) {+}\frac{1}{72}n^{-1}R_3(w)^2\right) \nonumber \\&+O_p(n^{-3/2})= \tilde{\pi }\exp {\left\{ -\frac{1}{2}w^T\tilde{H} w \right\} }\left[ 1+n^{-1/2}\left( R_1(w)-\frac{1}{6}R_3(w)\right) \right. \nonumber \\&\left. +\,n^{-1}\left( \frac{1}{2}R_2(w) -\frac{1}{6}R_1(w)R_3(w)-\frac{1}{24}R_4(w)+ \frac{1}{72}R_3(w)^2 \right) \right] \nonumber \\&+\,O_p(n^{-3/2}). \end{aligned}$$

(16)

The denominator can be approximated using the moments of the d-variate normal distribution $N_d(0,\tilde{H}^{-1})$. We have that

$$\begin{aligned} \int b(w,x) dw= & {} \tilde{\pi }(2\pi )^{d/2}|\tilde{H}|^{-1/2} \left[ 1+n^{-1}\left( \frac{1}{2}E(R_2(W)) -\frac{1}{6}E(R_1(W)R_3(W))\right. \right. \nonumber \\&\left. \left. -\frac{1}{24}E(R_4(W))+ \frac{1}{72}E(R_3(W)^2) \right) \right] + o(n^{-1}), \end{aligned}$$

(17)

being $E(R_1(W))$ and $E(R_3(W))$ both related to odd moments and thus equal to zero. Now,

$$\begin{aligned} E(R_2(W))= & {} \frac{\tilde{\pi }_{ij}}{\tilde{\pi }} \tilde{h}^{ij} , \quad \quad E(R_1(W)R_3(W))= 3\frac{\tilde{\pi }_{i}}{\tilde{\pi }} \frac{\tilde{S}_{jkh}}{n}c_{jr}c_{ks}c_{ht}\tilde{h}^{ir}\tilde{h}^{st} ,\\ E(R_4(W))= & {} 3 \frac{\tilde{S}_{ijkh}}{n}c_{ir}c_{js}c_{kt}c_{hu}\tilde{h}^{rs}\tilde{h}^{tu} \end{aligned}$$

and

$$\begin{aligned} E(R_3(W)^2)= & {} \frac{\tilde{S}_{ijk}\tilde{S}_{rst}}{n^2}(9c_{ia}c_{jb}c_{kc}c_{rd}c_{se}c_{tf}\tilde{h}^{ab}\tilde{h}^{cd}\tilde{h}^{ef}\\&+6c_{ia}c_{rb}c_{jc}c_{sd}c_{ke}c_{tf}\tilde{h}^{ab}\tilde{h}^{cd}\tilde{h}^{ef}). \end{aligned}$$

Putting together expressions (16) and (17), we finally obtain the result with

$$\begin{aligned} A_1(w)=R_1(w)-\frac{1}{6}R_3(w) \end{aligned}$$

and

$$\begin{aligned} A_2(w)= & {} \frac{1}{2}\left( R_2(w)-E(R_2(W))\right) -\frac{1}{6}\left( R_1(w)R_3(w)-E(R_1(W)R_3(W))\right) \\&-\,\frac{1}{24}\left( R_4(w)-E(R_4(W))\right) +\frac{1}{72}\left( R_3(w)^2-E(R_3(W)^2)\right) . \end{aligned}$$

$\square $

1.2 Proof of Theorem 4.1

Proof

The proof follows the same steps as in Liu et al. (2014), Section 3, generalized for the use of a SR-posterior (5) instead of the classic posterior distribution.

Let $\pi (\theta )$ be a prior distribution for $\theta $ and $p(x|\theta )$ be the conditional distribution of X given $\theta $. Thus, the marginal distribution of X can be written as $p(x)=p(x|\theta )\pi (\theta )/\pi (\theta |x)$, with $\pi (\theta |x)$ the conditional distribution of $\theta $ given x. The functional (8) associated with an $\alpha $-divergence between the prior $\pi (\theta )$ and the SR-posterior (5) can be written as

$$\begin{aligned} T(\pi )= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1- \int _{\mathcal {X}} \left( \int _{\Theta } \pi (\theta )^\alpha \pi _{SR}(\theta |x)^{1-\alpha } d\theta \right) p(x) {\hbox {d}}x \right] \nonumber \\= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1- \int _{\Theta } \left( \int _{\mathcal {X}} \pi (\theta )^\alpha \pi _{SR}(\theta |x)^{1-\alpha } \pi (\theta |x)^{-1}p(x|\theta ){\hbox {d}}x \right) \pi (\theta ){\hbox {d}}\theta \right] \nonumber \\= & {} \frac{1}{\alpha (1-\alpha )}\left[ 1-\int _{\Theta } \pi (\theta )^{\alpha +1}E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right] {\hbox {d}}\theta \right] , \end{aligned}$$

(18)

where $E_\theta (\cdot )$ denotes expectation with respect to the conditional distribution of X given $\theta $. For $\alpha =0$ or 1, we need to interpret $T(\pi )$ as its limiting value, when it exists.

Now, we apply the shrinkage argument for evaluating the conditional expectation in (18), $E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right] $. See Datta and Mukerjee (2004) for details on the shrinkage method.

Let us first consider the case of $0<|\alpha |<1$.

The first step of the shrinkage method involves fixing a suitable prior distribution $\bar{\pi }(\theta )$ and calculating the expected value of $\pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}$ with respect to the corresponding posterior $\bar{\pi }(\theta |x)$, i.e. $E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}|x\right] $. Notice that here we use the classic posterior based on the prior $\bar{\pi }(\theta )$, i.e. the conditional distribution of $\theta $ given x, $\bar{\pi }(\theta |x)$. Using up to first order the asymptotic expansion (7) for $\pi _{SR}(\theta |x)$ and similar expansions for $\bar{\pi }(\theta |x)$ and $\pi (\theta |x)$, we obtain

$$\begin{aligned} \pi _{SR}(\theta |X)^{1-\alpha }=\left( \frac{n^d|\tilde{H}|}{(2\pi )^d}\right) ^{(1-\alpha )/2} \exp {\left\{ \frac{n(1-\alpha )}{2}(\theta -\tilde{\theta })^T\tilde{H}(\theta -\tilde{\theta })\right\} }+O_p(n^{-1/2}) \end{aligned}$$

and

$$\begin{aligned} \pi (\theta |X)^{-1}\bar{\pi }(\theta |X)=1 +O_p(n^{-1/2}). \end{aligned}$$

Since $\tilde{H}$ tends to G as $n\rightarrow \infty $, we have that

$$\begin{aligned} E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^{1-\alpha }\pi (\theta |x)^{-1}|x\right]= & {} \left( \frac{n^d|\tilde{H}|}{(2\pi )^d}\right) ^{(1-\alpha )/2}\\&\int _{\Theta } \exp {\left\{ \frac{n(1-\alpha )}{2}(\theta -\tilde{\theta })^T\tilde{H}(\theta -\tilde{\theta })\right\} } d\theta \\&+\,O(n^{-1/2})\\= & {} \left( \frac{n^d|G(\theta )|}{(2\pi )^d}\right) ^{-\alpha /2}(1-\alpha )^{-d/2}+O(n^{-1/2}). \end{aligned}$$

The second step of the shrinkage argument requires to integrate again with respect to the distribution of X given $\theta $ and with respect to the prior $\bar{\pi }(\theta )$, thus obtaining

$$\begin{aligned}&\int _\Theta E_\theta \left[ E^{\bar{\pi }}\left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}|X\right] \right] \bar{\pi }(\theta ) {\hbox {d}}\theta \\&\quad =\int _\Theta \left( \frac{n^d|G(\theta )|}{(2\pi )^d}\right) ^{-\alpha /2}(1-\alpha )^{-d/2} \, \bar{\pi }(\theta ) \, {\hbox {d}}\theta +O(n^{-1/2}). \end{aligned}$$

Finally, by letting the prior $\bar{\pi }(\theta )$ go to $\theta $, we obtain

$$\begin{aligned} E_\theta \left[ \pi _{SR}(\theta |X)^{1-\alpha }\pi (\theta |X)^{-1}\right]= & {} \left( \frac{n}{2\pi }\right) ^{-d\alpha /2}|G(\theta )|^{-\alpha /2}(1-\alpha )^{-d/2}+O(n^{-1/2}).\nonumber \\ \end{aligned}$$

(19)

For $0<|\alpha |<1$, by substituting (19) in (18) we can see that the selection of a prior $\pi (\theta )$ corresponds to the minimization with respect to $\pi (\theta )$ of the functional

$$\begin{aligned} \frac{1}{\alpha (1-\alpha )}\int _\Theta \pi (\theta )^{\alpha +1}|G(\theta )|^{-\alpha /2}d\theta . \end{aligned}$$

(20)

Notice that the preceding expression can be interpreted as an increasing transformation of the $(-\alpha )$-divergence between a density that is proportional to $|G(\theta )|^{1/2}$ and the prior $\pi (\theta )$. Thus, it is minimized if and only if the two densities coincide.

This result cannot be extended to the case of $\alpha \ge 1$, as is evident from the right-hand side of (19). For $\alpha <-1$, (20) turns out to be equivalent to a decreasing transformation of the $(-\alpha )$-divergence between a density that is proportional to $|G(\theta )|^{1/2}$ and the prior $\pi (\theta )$. Thus, a maximizer for the expected $\alpha $-divergence in this case does not exist.

For $\alpha \rightarrow 0$, following the same steps as for $0<|\alpha |<1$, it can be easily shown that maximization of the average Kullback–Leibler divergence is asymptotically equivalent to minimization of

$$\begin{aligned} \int _\Theta \pi (\theta ) \log \left( \frac{\pi (\theta )}{|G(\theta )|^{1/2}}\right) {\hbox {d}}\theta , \end{aligned}$$

which is attained by choosing again a Jeffreys-type prior proportional to $ |G(\theta )|^{1/2}$. Indeed, the preceding expression is equal up to an additive constant to Kullback–Leibler divergence between a density that is proportional to $|G(\theta )|^{1/2}$ and the prior $\pi (\theta )$.

The case $\alpha =-1$ corresponds to the Chi-square divergence. For this case the proof requires higher-order terms in the expansion of the scoring rule posterior distribution and also of both the conditional distributions of $\theta $ given x calculated with respect to the priors $\pi (\theta )$ and $\bar{\pi }(\theta )$. Thus, we use (7) up to order $O_p(n^{-3/2})$. In particular, let $w=n^{1/2}(\theta -\tilde{\theta })$ and $w_1=n^{1/2}(\theta -\hat{\theta })$, where $\tilde{\theta }$ is the minimum scoring rule estimator and $\hat{\theta }$ the maximum likelihood estimator for $\theta $. The different posterior distributions for w and $w_1$ given x can be written as

$$\begin{aligned} \pi _{SR} (w|x)= & {} \phi _d(w;\tilde{H}^{-1}) \left[ 1+n^{-1/2}A_1(w) + n^{-1}A_2(w)\right] +O_p(n^{-3/2}),\\ \pi (w_1|x)= & {} \phi _d(w_1;(\hat{J}^\ell )^{-1}) \left[ 1+n^{-1/2}A^\ell _1(w_1) + n^{-1}A^\ell _2(w_1)\right] +O_p(n^{-3/2}),\\ \bar{\pi }(w_1|x)= & {} \phi _d(w_1;(\hat{J}^\ell )^{-1}) \left[ 1+n^{-1/2}\bar{A}^\ell _1(w_1) + n^{-1}\bar{A}^\ell _2(w_1)\right] +O_p(n^{-3/2}), \end{aligned}$$

where $A_1$ and $A_2$ are defined as in Theorem 3.1, $J^\ell (\theta )=(-\partial ^2\ell /\partial \theta \partial \theta ^T)/n$ is the observed information and $A^\ell _1$, $\bar{A}^\ell _1$, $A^\ell _2$, $\bar{A}^\ell _2$ are obtained by calculating $A_1$ and $A_2$ with the logarithmic score as scoring rule and $\pi $ and $\bar{\pi }$ as priors; for instance,

$$\begin{aligned} A^\ell _1(w_1)= \frac{\hat{\pi }_i}{\hat{\pi }}w_1^i+\frac{1}{6}\frac{\hat{\ell }_{ijk}}{n}w_1^iw_1^jw_1^k, \quad \text{ and }\quad \bar{A}^\ell _1(w_1)= \frac{\hat{\bar{\pi }}_i}{\hat{\bar{\pi }}}w_1^i+\frac{1}{6}\frac{\hat{\ell }_{ijk}}{n}w_1^iw_1^jw_1^k, \end{aligned}$$

with $\ell $ the log-likelihood and $\ell _{ijk}$ its third-order derivatives. As usual, a tilde or a hat over a quantity indicate that it is calculated at $\tilde{\theta }$ and $\hat{\theta }$, respectively.

Following the shrinkage argument, we need first to evaluate

$$\begin{aligned} E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^2 \pi (\theta |x)^{-1}|x\right]= & {} n^{d/2} \int \pi _{SR}(w|x)^2 \pi (w|x)^{-1} \bar{\pi }(w|x) {\hbox {d}}w. \end{aligned}$$

For doing this, we use the fact that

$$\begin{aligned} \pi _{SR} (w|x)^2= & {} (2\pi )^{-d/2} |\tilde{H}|^{1/2} 2^{-d/2} \phi _d(w;\tilde{H}^{-1}/2) \\&\left[ 1+n^{-1/2}2A_1(w) + n^{-1}(2A_2(w)+A_1(w)^2)\right] +O_p(n^{-3/2}) \end{aligned}$$

and

$$\begin{aligned} \pi (w_1|x)^{-1} \bar{\pi }(w_1|x)= & {} 1+n^{-1/2}(\bar{A}^\ell _1(w_1)-A^\ell _1(w_1)) \\&+\, n^{-1}(A^\ell _1(w_1)^2+\bar{A}^\ell _2(w_1)-A^\ell _2(w_1)-\bar{A}^\ell _1(w_1) A^\ell _1(w_1)) \\&+\,O_p(n^{-3/2}) \end{aligned}$$

and we calculate the last expression in $w_1=w+n^{1/2}(\tilde{\theta }-\hat{\theta })$.

After tedious calculations, we obtain

$$\begin{aligned}&E^{\bar{\pi }}\left[ \pi _{SR}(\theta |x)^2 \pi (\theta |x)^{-1}|x\right] \\&\quad = (2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |\tilde{H}|^{1/2} \left\{ 1 + \frac{1}{n^{1/2}} \left( \frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} -\frac{\hat{\pi }_{i}}{\hat{\pi }} \right) n^{1/2}(\tilde{\theta }-\hat{\theta })^i \right. \\&\quad \left. +\, \frac{1}{2n} \left[ \left( -\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} +\frac{\hat{\pi }_{ij}}{\hat{\pi }} \right) \hat{\varvec{j}}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( -\frac{\tilde{\pi }_{ij}}{\tilde{\pi }}+\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\tilde{\pi }_{j}}{\tilde{\pi }} +2\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} -2\frac{\tilde{\pi }_{i}}{\tilde{\pi }}\frac{\hat{\pi }_{j}}{\hat{\pi }} +\frac{\hat{\pi }_{i}}{\hat{\pi }}\frac{\hat{\pi }_{j}}{\hat{\pi }} -\frac{\hat{\pi }_{i}}{\hat{\pi }}\frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} +\frac{1}{2}\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} -\frac{1}{2}\frac{\hat{\pi }_{ij}}{\hat{\pi }} \right) \tilde{h}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( \frac{\tilde{\pi }_{i}}{\tilde{\pi }} -\frac{1}{2}\frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} +\frac{1}{2}\frac{\hat{\pi }_{i}}{\hat{\pi }}\right) \frac{\tilde{S}_{khl}}{n} c_{kj}c_{hs}c_{lt}\tilde{h}^{st}\tilde{h}^{ij} +\left( -\frac{\hat{\bar{\pi }}_{i}}{\hat{\bar{\pi }}} +\frac{\hat{\pi }_{i}}{\hat{\pi }}\right) \frac{\tilde{\ell }_{jhk}}{n} \hat{\varvec{j}}^{hk}\hat{\varvec{j}}^{ij} \right. \right. \\&\quad \left. \left. +\, \left( -\frac{\hat{\pi }_{ij}}{\hat{\pi }}+\frac{\hat{\bar{\pi }}_{ij}}{\hat{\bar{\pi }}} +2\frac{\hat{\pi }_{i}}{\hat{\pi }} \frac{\hat{\pi }_{j}}{\hat{\pi }} -2\frac{\hat{\pi }_{i}}{\hat{\pi }} \frac{\hat{\bar{\pi }}_{j}}{\hat{\bar{\pi }}} \right) n(\tilde{\theta }-\hat{\theta })^{ij} \right] + R \right\} + O(n^{-1}), \end{aligned}$$

where $\hat{\varvec{j}}^{ij}$ are the components of $(\hat{J}^\ell )^{-1}$ and $R=R(\tilde{\theta },\hat{\theta })$ is a function that does not involve the prior nor its derivatives.

Now, let $E_\theta (S_{ijk}/n)=B^S_{ijk}(\theta )+o(n^{-1/2})$, $E_\theta (\ell _{ijk}/n)=B^\ell _{ijk}(\theta )+o(n^{-1/2})$ and recall that $E_\theta (\tilde{H})=G(\theta )+o(n^{-1/2})$ and $E_\theta (\hat{J}^\ell )=I(\theta )+o(n^{-1/2})$, where I is the expected information matrix. Moreover, note that $E_\theta [(\tilde{\theta }-\hat{\theta })^i ]=o(n^{-1/2})$ and $E_\theta [n(\tilde{\theta }-\hat{\theta })^{ij}]=g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}+o(n^{-1/2})$, where $g_{ij}$ and $g^{ij}$ are the components of G and $G^{-1}$, respectively, ${\varvec{i}}^{ij}$ are the components of $I^{-1}$ and $\sigma ^{ij}=nCov_\theta (\tilde{\theta },\hat{\theta })^{ij}$. Furthermore, let $a^S_j=B^S_{khl}c_{kj}c_{hs}c_{lt}g^{st}$, $a^\ell _j=B^\ell _{jhk}{\varvec{i}}^{hk}$. By integrating again the above expression with respect to the distribution of X given $\theta $, we get

$$\begin{aligned}&E_\theta \left[ E^{\bar{\pi }}\left[ \pi _{SR}(\theta |X)^2 \pi (\theta |X)^{-1}|X\right] \right] \\&\quad = (2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |G|^{1/2} \left\{ 1 + \frac{1}{2n} \left[ \left( -\frac{\bar{\pi }_{ij}}{\bar{\pi }} +\frac{\pi _{ij}}{\pi } \right) {\varvec{i}}^{ij} \right. \right. \\&\quad \left. \left. +\,\left( -\frac{3}{2}\frac{\pi _{ij}}{\pi }+\frac{\pi _{i}}{\pi }\frac{\bar{\pi }_{j}}{\bar{\pi }} +\frac{1}{2}\frac{\bar{\pi }_{ij}}{\bar{\pi }} \right) g^{ij} \right. \right. \\&\quad \left. \left. +\,\left( \frac{3}{2}\frac{\pi _{i}}{\pi } -\frac{1}{2}\frac{\bar{\pi }_{i}}{\bar{\pi }} \right) a^S_j g^{ij} -\left( \frac{\bar{\pi }_{i}}{\bar{\pi }} -\frac{\pi _{i}}{\pi }\right) a^\ell _j {\varvec{i}}^{ij} \right. \right. \\&\quad \left. \left. +\, \left( -\frac{\pi _{ij}}{\pi }+\frac{{\bar{\pi }}_{ij}}{{\bar{\pi }}} +2\frac{{\pi }_{i}}{{\pi }} \frac{{\pi }_{j}}{{\pi }} -2\frac{{\pi }_{i}}{{\pi }} \frac{{\bar{\pi }}_{j}}{{\bar{\pi }}} \right) (g^{ij}+{\varvec{i}}^{ij}-2\sigma ^{ij}) \right] + F \right\} + O(n^{-1}), \end{aligned}$$

where $F=F(\theta )$ is a function that does not involve the prior nor its derivatives.

By integrating again with respect to the prior $\bar{\pi }(\theta )$ and by letting the prior go to $\theta $, we finally obtain the expected value needed in (18) for $\alpha =-1$:

$$\begin{aligned}&\quad =(2\pi )^{-d/2}\left( \frac{n}{2}\right) ^{d/2} |G|^{1/2}\left\{ 1+\frac{1}{4n}\left[ \left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) \frac{\pi _{ij}}{\pi }\right. \right. \nonumber \\&\quad \left. \left. +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j}(g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})+2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) \frac{\pi _i}{\pi } \right. \right. \nonumber \\&\quad \left. \left. +\,2g^{ij}\frac{\pi _{i}}{\pi }\frac{\pi _{j}}{\pi } +M \right] \right\} +O(n^{-1}), \end{aligned}$$

(21)

where $M=M(\theta )$ is a function that does not involve the prior nor its derivatives.

Substituting (21) in (18) with $\alpha =-1$, we find that maximization of the average Chi-square divergence is equivalent to maximization of

$$\begin{aligned}&\int _\Theta |G|^{1/2}\left[ \phantom {\frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j}}\left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) \frac{\pi _{ij}}{\pi } +2g^{ij}\frac{\pi _{i}}{\pi }\frac{\pi _{j}}{\pi } \right. \\&\quad +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j} \Big ({g^{ij}}g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij} \Big )\right. \\&\quad \left. \left. +\,2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) \frac{\pi _i}{\pi } \right] d\theta . \end{aligned}$$

Let $y=y(\theta )=(y_1(\theta ),\dots ,y_{d}(\theta ))^T$, with $y_i(\theta )=\pi _i(\theta )/\pi (\theta )$, $i=1,\ldots , {d}$, and let $y'$ be the matrix of partial derivatives of y with respect to $\theta $, i.e. $y'=(y_{ij})_{ij}$, $y_{ij}=\partial y_i/\partial \theta ^j$, $i,j=1,\ldots { d}$. Then, $\pi _{ij}/\pi =y_{ij}+y_iy_j$ and the quantity to be maximized can be rewritten as

$$\begin{aligned}&\int _\Theta |G|^{1/2} \left[ \phantom {\frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j}}\left( -3g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) y_{ij}+\left( -g^{ij} +4{\varvec{i}}^{ij}-4\sigma ^{ij}\right) y_iy_j\right. \\&\quad +\,\left( 3a^S_jg^{ij}+2a^\ell _j{\varvec{i}}^{ij}+|G|^{-1}\frac{\partial |G|}{\partial \theta ^j}(g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij} ) \right. \\&\quad \left. \left. +\,2 \frac{\partial (g^{ij}+2{\varvec{i}}^{ij}-4\sigma ^{ij})}{\partial \theta ^j} \right) y_i \right] {\hbox {d}}\theta . \end{aligned}$$

Let us denote the integrand function by $U(\theta ,y,y')$. The solution to the maximization problem is found by solving the system of Euler–Lagrange equations:

$$\begin{aligned} \frac{\partial U}{\partial y_i}-\sum _{j=1}^d\frac{\partial }{\partial \theta ^j}\left( \frac{\partial U}{\partial y_{ij}} \right) =0, \quad i=1,\ldots ,{d}. \end{aligned}$$

After some calculations, we obtain the solution to the variational problem as

$$\begin{aligned} y(\theta )=\frac{1}{4}\left[ 6a^S G^{-1} + 4a^\ell I^{-1}+ |G|^{-1}\frac{\partial |G|}{\partial \theta }(5G^{-1}-4\Sigma ) +2G \nabla \cdot (5G^{-1}-4\Sigma ) \right] \Gamma , \end{aligned}$$

where $a^S=(a^S_1,\ldots ,a^S_{d})^T$, $a^\ell =(a^\ell _1,\ldots ,a^\ell _{d})^T$, $\Sigma =(\sigma ^{ij})_{ij}$, $\Gamma =(G^{-1}-4I^{-1}+4\Sigma )^{-1}$, $\nabla \cdot G^{-1}= (\partial g^{1j}/\partial \theta ^j, \ldots , \partial g^{{d} j}/\partial \theta ^j)^T$ and $\nabla \cdot \Sigma = (\partial \sigma ^{1j}/\partial \theta ^j, \ldots , \partial \sigma ^{{d} j}/\partial \theta ^j)^T$. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Giummolè, F., Mameli, V., Ruli, E. et al. Objective Bayesian inference with proper scoring rules. TEST 28, 728–755 (2019). https://doi.org/10.1007/s11749-018-0597-z

Download citation

Received: 02 December 2017
Accepted: 18 June 2018
Published: 12 July 2018
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s11749-018-0597-z

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Objective Bayesian inference with proper scoring rules

Abstract

Access this article

Similar content being viewed by others

Bayesian Variable Selection for Linear Models Using I-Priors

Detecting and diagnosing prior and likelihood sensitivity with power-scaling

Pseudo-Likelihoods for Bayesian Inference

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (pdf 5845 KB)

Appendix

1.1 Proof of Theorem 3.1

Proof

1.2 Proof of Theorem 4.1

Proof

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Objective Bayesian inference with proper scoring rules

Abstract

Access this article

Similar content being viewed by others

Bayesian Variable Selection for Linear Models Using I-Priors

Detecting and diagnosing prior and likelihood sensitivity with power-scaling

Pseudo-Likelihoods for Bayesian Inference

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (pdf 5845 KB)

Appendix

Appendix

1.1 Proof of Theorem 3.1

Proof

1.2 Proof of Theorem 4.1

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation