1 Introduction

In most contexts, existing data are not considered sufficient to implement effective policies and/or to make the best management decisions. In these cases, decision makers make use of other sources of information, among which the judgments of experts are widely used (Colson & Cooke, 2018). Experts have become indispensable in organizations because they “fill gaps in data and in the understanding of existing or missing data” (Benini et al., 2017, p. 1). From marketing (Larréché & Moinpour, 1983) to project management (Szwed, 2016), medicine (Bojke et al., 2021) to economics (Usher & Strachan, 2013), and natural resource management (Hemming et al., 2018) to risk assessment (Hanea et al., 2021) and terrorism (Gordon et al., 2015), the use of experts and their structured judgments has become widespread.

Whether the subjective opinions coming from expert evaluations have scientific value or not is still under debate, but there is broad agreement that the judgments of a group are better than those of a single expert. Furthermore, both the way in which the judgments are collected and the procedure by which they are aggregated to form a single evaluation are considered crucial (O’Hagan, 2019). If, within a properly built panel of experts, judgments are collected and aggregated according to validated protocols, then the subjective data have scientific validity, just like any other type of data (O’Hagan, 2019).

Expert judgments are widely used in many fields, and when dealing with future scenarios (and futures studies more generally), where no data about the future exist, the use of experts is essential. In many futures studies (FS) and strategic foresight methodologies, experts are used extensively, and the main problem remains that of using a formalized and structured procedure for first collecting and then combining experts’ judgments.

Statistically speaking, experts’ judgments are subjective data and should be gathered and treated as carefully and scientifically as possible (O’Hagan, 2019). For collecting subjective data, informally asking a single expert, interviewing several experts or arranging a focus group are useful techniques, but it is widely recognized that numerous cognitive biases invalidate the judgments made by experts during these procedures (see, among others, Bonaccorsi et al., 2020 and O’Hagan, 2019). The following is a short list of the most common distortions that occur during face-to-face meetings:

1) Leadership: when the highest-ranking person expresses an opinion, the others usually tend to follow him/her, and the risk is that many participants do not feel free to express their opinions for fear of conflict with the leader.

2) The spiral of silence: those whose opinions are in line with the majority feel more confident in expressing their views, while others fear that sharing their opinions could cause social ostracism. Therefore, the latter remain silent, and this leads to a spiralling process in which the opinions of the minority are more and more marginalized and suppressed (Di Zio & Staniscia, 2014).

3) Groupthink: the pressure to conform within a group interferes with the correct analysis of the problem and produces poor group decisions. In other words, groupthink involves distortion to minimize conflicts and to avoid challenging others’ views, during which individuals withhold their personal opinions (McCauley, 1998).

4) Group polarization: a distortion that affects people who work in groups and pushes them towards accepting opinions that are more extreme than their own individual beliefs (Isenberg, 1986).

To capture expert knowledge as objectively as possible, the gathering process must be structured carefully to avoid (or at least to minimize) such biases (O’Hagan, 2019). In the relevant literature, a number of procedures—actual elicitation protocols—have been developed that attempt to mitigate the distortions that are triggered when the judgments are collected and which can compromise the validity of the subjective estimates.

The technical activity that leads an expert to form and express an opinion is called expert elicitation (O’Hagan, 2019). Apart from the elicitation situation (e.g. interactive expert groups or Delphi-like procedures) and the mode of collecting the judgments (e.g. telephone or computer assisted), there are two crucial phases to any expert elicitation: a) the elicitation technique and b) the aggregation of views. A multiplicity of experts is preferable to a single expert because it guarantees the diversity of knowledge and background but involves the problem of aggregating the different points of view into a single solution, which implies the combination of several pieces of data (Meyer & Booker, 1991). The relevant literature distinguishes two main approaches to aggregation, known as behavioural—characterized by interactions among experts that seek consensus—and mathematical, in which separate judgments are obtained from non-interacting experts and then combined using a mathematical formula (Clemen & Winkler, 1999; Colson & Cooke, 2018; Meyer & Booker, 1991).

Many authors recommend the validation of expert judgments (Cooke, 1991; Colson & Cooke, 2018), which means, on the one hand, that the judgments should reflect the beliefs of the experts (and therefore be sheltered from cognitive biases) and, on the other, that these beliefs should reflect reality. Since the former cannot be measured, validation consists of a comparison between the elicited judgments and the observed data, but this is only possible if observed data exist (Colson & Cooke, 2018). In the case of future scenarios and, more generally, in FS, the crucial point is that no data exist with regard to the future. The French philosopher de Jouvenel (1967) points out the difference between accomplished facts, which have taken a form that is no longer modifiable (which he calls facta), and what is instead in the making and can still be realized in different forms (called, by contrast, futura). Therefore, the main problem with the use of expert opinions in FS, like in many other fields, remains that of collecting and aggregating the judgments.

The Delphi method is one of the most widely used techniques for gathering subjective data from groups of experts (Dalkey & Helmer, 1963; Linstone & Turoff, 1975). In the context of FS, the Delphi method is very popular and is often used in combination with the scenario method. A future scenario is a description of a possible future situation with the paths of development leading to that future (Kosow & Gaßner, 2008). When the outputs of Delphi are used as inputs for scenario building, the result is the so-called Delphi-based scenario approach (von der Gracht & Darkow, 2010; Di Zio et al., 2021), which is the approach adopted in this paper. While the Delphi technique is considered one of the most robust methods for the elicitation of subjective data, when moving from the Delphi outputs to the scenarios, the crucial steps are precisely those of the aggregation of expert judgments and the statistical data processing necessary for the subsequent steps of the scenario development.

In short, subjective data from experts are important and sometimes necessary, but if not collected, normalized, aggregated and treated correctly, they can backfire on the goals and the results of the research. In the words of Ayyub (2001), they can be a double-edged sword:

“Experts, despite their importance and value, can be double-edged swords. They can make valuable contributions from their deep base of knowledge, but those contributions may also contain their own biases and pet theories. Therefore, selecting experts, eliciting their opinions, and aggregating their opinions must be performed and handled carefully, with full recognition of the uncertainties inherent in those opinions.”

In this paper, we propose a new method for aggregating and processing subjective data collected with Delphi with the aim of obtaining robust rankings of Delphi projections that are useful for the subsequent phase of future scenario development. As we will see, the method allows the aggregation and normalization of the opinions of a panel of experts while modelling the various sources of uncertainty. Our proposal is based on uncertainty analysis, which implies the simultaneous use of different aggregation and normalization functions, so the result does not depend on the choice of a particular mathematical formula. This allows us to state that the method we propose in this paper preserves many of the advantages of both the mathematical and behavioural approaches, while limiting their disadvantages. The method allows all the opinions expressed by experts in the different rounds to be taken into account, but since there is no face-to-face communication, it avoids many cognitive biases (leadership, the spiral of silence, groupthink and group polarization), and in the aggregation phase, it does not require the subjective choice of a specific formula. Furthermore, with uncertainty analysis, we can also model the uncertainty related to the system of weights reflecting the different expertise of the participants as well as the expert opinion accuracy (with accuracy being defined as both under/overestimation and the precision of the expert scores). In general, it is shown that the method is very flexible because a researcher can easily incorporate his/her preferences on normalizing and aggregating data, as well as on modelling weights and score accuracy. The results of the method can be represented graphically in a very clear manner through a succession of intervals for each item. The information provided, for example, with regard to more or less overlapping intervals, as well as the lengths of the same intervals, can then be used for subsequent analyses.

A limitation, rather than a disadvantage, of the method is that in its current form it can only be applied to the ranking of the items but not to their ratings. The method is also conceptually applicable to the ratings, but at the price of reduced flexibility, because the ratings are not always comparable when the formula adopted to combine the opinions of the experts varies, unlike the rankings, which are always comparable since they are always ordinal numbers. An open problem is the full integration of the various qualitative phases (first focus groups and then the Delphi steps) within the proposed method. In fact, in this paper, while we also illustrate how it may be possible to integrate the initial phases of Delphi as well, only the last phase of Delphi provides input for our method.

By applying the method to the future of Italian families, we will show how the resulting rankings can be used as a base to build futures scenarios. By combining the Delphi method with this new robust ranking procedure, we offer a new protocol covering the elicitation, aggregation and processing of subjective data used in the construction of future scenarios. However, the method is very flexible and can be applied to the aggregation and processing of any kind of subjective judgments, even those outside the FS context.

The remainder of this paper is organized as follows: Sect. 2 contains the literature review, while Sect. 3 explains the proposed methodology. Section 4 contains a description of how the method was applied to future scenarios of Italian families along with the main results obtained. Finally, Sect. 5 contains the conclusion, limitations and proposals for future developments.

2 Technical literature review

FS exploit a wide range of methods, including some borrowed and adapted from other disciplines and others specifically designed for studying the future. It is not possible to review all the methods here, so we refer to the specialized literature (see Glenn & Gordon, 2009). However, the Delphi and scenario techniques are among the best-known and most widely used methods, both born in the very years when the future started to be studied with greater scientific rigor.

The Delphi method was developed in the 1950s at the RAND Corporation by Olaf Helmer and Norman Dalkey and then made public by Theodore J. Gordon and Olaf Helmer in 1964 (Gordon & Helmer, 1964). It is one of the most widely used and accepted techniques for gathering information from experts (Dalkey & Helmer, 1963; Linstone & Turoff, 1975), and over the course of more than half a century, since its inception, it has seen thousands of applications and numerous methodological variants. A selected panel of experts answers a series of questionnaires, containing both the research questions and feedback from prior questionnaires. Key features of the Delphi method are anonymity, iteration, controlled feedback and the statistical aggregation of responses (Linstone & Turoff, 1975; Rowe & Wright, 1999), and one of the most frequent objectives (but not necessarily the only one) is consensus on the issues under investigation. It is worth noting that a Delphi panel involves non-probability sampling techniques (e.g. purposive sampling or criterion sampling).

Although, as mentioned, there are many variations of this method, the main steps of classical Delphi are as follows: (1) the selection of the panel of experts; (2) the construction and submission of the first questionnaire; (3) the aggregation of the first-round responses, typically with appropriate statistical synthesis; (4) the submission of the second questionnaire: experts are asked to reassess their judgments regarding the same questions by considering the provided statistical synthesis; (5) the aggregation of the second-round responses; (6) the submission of the third questionnaire with the inclusion of the comments from the previous round; and (7) the iteration of steps 3–6 until a stopping criterion is reached. From the second round onwards, each expert has the chance to revise his/her evaluations, and this produces, at least in principle, a gradual reduction of the variability in the distribution of the judgments, thus triggering consensus.

Delphi logic makes it possible to overcome and/or minimize the distortions typical of face-to-face communication techniques, which are basically cognitive biases (Bonaccorsi et al., 2020; O’Hagan, 2019). Starting from the pioneering research of Tversky and Kahneman (1974), psychologists identified many biases in human judgments, and experts are not exempt from such heuristics. With the isolation and anonymity of participants and non-synchronous communication, Delphi resolves many face-to-face distortions (Di Zio et al., 2021).

Many studies show that Delphi outperforms many other elicitation techniques (Rowe & Wright, 1999), and in the context of FS, it is often used in combination with the scenario method. A future scenario can be defined as a description of a future situation together with the paths of development leading to it. It is a hypothetical construct and not a description of the future; therefore, the aim is not to predict the future (forecast) but rather to highlight the crucial projections of possible futures (foresight) and the key variables that will drive these developments (Kosow & Gaßner, 2008; Schoemaker, 1995). Like Delphi, the scenario method originated during the 1950s in a military context but is now commonly used for long-term planning. The goal of scenario planning is to develop a number (generally 3–4) of alternative future scenarios that taken together should cover, to a certain extent, possible, plausible, probable and surprising futures (Bishop et al., 2007; Fritschy & Spinler, 2019). Future scenarios are useful in supporting decision makers with regard to unveiling uncertainties and/or potential future threats (Bishop et al., 2007; Schoemaker, 1995).

The two methods, Delphi and scenario, are often combined, as many studies have shown that the outputs of scenario analysis can be used as inputs to improve a Delphi study, and the results of Delphi can also be used to facilitate scenario development (Nowack et al., 2011). The latter approach is also known as Delphi-based scenarios and is the most widespread. In fact, a future scenario is based on future projections, and Delphi is especially useful for the analysis of future projections (von der Gracht & Darkow, 2010). In a Delphi-based scenario procedure, a crucial phase involves the correct use of the Delphi results for the development of the scenarios. The judgments of the panel of experts resulting from the last Delphi round must be aggregated carefully to create a unique judgment for each projection, and then the aggregated results must be properly grouped to create clusters that form the basis for the development of the future scenarios (Di Zio et al., 2021).

The aggregation of the opinions of the experts who make up the panel is a problem that concerns not only Delphi-based scenarios but also all other contexts in which expert opinions are elicited. There are two main approaches to aggregation: mathematical and behavioural (Clemen & Winkler, 1999; Colson & Cooke, 2018; Meyer & Booker, 1991; O’Hagan, 2019). In the mathematical approach, single judgments are elicited from each expert on the panel and subsequently combined in a unique final solution using a mathematical formula, also called the pooling rule. In the behavioural approach, the experts of the panel are asked to discuss their opinions and then to come up with a unique final solution, which represents the consensus of the whole group.

Both approaches have advantages and disadvantages (O’Hagan, 2019). On the one hand, the mathematical approach solves the problems derived from the cognitive biases arising from face-to-face interaction (leadership, the spiral of silence, groupthink and group polarization), but, since there are many pooling rules, and no single one is considered to be the best, the researcher must make a subjective choice. Furthermore, mathematical aggregation does not highlight specific or particular opinions or the motivations of those who disagree with the majority, so the final result may not be an expression of any of the experts’ thoughts (Benini et al., 2017). On the other hand, the behavioural approach does not imply subjectively choosing an aggregation formula and allows everyone’s opinions to be taken into account. However, it involves many problems in seeking consensus among the experts and also includes the inevitable group cognitive biases. Moreover, the behavioural approach is often time consuming because generally conflicting points of view are difficult to agree on (Benini et al., 2017).

Because in the Delphi method there is no direct interaction between experts, and the result is the product of mathematical aggregation, it can be considered part of the mathematical family. However, it does involve seeking consensus and entails a particular form of interaction between the participants, not face to face but through a supervised arrangement. This interaction occurs starting from the second round, when each expert receives, in the form of statistical summaries, the results of the previous consultation. From the second round onwards, the experts can provide anonymous reasons for their judgments to which, starting from the third round (and always anonymously), the others can respond. In this way, a kind of anonymous and remote debate is triggered, also called an anonymous conference (Di Zio & Pacinelli, 2019), which therefore implies a certain form of interaction. It is, however, an interaction that, unlike face-to-face methods, avoids the cognitive biases typical of meetings in which people are in the same place, discussing issues face to face with limited time to complete the work. These are the reasons why some authors classify Delphi as a mathematical method (O’Hagan, 2019) and others consider it to be a behavioural method (Benini et al., 2017).

According to the aforementioned observations, the method proposed in this paper (based on the analysis of Delphi results) exploits the advantages of both approaches (mathematical and behavioural), and it can be classified as a mixed method, that is, an approach that uses and integrates multiple methods and, above all, mixes qualitative and quantitative methodologies. In studying complex phenomena, many scholars claim that a mixed-method approach is desirable, given the need to analyse the problem from many perspectives (Johnson & Onwuegbuzie, 2004; Sale et al., 2002).

Finally, an open question in the literature concerns the way in which the competence or degree of expertise of the participants can be measured and, consequently, how experts, and/or their evaluations, can be weighted (Sossa et al., 2019). Expert weighting is a continuous source of uncertainty in the aggregation of assessments, and in the methodology proposed here, we will also address the issue of how expert weights can be modelled. Although expert weights affect the results of any Delphi study, the problem of modelling them has received limited attention in the literature.

3 Aggregating expert opinions by combining mathematical and behavioural approaches

3.1 The proposed approach

Let \({X}_{ji}\) denote the assessment of item \(i=1,\dots ,I\) according to expert \(j=1,\dots ,J\). Examples of common assessments are the probability of occurrence, impact, plausibility, relevance and evolution. Generally, \({X}_{ji}\) is a (numeric) score. Sometimes, \({X}_{ji}\) is an ordered categorical variable. In this subsection, we describe a very general and flexible approach to combining expert opinions, which can easily be customized to handle the case of ordered categorical variables.

The most familiar way to combine expert opinions is by averaging \({X}_{ji}\) over the experts using the arithmetic mean (Szwed, 2016; Cooke, 1991). This approach corresponds to

\({}_{0}C_{i} = \frac{1}{J}\sum_{j = 1}^{J} X_{ji}\).

The arithmetic mean is very simple, and this is its main advantage because it is easily comprehensible to all stakeholders. However, looking at the arithmetic mean from a different perspective, its simplicity is also its main disadvantage. In fact, the arithmetic mean is very limited because:

  (i) the experts have the same weights; and

  (ii) the item assessments \({X}_{ji}\), \(j=1,\dots ,J\), are not normalized or standardized.

Point (i) is a drawback because it is generally advisable or desirable to weight experts differently according to their expertise, usually by self-weighting or, more generally, with performance weights resulting from a validation procedure, when possible (Cooke, 1991). Point (ii) is also a drawback because, in general, the assessment vectors of experts (\({X}_{j1},\dots ,{X}_{jI}\)), \(j=1,\dots ,J\), are not comparable due to their different locations and variability. Therefore, we do not suggest combining the raw assessments but rather adjusting them to achieve comparability. These considerations lead to

\({}_{(\cdot)1}C_{i} = \sum_{j = 1}^{J} w_{j}\, N(X_{ji}) \Big/ \sum_{j = 1}^{J} w_{j}\), where \(w_{j}\) denotes the weight of expert \(j\), and \(N(\cdot)\) denotes a normalization function. The usual constraints apply to the weights: \(w_{j} > 0\) and \(\sum_{j = 1}^{J} w_{j} = 1\).

The most familiar normalization functions are

$$ N_{1} \left( {X_{ji} } \right) = \frac{{X_{ji} - min\left( {X_{j.} } \right)}}{{max\left( {X_{j.} } \right) - min\left( {X_{j.} } \right)}} $$

and

\(N_{2}(X_{ji}) = \frac{X_{ji} - mean(X_{j.})}{sd(X_{j.})}\), where \(X_{j.} = (X_{j1},\dots,X_{jI})\) is the vector of expert j’s assessments, \(mean(X_{j.}) = \frac{1}{I}\sum_{i = 1}^{I} X_{ji}\) and \(sd(X_{j.}) = \sqrt{\frac{1}{I}\sum_{i = 1}^{I}\left(X_{ji} - mean(X_{j.})\right)^{2}}\). N1 is the linear scaling in the min–max range, and N2 is the z score standardizing the \(X_{j.}\) values such that the adjusted values have mean 0 and standard deviation 1.

Other quite familiar normalization functions are:

\(N_{3}(X_{ji}) = \frac{X_{ji} - median(X_{j.})}{mad(X_{j.})}\), where \(mad(X_{j.}) = median\left(\left| X_{j.} - median(X_{j.}) \right|\right)\);

\(N_{4}(X_{ji}) = \sum_{h = 1}^{I} \mathbb{I}(X_{ji} \ge X_{jh})\), where \(\mathbb{I}(X_{ji} \ge X_{jh}) = 1\) if \(X_{ji} \ge X_{jh}\) and 0 otherwise; and

$$ N_{5} \left( {X_{ji} } \right) = \frac{{X_{ji} }}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{I} X_{ji}^{2} } }} $$

and

\(N_{6} \left( {X_{ji} } \right) = \frac{{X_{ji} }}{{max\left( {X_{j.} } \right)}}\) .

N3 is a robust version of N2 and is preferred to N2 when analysing heavy-tailed or highly-skewed data. N4 is the rank transformation and is another option for a robust normalization function. N5 is an example of vector normalization using the Euclidean norm. N6 is a linear scaling similar to N1. N6 is preferred to N1 when 0 adjusted values are not desirable because they can lead to computation issues in the combination step of the procedure. However, it is also possible to consider

\(N_{7}(X_{ji}) = \frac{X_{ji} - min(X_{j.}) + \frac{1}{I}}{max(X_{j.}) - min(X_{j.}) + \frac{2}{I}}\), where \(1/I\) and \(2/I\) are added to the numerator and denominator, respectively, to avoid 0 values but retain linear scaling in the min–max range.
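To make these options concrete, the following is a minimal Python sketch of the normalization functions N1–N7 applied to one expert’s assessment vector. The function names, the use of NumPy and the example values are our own illustrative choices, not part of the original protocol.

```python
import numpy as np

def n1_minmax(x):
    # N1: linear scaling into the min-max range [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def n2_zscore(x):
    # N2: z score (population standard deviation, as in the text)
    return (x - x.mean()) / x.std()

def n3_robust(x):
    # N3: robust z score based on median and median absolute deviation
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / mad

def n4_rank(x):
    # N4: rank transformation (number of assessments not exceeding x_i)
    return np.array([(x <= xi).sum() for xi in x], dtype=float)

def n5_vector(x):
    # N5: vector normalization by the Euclidean norm
    return x / np.sqrt((x ** 2).sum())

def n6_max(x):
    # N6: linear scaling by the maximum
    return x / x.max()

def n7_shifted_minmax(x):
    # N7: min-max scaling shifted away from 0 by adding 1/I and 2/I
    i = len(x)
    return (x - x.min() + 1 / i) / (x.max() - x.min() + 2 / i)

# Example: assessments of one expert over I = 5 items on a 0-100 scale
x_j = np.array([50.0, 65.0, 40.0, 80.0, 55.0])
print(n7_shifted_minmax(x_j))
```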

The aggregation of expert opinions using the arithmetic mean is equivalent to the additive rule of combination. There are other types of combination. A quite familiar one is the multiplicative rule corresponding to the geometric mean:

\({}_{(\cdot)2}C_{i} = \prod_{j = 1}^{J} N(X_{ji})^{w_{j}}\).

Note that both 1C and 2C are special cases of the generalized power mean \(\left[ \sum_{j = 1}^{J} w_{j}\, N(X_{ji})^{p} \right]^{1/p}\) for p = 1 and p → 0, respectively. Other particular cases are

$$ {}_{(\cdot)3}C_{i} = min\left[ N(X_{1i}), \ldots, N(X_{Ji}) \right] $$

and

$$ {}_{(\cdot)4}C_{i} = max\left[ N(X_{1i}), \ldots, N(X_{Ji}) \right] $$

for p → -∞ and p → ∞, respectively. 4C is also called the Tippett combination. Other combination functions are:

\({}_{(\cdot)5}C_{i} = -\sum_{j = 1}^{J} \log\left(1 - N(X_{ji})\right) w_{j}\),

$$ {}_{(\cdot)6}C_{i} = \sum_{j = 1}^{J} \log\left(\frac{N(X_{ji})}{1 - N(X_{ji})}\right) w_{j} $$

and

\({}_{(\cdot)7}C_{i} = \sum_{j = 1}^{J} \Phi^{-1}\left(N(X_{ji})\right) w_{j}\), where \(\Phi^{-1}\) is the quantile function of the standard normal distribution. 5C is the Fisher combining function, 6C is the logistic combining function and 7C is the Liptak combining function. 6C and 7C are particular cases of \(\sum_{j = 1}^{J} F^{-1}\left(N(X_{ji})\right) w_{j}\), where \(F^{-1}\) is the quantile function of a continuous random variable. It is important to note that the 4C to 7C combining functions are quite familiar within the framework of the nonparametric combination of dependent tests (see Pesarin & Salmaso, 2010).
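As an illustration, the sketch below implements the combining rules 1C–7C for a single item, given the normalized scores of the J experts and weights assumed to sum to one. The function name and dispatch structure are ours; the rules themselves follow the formulas above.

```python
import numpy as np
from scipy.stats import norm

def combine(n_scores, w, rule):
    """Combine the normalized scores of the J experts for one item.

    n_scores: array of N(X_ji), j = 1..J; w: positive weights summing to 1.
    Rules C5-C7 assume scores strictly between 0 and 1 (e.g. after N7).
    """
    if rule == "C1":   # additive rule (weighted arithmetic mean)
        return np.sum(w * n_scores) / np.sum(w)
    if rule == "C2":   # multiplicative rule (weighted geometric mean)
        return np.prod(n_scores ** w)
    if rule == "C3":   # minimum (generalized power mean, p -> -inf)
        return n_scores.min()
    if rule == "C4":   # maximum (Tippett combination, p -> +inf)
        return n_scores.max()
    if rule == "C5":   # Fisher combining function
        return -np.sum(np.log(1.0 - n_scores) * w)
    if rule == "C6":   # logistic combining function
        return np.sum(np.log(n_scores / (1.0 - n_scores)) * w)
    if rule == "C7":   # Liptak combining function (standard normal quantiles)
        return np.sum(norm.ppf(n_scores) * w)
    raise ValueError(rule)
```

With N7-normalized scores, all seven rules can be applied to the same input and compared, which is exactly what the uncertainty analysis described below exploits.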

It is important to emphasize that there is no best combining function, just as there is no best normalization function. Every function has advantages and disadvantages. As we have already discussed, N3 and N4 are robust against heavy-tailed and highly-skewed data, that is, when outliers are present. However, they have disadvantages too: N3 should not be used when the data have low variability because both the median and the median absolute deviation from the median are unstable. N4 assesses the order of the data and not the data values, leading to a loss of information that can be relevant.

We turn our attention to combining functions. The additive rule is fully compensatory because low scores are compensated by higher scores. In some fields this is preferred, an example being the labour market, where compensating differentials are typical, and workers are often offered different combinations of salary and working conditions and can decide to work very close to home in exchange for a lower salary. However, there are fields where partially compensatory combinations, such as the geometric rule, are preferred; an example is the study of economic development, where high industrialization should not fully compensate for a high level of pollution or land use. Another important aspect when selecting the combining function is the compatibility with the normalized data. The additive combination has the advantage of being compatible with all the normalization functions. The multiplicative rule cannot be used with negative or zero values; therefore, it is not suitable for the z scores N2 and N3, for example. The Fisher, logistic and Liptak rules are only compatible with linear scaling N7. It is worth noting that some of these aggregation rules, which involve direct sums, are only appropriate if all the expert opinions are positively correlated; otherwise, there may be compensation issues. However, when the method is applied, as in our application, in a Delphi context, this assumption is not restrictive because the aggregation method only intervenes after the Delphi rounds have concluded. This means that if the convergence of opinions has been triggered during the Delphi rounds, in the end, the experts will reach a consensus with similar values.

The problem of selecting normalization and combination functions can be addressed by borrowing uncertainty analysis from the vast literature on composite indicators. Uncertainty analysis is a Monte Carlo-based technique that allows a very flexible modelling of normalization and combination (see Saisana et al., 2005 and Marozzi, 2021). The rationale behind uncertainty analysis is based on modelling the sources of uncertainty in aggregating expert opinions. There is always a certain degree of subjectivity in selecting a particular normalization and combination function, as there is in assigning weights to experts. Another important source of uncertainty is related to expert opinion accuracy. If a best formula existed to aggregate expert opinions, then its choice would be objective. Since there is no such formula, the choice of how to combine opinions becomes subjective. Uncertainty analysis simultaneously considers different formulas to aggregate expert opinions by varying the normalization and combination functions as well as the expert weights and opinion accuracy in each iteration of the Monte Carlo procedure. Therefore, the results do not depend on the particular formula subjectively selected by the scholar; instead, they are based on a very large number of different formulas with varying normalization and combination functions and expert weights and opinion accuracy. The uncertainty algorithms used here are explained in detail in Subsections 4.2 and 4.3.

Another advantage of uncertainty analysis is that it is possible to assess the effect of the sources of uncertainty on the aggregation of expert opinions, making it possible to assess whether the results are robust or volatile with respect to them. This is a very important point because the strength of the message that is transmitted through the data analysis depends on the robustness of the results. Let \(M\) be the number of Monte Carlo iterations. \(M\) is set at a large number like 10,000 to account for the continuous sources of uncertainty (expert weights and expert opinion accuracy). Note that normalization and combination are discrete sources of uncertainty. The output of uncertainty analysis is an \(M\times I\) matrix \(Q=\left[{q}_{mi}\right]\), whose m-th row (\({q}_{m1},\dots ,{q}_{mI}\)) contains the combined scores (after normalization) of the I items corresponding to the m-th iteration of uncertainty analysis. We would like to rank the items according to the expert assessments. Therefore, we compute the ranks of the output matrix row-wise. The resulting matrix is denoted by \(R=\left[{r}_{mi}\right]\). \(R\) contains the item rankings in its rows, each of which corresponds to a particular combination of normalized scores as well as expert weights and expert opinion accuracy (as shown in Subsections 3.2 and 3.3). Column \(i\) (\({r}_{1i},\dots ,{r}_{Mi}\)) of R contains the uncertainty distribution of the rank of item \(i\), reflecting a plurality of different formulas aggregating expert opinions. Note that using the traditional approach, one obtains just one rank for item \(i\), making it impossible to assess the uncertainty and hence the robustness of the result. Moreover, a unique rank value is dependent on the particular formula used to compute the combined scores. A different formula could lead to a different value. Instead, the uncertainty distribution of rank \(i\) can be summarized by computing the median, which is almost unaffected by the way expert opinions are aggregated (see Di Zio et al., 2021). The robustness of the results can be addressed by computing the 5th–95th percentile uncertainty intervals. Short intervals mean that the corresponding rank value is very stable across different aggregating formulas. Conversely, long intervals mean that the result depends on the selection of particular formulas. Markedly overlapping intervals among several item rankings mean that those ranks are rather similar.
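A small sketch of the post-processing step, assuming the \(M\times I\) matrix Q of combined scores has already been produced: it computes the rank matrix R row-wise and the 5th, 50th and 95th percentiles of each item’s rank distribution. Assigning rank 1 to the highest combined score in a run is our convention for the example; the method only requires a consistent ordering.

```python
import numpy as np

def summarize_uncertainty(Q):
    """Q: (M x I) matrix of combined item scores, one row per Monte Carlo run.

    Returns the (M x I) rank matrix R (rank 1 = highest combined score in a
    run, by convention) and the 5th, 50th and 95th percentiles of each
    item's rank uncertainty distribution.
    """
    # rank within each row: sort scores in descending order, then invert
    order = np.argsort(-Q, axis=1)
    R = np.empty_like(order)
    rows = np.arange(Q.shape[0])[:, None]
    R[rows, order] = np.arange(1, Q.shape[1] + 1)
    percentiles = np.percentile(R, [5, 50, 95], axis=0)
    return R, percentiles

# Example with made-up scores for M = 4 runs and I = 3 items
Q = np.array([[0.20, 0.80, 0.50],
              [0.30, 0.70, 0.60],
              [0.10, 0.90, 0.40],
              [0.25, 0.85, 0.55]])
R, p = summarize_uncertainty(Q)
print(p)  # rows: 5th, 50th and 95th percentile of each item's rank
```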

3.2 Modelling the weights of experts

How can experts be weighted? This is an open question. Expert weighting is an important source of uncertainty when aggregating expert opinions. It is common to link expert weight to expertise, assigning different weights to experts according to their levels of expertise. There is no agreement among scholars on how to measure expertise (Sossa et al., 2019). Common methods are expertise evaluation by peers, the number of publications or citations, citation impact and the number of years spent working on the subject. Another very common situation is when experts are requested to self-evaluate their level of expertise related to each of the \(I\) items, and this leads to a vector of weights for each expert (\({w}_{j1},\dots ,{w}_{jI}\)). The vector has a length of less than \(I\) in the case that experts are requested to self-evaluate their expertise for groups of items, where different groups relate to different dimensions of the phenomenon.

A simple approach is to assign each expert a constant weight, as a summary of (\({w}_{j1},\dots ,{w}_{jI}\)), using the mean, for example:

\(w_{j} = \frac{1}{I}\sum_{i = 1}^{I} w_{ji}\).

The mean can be preferred to using (\(w_{j1} , \ldots ,w_{jI}\)) directly if it is expected or assumed that experts are overly lenient in self-evaluating their expertise with regard to some items while being overly strict in self-evaluating their expertise in terms of other items, because the mean compensates lower values with higher ones. On the other hand, one may prefer to use the \(w_{j1} , \ldots ,w_{jI}\) values directly. Uncertainty related to expert weights can be modelled similarly to normalization and combination using uncertainty analysis. It is generally assumed that the \(w_{ji}\) values, being estimates, are affected by measurement error. Uncertainty can be modelled by applying an additive random error. In this case, in each step of the Monte Carlo procedure we compute

\({}_{1}u(w_{ji}) = w_{ji} + \eta_{ji}, \; \forall j,i\), where \({\eta }_{ji}\) is a random error term. For maximum flexibility, the error term can always be the same (\({\eta }_{ji}=\eta \)) or differ item-wise (\({\eta }_{ji}={\eta }_{i}\)) if different precision is assumed in self-evaluating one’s expertise regarding different items, when, for example, some items concern much debated topics for which information is lacking, while other items are simpler and concern topics about which knowledge is vast and consolidated. It can also be expert-wise (\({\eta }_{ji}={\eta }_{j}\)) if it is assumed that experts have different levels of precision in self-evaluating their expertise, or cross expert-item-wise if an interaction between the expert and item factors is assumed. The most common distribution for the error is normal with mean 0 and standard deviation \(sd\left({w}_{j.}\right)/5\). In the case that only a single weight value is used for each expert, the standard deviation of \({w}_{1},\dots ,{w}_{J}\) can be used in place of \(sd\left({w}_{j.}\right)\). The use of 1/5 is in accordance with general practice (see Saltelli & Saisana, 2010). Different values can be used to reflect generally higher/lower precision in the self-evaluations. A 0 mean reflects neutrality, while if it is assumed that experts are too lenient (strict) in their self-evaluations, a negative (positive) value is set for the error mean (think, for example, of the well-known cognitive bias of overconfidence). Distributions other than the normal one can also be considered. For example, if very large or small weights are observed with respect to the others, it makes sense to use a distribution with heavier than normal tails. A skewed distribution can be used in the case that there is reason to assume the error term distribution is not symmetric.
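A minimal sketch of the additive perturbation of one expert’s self-evaluated weights; the same mechanism applies to the assessments \(X_{ji}\) in Subsection 3.3. The normal error and the 1/5 factor follow the practice cited above; the function name, the mean-shift parameter and the example values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_additive(w_j, scale_divisor=5, mean_shift=0.0):
    """Additive error 1u(w_ji) = w_ji + eta_ji for one expert's weight vector.

    eta is normal with the given mean shift (0 = neutrality, negative if
    experts are assumed too lenient) and sd = sd(w_j.) / scale_divisor.
    """
    sd = w_j.std() / scale_divisor
    eta = rng.normal(loc=mean_shift, scale=sd, size=w_j.shape)
    return w_j + eta

# Example: self-evaluated expertise of one expert over 7 areas (0-100 scale)
w_j = np.array([70.0, 55.0, 80.0, 60.0, 75.0, 50.0, 65.0])
print(perturb_additive(w_j))
```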

An alternative way to model uncertainty related to weights is by applying a multiplicative random error:

\({}_{2}u(w_{ji}) = w_{ji}\,\beta_{ji}, \; \forall j,i\).

We suggest generating \({\beta }_{ji}\) from a beta distribution defined on \(\left[{a}_{ji},{b}_{ji}\right]\), where \(0<{a}_{ji}<{b}_{ji}\). For example, by setting \({a}_{ji}=0.75\) and \({b}_{ji}=1.25\), we can model weight uncertainty by letting the weights vary within ± 25% of their original values. This approach is very flexible, like the additive one, because \({a}_{ji}\) and \({b}_{ji}\) can also be set at different values expert-wise, item-wise or cross expert-item-wise to reflect different precision in the self-evaluation of expertise. Moreover, if lenient (strict) self-evaluation is assumed, \({b}_{ji}<1\) (\({a}_{ji}>1\)) can be set. The beta distribution is also very flexible in terms of modelling the shape of the weight distribution. In fact, by changing its two shape parameters, it is possible to generate symmetric, skewed, U-shaped, bell-shaped or J-shaped weight distributions.
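A sketch of the multiplicative perturbation, obtained by rescaling a standard beta variate onto \([a, b]\); the shape parameters used here are illustrative and can be changed to produce symmetric, skewed, U-, bell- or J-shaped factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_multiplicative(w_j, a=0.75, b=1.25, alpha=2.0, beta=2.0):
    """Multiplicative error 2u(w_ji) = w_ji * beta_ji with beta_ji in [a, b].

    A Beta(alpha, beta) variate on [0, 1] is rescaled to [a, b]; with
    a = 0.75 and b = 1.25 the weights vary within +/- 25% of their values.
    """
    factor = a + (b - a) * rng.beta(alpha, beta, size=w_j.shape)
    return w_j * factor

w_j = np.array([70.0, 55.0, 80.0, 60.0, 75.0, 50.0, 65.0])
print(perturb_multiplicative(w_j))
```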

3.3 Modelling the accuracy of expert opinions

The accuracy of expert opinions is another important source of uncertainty. \({X}_{j.}=({X}_{j1},...,{X}_{jI})\) is the vector of the assessments of expert \(j\), and since they are estimates, it makes sense to account for their accuracy. The rationale is similar to that behind modelling expert weight uncertainty. The most common approach is to assign an additive random error to expert assessments.

\({}_{1}u(X_{ji}) = X_{ji} + \delta_{ji}, \; \forall j,i\), where \({\delta }_{ji}\) is a random number generated from a normal distribution with mean 0 and standard deviation \(sd\left({X}_{j.}\right)/5\). Values other than 5 for the denominator of the \({\delta }_{ji}\) standard deviation can be used to reflect higher or lower precision in the expert assessments. Positive or negative values for the mean of the error term can be used if it is assumed that experts tend to overestimate or underestimate the item assessments. In theory, it is also possible to assign different random error means to different item assessments if it is assumed or expected that some items are more prone than others to underestimation or overestimation. Again, this concerns the effects of cognitive biases; an example is the anchoring bias, according to which an expert assessment is influenced by a particular reference or previous estimate. Similarly, it is also possible to assign different random error means to different experts if it is assumed or expected that some experts are more prone than others to underestimate or overestimate their assessments. Distributions other than the normal one can be used if there is reason to reflect skewness or heavier than normal tails when addressing expert assessment uncertainty.

The framework we are presenting is very flexible, allowing one to also link expert assessment uncertainty to expert weights. Two examples are as follows:

  (i) \(sd(\delta_{ji}) = f(w_{ji})\), where \(f\) is a non-increasing function of \({w}_{ji}\). The rationale is that the error terms for experts with larger weights have smaller standard deviations to reflect higher precision in item assessments due to higher expertise. Conversely, experts with smaller weights have larger standard deviations to reflect lower precision in item assessments due to lower expertise;

  (ii) \(sd(\delta_{ji}) = g(X'_{.i})\), where \(g\) is a function of \(X'_{.i}\), the vector of expert assessments for item \(i\) within a preliminary round of the Delphi study. The rationale is to assign error terms with larger standard deviations (lower precision) to items whose assessment shows low convergence during the Delphi study and smaller standard deviations (higher precision) to items whose assessment shows high convergence. Within this logic, we associate the standard deviations with the degree of consensus reached in Delphi. A small sketch of both options follows this list.
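The sketch below shows one possible instantiation of the two options; the specific functional forms of \(f\) and \(g\) (inverse dependence on the weight, spread of the preliminary-round assessments divided by a constant) are illustrative assumptions, not prescriptions from the original text.

```python
import numpy as np

def sd_from_weight(w_ji, k=10.0):
    # Option (i): non-increasing function of the expert's weight;
    # larger weight (expertise) -> smaller error standard deviation.
    return k / (1.0 + w_ji)

def sd_from_convergence(x_prelim_i, divisor=5.0):
    # Option (ii): standard deviation tied to the dispersion of the
    # preliminary-round assessments of item i (low convergence -> larger sd).
    return np.std(x_prelim_i) / divisor

# Examples with illustrative numbers
print(sd_from_weight(80.0))                                   # expert with high expertise
print(sd_from_convergence(np.array([40.0, 55.0, 70.0, 65.0, 50.0])))
```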

Similarly to the uncertainty related to weights, a multiplicative approach can be considered to model expert assessment uncertainty. In this case, in each step of the uncertainty analysis we compute

\({}_{2}u(X_{ji}) = X_{ji}\,\gamma_{ji}, \; \forall j,i\).

We suggest generating the multiplicative error \({\gamma }_{ji}\) from a beta distribution defined on \(\left[{c}_{ji},{d}_{ji}\right]\), where \({0<c}_{ji}<{d}_{ji}\). If it is assumed that there is no systematic under- or overestimation in the expert assessments of the items, set \({c}_{ji}<1\) and \({d}_{ji}>1\). For example, by setting \({c}_{ji}=0.8\) and \({d}_{ji}=1.2\), we model assessment uncertainty by letting the assessments vary within ± 20% of their original values. If underestimation is assumed, set \({c}_{ji}>1\); for example, by setting \({c}_{ji}=1.1\) and \({d}_{ji}=1.3\), we let the assessments vary between +10% and +30% of their original values. If overestimation is assumed, set \({d}_{ji}<1\); for example, by setting \({c}_{ji}=0.75\) and \({d}_{ji}=0.9\), we let the assessments vary between −25% and −10% of their original values. As before, \({c}_{ji}\) and \({d}_{ji}\) can also be set differently expert-wise, item-wise or for both the expert and the item, to reflect different assessment accuracy across experts, different accuracy in assessing different items or an interaction between the expert and item factors. As already discussed in Subsection 3.2, the beta distribution is also very flexible in terms of modelling the shape of the expert assessment distribution.

4 An application for the development of Delphi-based scenarios

4.1 The data set

The application of the method presented in Sect. 3 is illustrated by analysing data from a Delphi study on the future of Italian families (see Bolzan, 2018). \(I=41\) items were considered. Each item is a short statement aimed at describing a specific aspect of the future of families. The items are grouped in seven different thematic areas (see Appendix A).

The Delphi study was carried out in three successive rounds through computer-assisted web interviewing (CAWI), using the open source online statistical survey web app LimeSurvey (www.limesurvey.org). Experts were asked to assess items in terms of their evolution and relevance over 10 years (the study started in 2018) using a 0–100 discrete scale in increments of 5, with scores of less than 50 indicating a decrease, scores larger than 50 showing an increase and scores equal to 50 indicating invariance with regard to evolution and relevance over the 10 years. Initially, 32 experts were involved, while \(J=30\) participated in all phases of the study, corresponding to a rather low dropout rate of around 6%. Experts were also asked to self-evaluate their expertise using a similar 0–100 discrete scale in increments of 5 regarding the 7 areas into which the items were grouped. After the last Delphi round, we aimed to combine the opinions of the experts regarding the items in order to rank them from the first to the last in terms of evolution. This ranking allows the grouping of items into a certain number of clusters, which forms the basis for the construction of future scenarios (Di Zio et al., 2021).

4.2 Robust weighted aggregation of expert opinions

In this subsection, we illustrate the practical application of the method presented in Sect. 3 to the data set described in Subsection 4.1. More precisely, we would like to combine expert opinions on item evolution. Among the combining functions listed in Subsection 3.1, we selected the additive function

$$ {}_{(\cdot)1}C_{i} = \sum_{j = 1}^{J} w_{j}\, N(X_{ji}) \Big/ \sum_{j = 1}^{J} w_{j} $$

and the multiplicative function

\({}_{(\cdot)2}C_{i} = \prod_{j = 1}^{J} N(X_{ji})^{w_{j}}\), while from the normalization functions, we chose the z score

\(N_{2}(X_{ji}) = \frac{X_{ji} - mean(X_{j.})}{sd(X_{j.})}\), the rank transformation

$$ N_{4}(X_{ji}) = \sum_{h = 1}^{I} \mathbb{I}(X_{ji} \ge X_{jh}) $$

and the linear scaling

\(N_{6} \left( {X_{ji} } \right) = \frac{{X_{ji} }}{{max\left( {X_{j.} } \right)}}\) .

These selections are both objective and subjective. For example, the z score N2 is preferred to its robust version N3 because the item scores are clustered in rather few distinct values, leading to not very informative values for the median and the median absolute deviation from the median. We could also have picked N7 in place of N6, but we preferred N6 because it is simpler.

Algorithm 1 was run for the first application based on the following steps:

[Algorithm 1: Monte Carlo steps of the robust weighted aggregation; original figure not reproduced here]
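Since the algorithm figure is not reproduced here, the following Python sketch illustrates how such a Monte Carlo procedure might be assembled from the choices stated above (normalizations N2, N4 and N6; additive and multiplicative combination; additive errors with denominator 5 for weights and assessments). It is a reconstruction under these assumptions, not the authors’ exact algorithm; in particular, the fallback to the additive rule when zero or negative normalized scores occur and the ranking convention are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_aggregation(X, W, M=10000):
    """X: (J x I) expert-by-item assessments; W: (J,) expert weights.

    Returns the (M x I) rank matrix of the uncertainty analysis, using the
    normalization and combination choices of Subsection 4.2.
    """
    J, I = X.shape
    norms = {
        "N2": lambda x: (x - x.mean()) / x.std(),                       # z score
        "N4": lambda x: np.array([(x <= xi).sum() for xi in x], float), # ranks
        "N6": lambda x: x / x.max(),                                    # max scaling
    }
    R = np.empty((M, I), dtype=int)
    for m in range(M):
        # perturb weights and assessments with additive normal errors (sd / 5)
        w = W + rng.normal(0, W.std() / 5, size=J)
        Xm = X + rng.normal(0, X.std(axis=1, keepdims=True) / 5, size=X.shape)
        w = w / w.sum()
        # draw a normalization function at random and apply it expert-wise
        N = norms[rng.choice(list(norms))]
        Xn = np.vstack([N(Xm[j]) for j in range(J)])
        # draw a combination rule; the multiplicative rule needs positive scores
        if rng.random() < 0.5 or (Xn <= 0).any():   # additive rule 1C
            C = (w[:, None] * Xn).sum(axis=0)
        else:                                        # multiplicative rule 2C
            C = np.prod(Xn ** w[:, None], axis=0)
        # rank items (rank 1 = highest combined score, by convention)
        order = np.argsort(-C)
        R[m, order] = np.arange(1, I + 1)
    return R
```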

4.3 Results

The results are displayed in Fig. 1. Dots represent the median ranks; vertical segments represent the 5th–95th percentile uncertainty intervals of the ranks. Items are ranked from the one with the highest median rank in terms of evolution (item 31) to the one with the lowest median rank regarding evolution (item 7).

Fig. 1 Robust ranking of items based on their evolution

From Fig. 1, it can be observed that the items located in the extreme positions are characterized by lower uncertainty, while the items that occupy the central positions are characterized by greater (although limited) uncertainty. The reasons may lie in the complexity of the topics the items refer to, although other sources cannot be excluded, such as the way the concepts were worded in the questionnaire, which may have caused difficulties in understanding.

The items showing greater uncertainty—in the central part of the graph—refer to the following areas of the questionnaire: parents, housing, and policy and services (see Appendix A). These fields can be associated with the “public sphere” of the family. Conversely, greater robustness of the rank estimates is observed for the items of the following areas: spouses, extended family, children, family models, communication and solidarity. These are issues on which experts’ opinions are expressed more immediately, and they recall the idea of the “private sphere” of the family.

The two items showing the least robustness are item 6 (parents will invest in their role as educators of their children) and item 11 (young people will tend to remain in their families of origin once they find employment). Already in the previous phases of the research (see Di Zio et al., 2021), the indicators of the Delphi process had revealed that the experts experienced some difficulty in reaching a consensus on these two items. One possible explanation is that these are issues—particularly in Italy—for which family ties are very strong, even among young people who have completed periods of study far from their families of origin (Ambrosini & Rosina, 2009; Bolzan, 2018). In addition, views on raising children in Italy remain heterogeneous owing to the strong disparity in actions and interests between the two parents. Therefore, these are two highly debated issues on which it is difficult to reach a consensus, and consequently, the possible future developments are very uncertain.

The previous algorithm was re-run by changing the standard deviation of \(\delta \) in Step 5.4 to show the effect of different levels of precision in the expert assessments of the items. More precisely, we set \(sd\left( X \right)/2\) in case (ii) and \(sd\left( X \right)\) in case (iii). Therefore, we model the expert assessment accuracy with lower precision with respect to case (i), where \(sd\left( X \right)/5\) was used. Table 1 reports the results for cases (i) to (iii), along with those of case (iv), which will be discussed later.

Table 1 5th, 50th and 95th percentiles of item rank uncertainty distributions for evolution

The results are very interesting. There are very clear patterns in the uncertainty interval extremes as well as in the uncertainty interval lengths item-wise. In particular, we emphasize that:

  • The median ranks are mostly the same across cases (i) to (iii) (and (iv), as discussed later);

  • The 5th percentiles tend to decrease as the precision in expert assessments decreases;

  • The 95th percentiles tend to increase as the precision in expert assessments decreases; and

  • The uncertainty interval lengths increase as the precision decreases.

The results for cases (i) to (iii) show that the rankings of items according to expert opinions are mostly unaffected by assessment precision, whereas rank uncertainty is markedly affected, with higher precision leading to lower uncertainty. The results also show that the method is very useful for what-if analyses. In fact, the researcher can easily change the various settings of the previous algorithm to find out the impact on the results, as we have done here to show the effect of different levels of precision in expert assessments.

We will now describe a fourth application, denoted by case (iv), of the method. We aim to showcase the flexibility of the method. Algorithm 2 was run for case (iv) based on the following steps:

[Algorithm 2: Monte Carlo steps for case (iv); original figure not reproduced here]

With respect to cases (i) to (iii), in case (iv), the precision of the expert assessments is modelled in a much more general manner. The standard deviation used for item scoring is constant in cases (i) to (iii), whereas in case (iv) both its numerator and denominator vary. More precisely, the numerator changes with the expert, while the denominator depends on the expert’s self-evaluated expertise for the class of items (higher self-evaluated expertise implies higher assessment precision, and vice versa) and therefore changes with the interaction between experts and items. Table 1 shows that for case (iv), the median ranks are mostly the same as before. The level of uncertainty, as indicated by the 5th–95th percentile interval length, is intermediate between those of cases (ii) and (iii).
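A hedged sketch of how the case (iv) error standard deviation might be specified under our reading of the description above: the numerator is the expert-specific assessment spread, and the denominator grows with the expert’s self-evaluated expertise for the thematic area of the item. The specific functional form and constants are illustrative assumptions.

```python
import numpy as np

def sd_case_iv(X_j, w_j_area, base=5.0):
    """Standard deviation of delta_ji for case (iv).

    X_j: assessments of expert j; w_j_area: expert j's self-evaluated
    expertise (0-100) for the area of item i. Higher expertise -> larger
    denominator -> smaller error (higher precision); illustrative form.
    """
    denominator = base * (1.0 + w_j_area / 100.0)
    return np.std(X_j) / denominator

# Example: expertise 80 in the item's area vs expertise 20
x_j = np.array([50.0, 65.0, 40.0, 80.0, 55.0])
print(sd_case_iv(x_j, 80.0), sd_case_iv(x_j, 20.0))
```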

Many other additional analyses were performed. The weight of the experts was also modelled using the multiplicative approach. Many beta distributions with very different shapes were considered along with many different settings for density support. These additional results are not reported because they are very consistent with those corresponding to the additive modelling of experts’ weights.

We emphasize, on the one hand, the subjectivity of the algorithm settings, that is, the flexibility of the method, which allows the researcher’s preferences (regarding errors, weights, etc.) to be incorporated; on the other hand, the robustness of the results, since the median ranks are almost unaffected by the algorithm settings, together with the information on uncertainty, which does change according to the researcher’s choices. This uncertainty can then become an important input for further analyses, for example, for constructing future scenarios on families; this shows that even though the median ranks remain stable, result uncertainty must still be addressed.

4.4 Future scenario development

As mentioned in Subsection 4.1, the experts evaluated 41 items (i.e. 41 Delphi future projections) in terms of their relevance and evolution. The approach used for the development of the future scenarios is that of Delphi-based scenarios (Di Zio et al., 2021; von der Gracht & Darkow, 2010), where Delphi outputs are grouped in clusters representing draft scenarios. According to our proposal, before grouping the items, it is necessary to create rankings, built as described above, to consider all the sources of uncertainty that occur in a Delphi elicitation.

In many classical methods, the Delphi results are aggregated by choosing one formula for normalization and one for aggregation, but in this way, the resulting future scenarios are heavily dependent on these subjective choices. In our proposal, on the other hand, the results are robust, precisely because from a Monte Carlo simulation perspective, different formulas are considered simultaneously.

The aggregated judgments, in the form of rankings on evolution and relevance, were used as input in a fuzzy c-means clustering algorithm. This is a non-hierarchical clustering technique (Josien & Liao, 2000) that enables the detection of potential overlaps of items across scenarios. The resulting clusters were then refined by a panel of experts to assess their plausibility and consistency (Di Zio et al., 2021).
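For completeness, a compact fuzzy c-means sketch in the same spirit: each item is described by its median rank on evolution and on relevance, and the membership matrix reveals items that overlap across clusters. The implementation below is a generic textbook version of the algorithm with made-up example ranks, not the specific software or data used in the study.

```python
import numpy as np

def fuzzy_cmeans(data, c=4, m=2.0, n_iter=200, seed=0):
    """Plain fuzzy c-means. data: (n_items x n_features); returns (centers, U).

    U[k, i] is the membership of item i in cluster k; items with sizeable
    membership in several clusters flag potential overlaps across scenarios.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # memberships sum to 1 per item
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ data) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(data[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-10)
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1)), axis=1)
    return centers, U

# Example: median ranks (evolution, relevance) for a handful of items
ranks = np.array([[1, 3], [2, 1], [40, 38], [39, 41], [20, 19], [21, 22]], float)
centers, U = fuzzy_cmeans(ranks, c=3)
print(np.round(U.T, 2))   # membership of each item in the 3 clusters
```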

For an improved definition of the scenarios in future developments of this approach, it might be useful to consider some measures of dependence among items in order to select a subset of items that carries most of the information.

5 Conclusion

The main contribution of the method proposed in this paper concerns the integration of both the mathematical and behavioural approaches, exploiting some advantages of both while limiting the disadvantages. By using multiple methods and mixing qualitative and quantitative methodologies, this approach can be included in the mixed-methods category.

It has been shown that uncertainty analysis does not require subjectively selecting a specific mathematical formula to combine expert scores, a normalization formula and a particular weight rule. This is an important point because there is no optimal formula for aggregation and for normalization, nor an optimal weight rule. It has also been shown that uncertainty analysis can model expertise as well as expert scoring accuracy. Another advantage of the proposed method is that its results can be communicated very simply with an interval graph and are therefore easy to understand, even for stakeholders without mathematical skills.

It has been discussed that a limitation, rather than a disadvantage, of the method is that in its current form, it can be applied to item ranking but not to their ratings. The method is also conceptually applicable with regard to generating ratings, but with far lower flexibility; in fact, ratings are not always comparable when the mathematical formula used varies, unlike rankings, which are always comparable as they are ordinal numbers. An open problem is the full integration of the various qualitative phases within the proposed method. In fact, in this paper, while we also illustrate how it may be possible to integrate the initial phases of Delphi as well, only the results of the last Delphi round are used as inputs for the algorithm. It would be of great interest to explore how both the rankings and the uncertainty intervals change over the course of the different Delphi rounds. Presumably, since the ranks are robust, they may vary slightly, but the variations in the uncertainty intervals throughout the rounds would provide very useful information about the dynamics of consensus development among experts.

The what-if analyses mentioned above could be very useful for the construction of future scenarios. In fact, by modulating the various settings of the algorithm(s) differently, the distinct results in terms of uncertainty intervals provide valuable material for the experts and for the research team in the subsequent steps of evaluation and the refinement of the future scenarios.

Finally, we would like to point out that in future developments of the proposed method it would be useful to take into account the assignment of different weights not only to the experts but also to the items, according to the possible different objectives of the study in question.