Abstract
In many situations we are interested in appraising the value of a certain characteristic for a given individual relative to the context in which this value is observed. In recent years this problem has become prominent in the evaluation of scientific productivity and impact. A popular approach to such relative valuations consists in using percentile ranks. This is a purely ordinal method that may sometimes lead to counterintuitive appraisals, in that it discards all information about the distance between the raw values within a given context. By contrast, this information is partly preserved by using standardization, i.e., by transforming the absolute values in such a way that, within the same context, the distance between the relative values is monotonically related to the distance between the absolute ones. While there are many practically useful alternatives for standardizing a given characteristic across different contexts, the general problem seems to have never been addressed from a theoretical and normative viewpoint. The main aim of this paper is to fill this gap and provide a conceptual framework that allows for this kind of systematic investigation. We then use this framework to prove that, under some rather weak assumptions, the general format of a standardization function can be determined quite sharply.
Notes
It is not our intention, in this paper, to take a stand on the controversial issue concerning the impact of bibliometric analysis itself, especially when applied to the evaluation of individuals, on the overall quality of research. We only aim at a methodological contribution that is neutral with respect to the different policies that may be adopted to promote the growth of high quality scientific knowledge. However, our analysis may also shed some light on the whole problem and be used to support empirical analysis or simulations of different policies.
Recall that the z-score of \(w_{i}\) with respect to \({\mathbf {w}} = (w_{1},\ldots ,w_{n})\) is equal to \(\frac{w_{i}-\mu ({\mathbf {w}})}{\sigma ({\mathbf {w}})}\), where \(\mu ({\mathbf {w}})\) and \(\sigma ({\mathbf {w}})\) are the mean and the standard deviation of \({\mathbf {w}}\). In the example the z-scores of \(y_{10}\) and \(x_{11}\) are respectively 0.5497 and 2.0948.
Depending on the choice of statistics, it may be the case that \(\varOmega ^{*}\) cannot coincide with \(\varOmega\). For example, if our standardization function is the z-score or the max-min, \(\varOmega ^{*}\) cannot include vectors whose components are all equal, such as (3, 3, 3, 3), for in this case the standard deviation (respectively, the range) is 0, and so the standardization function would require dividing by 0.
Here by “projection” we mean any mapping from a vector to the value occupying a given position. So, the jth projection map \(\textit{proj}_{j}\) is defined as the function mapping each vector \({\mathbf {x}}\) containing at least j elements to its jth component \(x_{j}\).
Recall we assume that \(\varPhi\) is finite.
For technical convenience, we assume that the standardization function is mathematically well-defined even when the first argument is an arbitrary real value that does not belong to the context in the second argument, as is the case with the usual standardization functions. The same assumption is made for Property A4.
Statistics are usually classified into three general classes: location statistics (e.g., mean, median, mode, quantiles, minimum and maximum), dispersion statistics (e.g., variance, standard deviation, range, interquartile range), and shape statistics (e.g., skewness, kurtosis). In our terminology, the class of dispersion statistics also includes the shape statistics.
Recall that \(\varPhi\) is assumed to be non-redundant.
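The z-score recalled in the notes above, together with the degenerate case of a constant context, can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `z_score` and the sample context are hypothetical, and the population standard deviation is assumed, since the notes do not say which version is intended.

```python
import statistics

def z_score(w_i, context):
    """Return (w_i - mean) / std for the given context.

    Assumption: the population standard deviation is used; the source does
    not say whether the population or the sample version is intended.
    """
    mu = statistics.fmean(context)
    sigma = statistics.pstdev(context)
    if sigma == 0:
        # Constant contexts such as (3, 3, 3, 3) must be excluded from the domain.
        raise ValueError("context has zero standard deviation")
    return (w_i - mu) / sigma

context = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data, not the paper's example
scores = [z_score(w, context) for w in context]
```

In line with the note on arbitrary first arguments, `w_i` need not be an element of `context`; the function is defined for any real value.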
References
Abramo, G., Cicero, T., & D’Angelo, C. (2012). How important is choice of the scaling factor in standardizing citations? Journal of Informetrics, 6(4), 645–654.
Albarrán, P., Crespo, J., Ortuño, I., & Ruiz-Castillo, J. (2011). The skewness of science in 219 sub-fields and a number of aggregates. Scientometrics, 88(2), 385–397.
Kaplan, R., & Saccuzzo, D. (2013). Psychological Testing. Principles, Applications and Issues. Belmont: Wadsworth Publishing.
Kindlund, S. (2005). A method to standardize usability metrics into a single score. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 401–409). New York: ACM.
Larose, D., & Larose, C. (2015). Data mining and predictive analysis. Hoboken: Wiley.
Leydesdorff, L., Bornmann, L., Mutz, R., & Opthof, T. (2011). Turning the tables on citation analysis one more time: Principles for comparing sets of documents. Journal of the American Society for Information Science and Technology, 62, 1370–1381.
Lezak, M. (1995). Neuropsychological Assessment. New York and Oxford: Oxford University Press.
Li, Y., Radicchi, F., Castellano, C., & Ruiz-Castillo, J. (2013). Quantitative evaluation of alternative field normalization procedures. Journal of Informetrics, 7, 746–755.
Lundberg, J. (2007). Lifting the crown-citation z-score. Journal of Informetrics, 1(2), 145–154.
Milligan, G., & Cooper, M. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.
Mlachila, M., Tapsoba, R., & Tapsoba, S. (2014). A quality of growth index for developing countries: A proposal. IMF Working Paper (WP/14/172).
Moed, H. (2010). CWTS crown indicator measures citation impact of a research group’s publication oeuvre. Journal of Informetrics, 4, 436–438.
OECD. (2005). Handbook on constructing composite indicators: Methodology and user guide. Paris: OECD Publishing.
Radicchi, F., Fortunato, S., & Castellano, C. (2008). Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences of the United States of America, 105(45), 17268–17272.
Roberts, A., & Varberg, D. (1973). Convex Functions. New York and London: Academic Press.
Stoddard, A. (1979). Standardization of measures prior to cluster analysis. Biometrics, 35, 765–773.
Tijssen, R., Visser, M., & Van Leeuwen, T. (2002). Benchmarking international scientific excellence: Are highly cited research papers an appropriate frame of reference? Scientometrics, 54, 381–397.
Tullis, T., & Albert, B. (2013). Measuring the user experience. Collecting, analysing, and presenting usability metrics. Amsterdam: Elsevier.
Van Leeuwen, T., Visser, M., Moed, H. F., Nederhof, T. J., & Van Raan, A. F. (2003). The holy grail of science policy: Exploring and combining bibliometric tools in search of scientific excellence. Scientometrics, 57, 257–280.
Van Raan, A., Van Leeuwen, T., Visser, M. S., Van Eck, N., & Waltman, L. (2010). Rivals for the crown: Reply to Opthof and Leydesdorff. Journal of Informetrics, 4, 431–435.
Vinkler, P. (2012). The case of scientometricians with the absolute relative impact indicator. Journal of Informetrics, 6, 254–264.
Waltman, L. (2016). A review of the literature on citation impact indicators. Journal of Informetrics, 10, 365–391.
Waltman, L., & Schreiber, M. (2013). On the calculation of percentile-based bibliometric indicators. Journal of the American Society for Information Science and Technology, 64, 372–379.
Waltman, L., Van Eck, N., Van Leeuwen, T., Visser, M., & Van Raan, A. (2011). Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics, 5, 37–47.
Wang, Y., & Chen, H. J. (2012). Use of percentiles and z-scores in anthropometry. In Handbook of anthropometry: Physical measures of human form in health and disease (pp. 29–48). New York: Springer.
Zhang, Z., Cheng, Y., & Liu, N. C. (2014). Comparison of the effect of mean-based method and z-score for field normalization of citations at the level of web of science subject categories. Scientometrics, 101(3), 1679–1693.
Appendix
Proof of Lemma 1
Suppose, for contradiction, that \(\varPhi\) is redundant on \(\varOmega _{1}\). By definition, there exists a proper subset \(\varPsi\) of \(\varPhi\) such that \(\varPhi\) and \(\varPsi\) are equivalent on \(\varOmega _{1}\). In particular, this means that \(g({\mathbf {x}})=g({\mathbf {y}})\) for all \(g\in \varPhi\) whenever \({\mathbf {x}},{\mathbf {y}}\in \varOmega _1\) are such that \(f(\mathbf {x})=f(\mathbf {y})\) for all \(f\in \varPsi\). However, since \(\varPhi\) is non-redundant on \(\varOmega ^{*}\) by assumption, there exist \({\mathbf {x}}_0,{\mathbf {y}}_0\in \varOmega ^{*}\) and a \(g\in \varPhi \setminus \varPsi\) such that \(f({\mathbf {x}}_0)=f({\mathbf {y}}_0)\) for all \(f\in \varPsi\), but \(g({\mathbf {x}}_0)\ne g({\mathbf {y}}_0)\). Since \({\mathbf {x}}_0,{\mathbf {y}}_0\in \varOmega _{1}\), this contradicts the above property, which completes the proof.
Proof of Theorem 1
Let \((S,\varOmega ^{*},\varPhi , D)\) be a standardization set-up such that all the functions in \(\varPhi\) are location or dispersion statistics and positively homogeneous, with \(\varPhi \supseteq \{f,g\}\).
Observe that it follows from clause A4 in Definition 1 that for all \(u,v\in {\mathbb {R}}\) and all \({\mathbf {x}} \in \varOmega ^{*}\),
\[|S(u+v,{\mathbf {x}}) - S(u,{\mathbf {x}})| = |S(u,{\mathbf {x}})- S(u-v,{\mathbf {x}})|.\]
Moreover, \(u - (u-v)\) has the same sign as \((u+v) - u\). Applying Property A2, it follows that \(S(u+v,{\mathbf {x}}) - S(u,{\mathbf {x}})\) also has the same sign as \(S(u,{\mathbf {x}})- S(u-v,{\mathbf {x}})\) and so \(S(u+v,{\mathbf {x}}) - S(u,{\mathbf {x}})\) must be equal to \(S(u,{\mathbf {x}})- S(u-v,{\mathbf {x}})\). Notice also that, within a given equivalence class \(H\in \varOmega ^{*}/\sim _{\varPhi }\), the standardization function S is independent of its second argument (by A2), since the value of S is preserved under substitutions of the vector in the second argument with an equivalent one. So, let \(S_{H}: {\mathbb {R}} \rightarrow {\mathbb {R}}\) be the one-argument function defined as follows: \(S_{H}(u) = z\) if and only if \(S(u,{\mathbf {w}}) = z\) for \({\mathbf {w}} \in H\). Then the above equation can be written as:
\[S_{H}(u+v) - S_{H}(u) = S_{H}(u) - S_{H}(u-v)\]
for all \(u,v\in {\mathbb {R}}\). By an elementary algebraic manipulation (setting \(s=u-v\) and \(t=u+v\)), such property reads as
\[S_{H}\left( \frac{s+t}{2}\right) = \frac{S_{H}(s)+S_{H}(t)}{2} \quad \text{for all } s,t\in {\mathbb {R}}.\]
The above equation means that both \(S_H\) and \(-S_H\) are midconvex on \({\mathbb {R}}\). Since \(S_H\) is monotone, it is also measurable; hence, by a result proved by Blumberg and, independently, by Sierpinski (see, for instance, Roberts and Varberg (1973)), \(S_{H}\) is an affine function. Consequently, there exist real constants \(a_{H}\) and \(b_{H}\), depending only on H, such that, for all \(u\in {\mathbb {R}}\):
\[S_{H}(u) = a_{H}\,u + b_{H}.\]
Generalizing the previous argument, we can always associate with any \({\mathbf {x}}\in \varOmega ^{*}\) two real constants, denoted by \(a_{[{\mathbf {x}}]}\,\) and \(b_{[{\mathbf {x}}]}\), depending only on \([{\mathbf {x}}]\), such that, for all \(u\in {\mathbb {R}}\):
\[S(u,{\mathbf {x}}) = a_{[{\mathbf {x}}]}\,u + b_{[{\mathbf {x}}]}. \qquad (16)\]
Notice that \(a_{[{\mathbf {x}}]}\) is forced to be positive by A2. We now distinguish the following two cases: (1) \(\varPhi\) contains at least one dispersion statistic and (2) \(\varPhi\) contains no dispersion statistic.
Case 1. \(\varPhi \supseteq \{f,g\}\) contains a dispersion statistic, say g. Given any \({\mathbf {x}}\in \varOmega ^{*}\), denote by \(\underline{b}_{[{\mathbf {x}}]}\) the vector of E given by \(\underline{b}_{[{\mathbf {x}}]}=(b_{[{\mathbf {x}}]},\ldots ,b_{[{\mathbf {x}}]})\). Then, recalling that \(\bar{S}({{\mathbf {x}}})\) is an abbreviation for \((S(x_{1},{{\mathbf {x}}}), \ldots , S(x_{n},{{\mathbf {x}}}))\) for some \(n\in \mathbb {N}\), it follows from (16), the definition of dispersion statistic and the positive homogeneity of g that there exists an \(\alpha >0\) such that:
\[g\big (\bar{S}({\mathbf {x}}) \big ) = g\big (a_{[{\mathbf {x}}]}\,{\mathbf {x}}+\underline{b}_{[{\mathbf {x}}]}\big ) = a_{[{\mathbf {x}}]}^{\alpha }\, g({\mathbf {x}}). \qquad (17)\]
At the same time, since \(\bar{S}({\mathbf {x}})\in D\), owing to A1 we deduce that \(g\big (\bar{S}({\mathbf {x}}) \big )=c_g\), which, combined with Eq. (17), leads to
\[a_{[{\mathbf {x}}]}^{\alpha }\, g({\mathbf {x}}) = c_g. \qquad (18)\]
We assert that \(c_g\ne 0\): otherwise, by Eq. (18), recalling that \(a_{[{\mathbf {x}}]}>0\) for all \({\mathbf {x}}\in \varOmega ^{*}\), we would obtain that g is identically zero on \(\varOmega ^{*}\). In this case, it is easy to see that \(\varPhi\) and \(\varPhi \setminus \{g\}\) are equivalent on \(\varOmega ^{*}\), which contradicts the requirement in Definition 1 that \(\varPhi\) is non-redundant and proves the assertion. As a straightforward consequence of the assertion and Eq. (18), one finds that \(g({\mathbf {x}})\ne 0\) for all \({\mathbf {x}}\in \varOmega ^{*}\), showing that \(\varOmega ^{*}\) is a subset of \(\{{\mathbf {x}}\in \varOmega :g({\mathbf {x}})\ne 0\}\).
Now, observe that, if \(\varPhi\) contains another dispersion statistic, say \(g^{\prime }\), then, by Eq. (18), it is not difficult to show that there exists an \(\alpha ^{\prime }>0\) such that \(g^{\prime }({\mathbf {x}}) = c\cdot (g({\mathbf {x}}))^{\alpha ^{\prime }/\alpha }\) for all \({\mathbf {x}}\in \varOmega ^{*}\), where \(c=c_{g^{\prime }}\cdot (c_g)^{-\alpha ^{\prime }/\alpha }\). Consequently, \(\varPhi\) and \(\varPhi \setminus \{g^{\prime }\}\) are equivalent on \(\varOmega ^{*}\), contradicting the requirement that \(\varPhi\) is non-redundant. Therefore, \(\varPhi\) can contain at most one dispersion statistic.
Since, by assumption, \(\varPhi \supseteq \{f,g\}\) and all the functions in \(\varPhi\) are location or dispersion statistics, it follows that f is a location statistic. Repeating the argument illustrated before Eqs. (17) and (18), just replacing the definition of dispersion statistic with the one of location statistic, and recalling Remark 3, we deduce that:
\[a_{[{\mathbf {x}}]}\, f({\mathbf {x}}) + b_{[{\mathbf {x}}]} = c_f. \qquad (19)\]
Hence, resorting to Eqs. (18) and (19) and after a simple algebraic manipulation, Eq. (16) boils down to Eq. (3). Further, \(\varPhi\) cannot contain any other location statistic, say \(f'\); for, applying again Eq. (19) with \(f^{\prime }\) in place of f, we obtain that \(f'({\mathbf {x}}) = F(f({\mathbf {x}}),g({\mathbf {x}}))\), where \(F(u,v)=u+c v^{1/\alpha }\), with \(c=(c_{f^{\prime }}-c_f)\cdot c_g^{-1/\alpha }\). Thus, owing to Remark 1, \(\varPhi\) contains a redundant set of statistics given by \(\{f,g,f^{\prime }\}\), against the assumption that \(\varPhi\) is non-redundant. This implies that \(\varPhi = \{f,g\}\) and S must have the form stated in Eq. (3).
Finally, recalling Remark 2, it is clear that \(\varOmega ^{*}\) is maximal if and only if it coincides with the whole set \(\{{\mathbf {x}}\in \varOmega :g({\mathbf {x}})\ne 0\}\), which closes the case of the presence of a dispersion statistic in \(\varPhi\).
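As a rough numerical illustration of Case 1, the sketch below assumes that the standardization function takes the form \(S(u,{\mathbf {x}}) = c_f + (c_g/g({\mathbf {x}}))^{1/\alpha }(u-f({\mathbf {x}}))\), i.e. the slope and intercept implied by the constraints \(g(\bar{S}({\mathbf {x}}))=c_g\) and \(f(\bar{S}({\mathbf {x}}))=c_f\). The function name `standardize_case1`, the default constants, and the sample context are hypothetical choices made only for this example.

```python
import statistics

def standardize_case1(u, x, f, g, c_f=0.0, c_g=1.0, alpha=1.0):
    """Assumed Case 1 form: c_f + (c_g / g(x))**(1/alpha) * (u - f(x)).

    f is a location statistic, g a dispersion statistic that is positively
    homogeneous of degree alpha (all supplied by the caller).
    """
    gx = g(x)
    if gx == 0:
        raise ValueError("g(x) = 0: x lies outside the admissible domain")
    # Slope a_[x]; it is positive only if c_g and g(x) have the same sign.
    a = (c_g / gx) ** (1.0 / alpha)
    return c_f + a * (u - f(x))

x = [2.0, 5.0, 7.0, 10.0]  # hypothetical context

# f = mean, g = standard deviation (degree 1), c_f = 0, c_g = 1: the z-score.
s_bar = [standardize_case1(u, x, statistics.fmean, statistics.pstdev) for u in x]

# f = mean, g = variance (degree 2): the same slope is recovered via alpha = 2.
s_var = [standardize_case1(u, x, statistics.fmean, statistics.pvariance, alpha=2.0)
         for u in x]
```

With the mean and the population standard deviation this reduces to the familiar z-score, and the standardized vector has mean \(c_f\) and dispersion \(c_g\) by construction.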
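Case 2. \(\varPhi \supseteq \{f,g\}\) contains no dispersion statistic. Then, by assumption, all the statistics in \(\varPhi\) are location statistics, in particular \(f\,\) and g. Now, exploiting Eq. (19), we obtain that
\[b_{[{\mathbf {x}}]} = c_f - a_{[{\mathbf {x}}]}\, f({\mathbf {x}}). \qquad (20)\]
Repeating the same argument for g, we have that:
\[a_{[{\mathbf {x}}]}\, g({\mathbf {x}}) + b_{[{\mathbf {x}}]} = c_g. \qquad (21)\]
If we insert Eq. (20) into Eq. (21), we get:
\[a_{[{\mathbf {x}}]}\,\big (g({\mathbf {x}}) - f({\mathbf {x}})\big ) = c_g - c_f. \qquad (22)\]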
We claim that \(c_g-c_f\ne 0\): otherwise, by Eq. (22), recalling that \(a_{[{\mathbf {x}}]}>0\) for all \({\mathbf {x}}\in \varOmega ^{*}\), we would derive that \(f\equiv g\) on \(\varOmega ^{*}\), which contradicts the requirement in Definition 1 that \(\varPhi\) is non-redundant and proves the claim. As a straightforward consequence of the claim and Eq. (22), one finds that \(g({\mathbf {x}})-f({\mathbf {x}})\ne 0\) for all \({\mathbf {x}}\in \varOmega ^{*}\), showing that \(\varOmega ^{*}\) is a subset of \(\{{\mathbf {x}}\in \varOmega :g({\mathbf {x}})-f({\mathbf {x}})\ne 0\}\).
Now, resorting to Eqs. (22) and (20) and after a simple algebraic manipulation, it is easy to check that Eq. (16) boils down to Eq. (4). Further, \(\varPhi\) cannot contain any other location statistic, say \(f'\); for, applying again Eq. (19), with \(f^{\prime }\) in place of f, and Eq. (20), we obtain that \(f'({\mathbf {x}}) = F(f({\mathbf {x}}),g({\mathbf {x}}))\), where \(F(u,v)=c_1 u+c_2 v\), with
\[c_1 = \frac{c_g-c_{f^{\prime }}}{c_g-c_f}, \qquad c_2 = \frac{c_{f^{\prime }}-c_f}{c_g-c_f}.\]
Thus, owing to Remark 1, \(\varPhi\) contains a redundant set of statistics given by \(\{f,g,f^{\prime }\}\), against the assumption that \(\varPhi\) is non-redundant. Therefore, \(\varPhi = \{f,g\}\) and S must have the form stated in Eq. (4).
Finally, recalling Remark 2, it is clear that \(\varOmega ^{*}\) is maximal if and only if it coincides with the whole set \(\{\mathbf {x}\in \varOmega :g(\mathbf {x})-f(\mathbf {x})\ne 0\}\), which closes this case.
This concludes the proof of the theorem.
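For Case 2 of the proof above, a similar sketch can be given. It assumes that, with two location statistics f and g, the standardization function takes the form \(S(u,{\mathbf {x}}) = c_f + (c_g-c_f)\,\frac{u-f({\mathbf {x}})}{g({\mathbf {x}})-f({\mathbf {x}})}\), which is what the constraints \(f(\bar{S}({\mathbf {x}}))=c_f\) and \(g(\bar{S}({\mathbf {x}}))=c_g\) imply for an affine S. The function name `standardize_case2` and the data are hypothetical.

```python
def standardize_case2(u, x, f, g, c_f=0.0, c_g=1.0):
    """Assumed Case 2 form: c_f + (c_g - c_f) * (u - f(x)) / (g(x) - f(x)).

    f and g are two distinct location statistics supplied by the caller.
    """
    fx, gx = f(x), g(x)
    if gx == fx:
        raise ValueError("g(x) = f(x): x lies outside the admissible domain")
    # Slope a_[x]; it is positive only if c_g - c_f and g(x) - f(x) share a sign.
    a = (c_g - c_f) / (gx - fx)
    return c_f + a * (u - fx)

x = [2.0, 5.0, 7.0, 10.0]  # hypothetical context

# f = minimum, g = maximum, c_f = 0, c_g = 1: the usual max-min rescaling.
s_bar = [standardize_case2(u, x, min, max) for u in x]
```

With the minimum and the maximum as the two location statistics, the sketch reproduces the max-min standardization, and a constant context is rejected because the range is zero.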
Proof of Corollary 1
By Case 1 of Theorem 1, after the assignment \(p({\mathbf {x}}):=\underline{b}_{[{\mathbf {x}}]}\), we immediately deduce that S satisfies Eq. (5) and that \(\varOmega ^{*}\subseteq \{{\mathbf {x}}\in \varOmega : f({\mathbf {x}})\ne 0\}\). Note that C1 is a straightforward consequence of the fact that \(\underline{b}_{[{\mathbf {x}}]}=\underline{b}_{[{\mathbf {y}}]}\) whenever \({\mathbf {x}} \sim _{\varPhi } {\mathbf {y}}\), i.e. \(\,f({\mathbf {x}})=f({\mathbf {y}})\). Finally, C2 directly stems from condition A3.
Proof of Lemma 2
Suppose, for contradiction, that such an r exists. Let \({\mathbf {a}}\in E\): then, by definition of location statistic, we have that \(f({\mathbf {r}}+{\mathbf {a}})=f({\mathbf {r}})+a\) and, at the same time, \(f({\mathbf {a}}+{\mathbf {r}})= f({\mathbf {a}})+r\). Thus, exploiting the assumption \(f({\mathbf {r}})=r\), we derive that \(f({\mathbf {a}})=a\) for any \({\mathbf {a}}\in E\). Now, fix any \({\mathbf {x}}\notin E\): by assumption, we have that \(f({\mathbf {x}})=r-h\) for some \(h\ne 0\). Then, fixing \({\mathbf {h}}=(h,\ldots ,h)\in E\), we get \(f({\mathbf {x}}+{\mathbf {h}})=f({\mathbf {x}})+h=r\), hence \({\mathbf {x}}+{\mathbf {h}}\in f^{-1}(\{r\})\), which is a contradiction, because \({\mathbf {x}}+{\mathbf {h}}\) is evidently different from \({\mathbf {r}}\), since it does not belong to E.
Proof of Corollary 2
By virtue of Eq. (20), after the assignment \(p({\mathbf{x}}):=\underline{a}_{[\mathbf {x}]}\), we immediately deduce that S satisfies Eq. (6). Note that C1 is a straightforward consequence of the fact that \(\underline{a}_{[{\mathbf {x}}]}=\underline{a}_{[{\mathbf {y}}]}\) whenever \({\mathbf {x}} \sim _{\varPhi } {\mathbf {y}}\), i.e. \(\,f({\mathbf {x}})=f({\mathbf {y}})\). Condition C3 is due to the fact that, as recalled in the proof of Theorem 1, \(\underline{a}_{[{\mathbf {x}}]}\) is forced to be positive for any \({\mathbf {x}}\in \varOmega ^{*}\) by A2. Finally, as a direct consequence of the previous lemma, we know that there exists at least one \({\mathbf {x}}\in D\) such that \(x_i\ne c_f\) for some i. Thus, since condition A3 applied to Eq. (6) leads to \(p({\mathbf {x}})(x_i-c_f)=x_i-c_f\), C4 easily follows.
Cite this article
D’Agostino, M., Dardanoni, V. & Ricci, R.G. How to standardize (if you must). Scientometrics 113, 825–843 (2017). https://doi.org/10.1007/s11192-017-2495-7