Searching for a source of difference in graphical models
Introduction
The present work is motivated by the problem of identifying the origin of perturbation in gene regulatory networks. In biological networks, diseases can be modeled as perturbations that affect certain targets, which, once perturbed, propagate the perturbation through network connections [6]. In practice, we often collect and compare observations from healthy individuals and observations from patients after the disease-related perturbation has already taken place. On the basis of this comparison, it is of interest to identify the site of the original perturbation, i.e., the source of difference, and to distinguish it from the elements of the network that were affected through the process of network propagation.
Let $\mathcal{P} = \{P_\theta,\ \theta \in \Theta\}$ be a family, parametrized by $\theta$, of probability distributions for the random vector $X_V$, indexed by a set $V$, $|V| = p$, with support $\mathcal{X}_V$. In what follows, to unburden the notation and when no ambiguity can arise, we adopt the notation of [5] and, allowing for a slight abuse of notation, we write $\theta$ instead of $P_\theta$ to denote individual distributions belonging to $\mathcal{P}$. For $A, B \subseteq V$, we will further write $\theta_A$ to denote (the parameters of) the marginal distribution of variables in $A$ and, similarly, $\theta_{B \mid A}$ to denote a collection of conditional distributions indexed by $x_A \in \mathcal{X}_A$, where $X_A$, $A \subseteq V$, is a subvector of $X_V$ and $\mathcal{X}_A$ is the associated support. Different experimental conditions will be distinguished by use of superscripts.
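Consider a random vector $X_V$ observed under two conditions, with distributions $\theta^{(1)}$ and $\theta^{(2)}$. Within the context of two-sample problems, the interest is often in testing the null hypothesis of equality of distributions, $H_0: \theta^{(1)} = \theta^{(2)}$. If that hypothesis is rejected, one usually aims at localizing the source of difference.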
A common approach to tackle the question in genomics applications is to focus on the univariate marginal distributions; see, for instance, [16] for a particularly popular choice of method. Marginally speaking, a variable $X_v$, $v \in V$, can be considered relevant to the aim at hand if its marginal distribution is different in $\theta^{(1)}$ and $\theta^{(2)}$.
The (index) set of the relevant variables is then taken to be $\{v \in V : \theta^{(1)}_v \neq \theta^{(2)}_v\}$. Whether a variable belongs to this set depends solely on its marginal distribution.
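The marginal approach can be sketched in a few lines. The block below is a minimal illustration and not the method of [16]: it assumes that a difference in means is the only marginal difference of interest, uses a normal approximation to the Welch statistic, and applies a Bonferroni correction. The function name `marginally_relevant` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def marginally_relevant(sample1, sample2, alpha=0.05):
    """Return indices of variables whose marginal means differ.

    sample1, sample2: lists of observations, each a list of p values.
    Assumption: Welch statistic with a normal approximation and a
    Bonferroni correction; only differences in means are detected.
    """
    p = len(sample1[0])
    level = alpha / p  # Bonferroni-adjusted significance level
    flagged = []
    for j in range(p):
        a = [row[j] for row in sample1]
        b = [row[j] for row in sample2]
        diff = mean(a) - mean(b)
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        if se == 0:
            # Degenerate case: constant columns, flag only if means differ.
            if diff != 0:
                flagged.append(j)
            continue
        z = diff / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value < level:
            flagged.append(j)
    return flagged


# Toy data (hypothetical): variable 0 is shifted by 3 in the second sample,
# variable 1 equals variable 0 plus small noise, variable 2 is unrelated.
sample1 = [[-1, -0.9, 0.2], [0, -0.1, -0.1], [1, 1.0, 0.0],
           [-1, -1.1, 0.1], [0, 0.1, -0.2], [1, 1.0, 0.0]]
sample2 = [[2, 2.1, 0.0], [3, 2.9, 0.1], [4, 4.0, -0.2],
           [2, 1.9, 0.2], [3, 3.1, -0.1], [4, 4.0, 0.0]]

flagged = marginally_relevant(sample1, sample2)
print(flagged)  # [0, 1]
```

In this toy example both variables 0 and 1 are flagged, although only variable 0 was shifted directly; variable 1 differs marginally only because it inherits the shift.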
Although simple and computationally feasible, the marginal approach might fail to point to the true source of difference whenever an interplay between variables plays a role in differentiating the two distributions [10]. In that case, we propose to privilege a conditional perspective and exploit an approach which takes into account the entire $p$-dimensional joint distribution and flags a variable as relevant only if the difference in its marginal distribution cannot be explained by the remaining variables. We define the set of conditionally relevant variables as follows.
Definition 1 (Seed set). Consider $D \subseteq V$. We call the set $D$ a seed set if the collections of conditional laws $\theta^{(1)}_{V \setminus D \mid D}$ and $\theta^{(2)}_{V \setminus D \mid D}$ coincide. Furthermore, we say that $D$ is a minimal seed set if no proper subset of it is itself a seed set.
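To facilitate the understanding of the above definition, it is helpful to consider that, by employing the factorization $f^{(k)}(x_V) = f^{(k)}(x_D)\, f^{(k)}(x_{V \setminus D} \mid x_D)$, where $f^{(k)}$ denotes the density of $X_V$ under condition $k = 1, 2$, the likelihood ratio simplifies to $f^{(1)}(x_V) / f^{(2)}(x_V) = f^{(1)}(x_D) / f^{(2)}(x_D)$. The likelihood ratio thus depends only on variables in the seed set $D$. When comparing the two distributions, the variables outside of $D$ are either irrelevant or redundant, and $D$ can be seen as the minimal subset of variables explaining the difference between the two distributions. It should be stressed that there is no relation between the seed set and the set of marginally relevant variables; in general, neither set is contained in the other.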
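A two-variable illustration may help to see why neither inclusion holds. The Gaussian setting and the symbols $\delta$, $\beta$, $\rho_1$, $\rho_2$ below are assumptions introduced here for illustration only; superscripts $(1)$ and $(2)$ index the two conditions.

```latex
% Case (a): a shift in X_1 propagates to X_2 (delta != 0, beta != 0).
X_1^{(1)} \sim N(0, 1), \qquad X_1^{(2)} \sim N(\delta, 1), \qquad
X_2 \mid X_1 = x_1 \sim N(\beta x_1, 1) \ \text{in both conditions}.
% Marginally, X_2^{(1)} ~ N(0, beta^2 + 1) and X_2^{(2)} ~ N(beta*delta, beta^2 + 1),
% so both variables are marginally relevant, while the minimal seed set is {1}.

% Case (b): identical marginals, different correlation (rho_1 != rho_2).
(X_1, X_2)^{(k)} \sim N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} 1 & \rho_k \\ \rho_k & 1 \end{pmatrix} \right), \qquad k = 1, 2.
% Both marginals are N(0, 1) in either condition, so no variable is marginally
% relevant, while X_2 | X_1 = x_1 ~ N(rho_k x_1, 1 - rho_k^2) differs across
% conditions (and symmetrically for X_1 | X_2), so the minimal seed set is {1, 2}.
```

In case (a) the set of marginally relevant variables strictly contains the minimal seed set; in case (b) it is empty while the minimal seed set is the whole of $\{1, 2\}$.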
In practice, the seed set $D$ needs to be estimated from data. One could perform a number of tests of equality of conditional distributions, but when $p$ is large, this testing problem becomes extremely challenging and represents an open area of research; see, for instance, [25] and references therein. In this paper, we assume that the dependence structure among the variables in the joint distribution can be well represented by an undirected graph. We then address the problem of identifying $D$ within the framework of graphical models, where we exploit the structural modularity of decomposable graphical models [5], [8]. To this aim, we assume that $\mathcal{P}$ is a strong meta Markov model with respect to a given undirected decomposable graph $G = (V, E)$, where $E$ is a set of edges. Let us denote by $M(G)$ a family of distributions satisfying the global Markov property relative to $G$. According to the definition introduced by [5], $\mathcal{P} \subseteq M(G)$ is a strong meta Markov model if for any decomposition $(A, B)$ of $G$, parameters $\theta_A$ and $\theta_{B \mid A}$ are variation independent in $\mathcal{P}$ [2, p.26]. In other words, all possible values of $\theta_A$ are logically compatible with all possible values of $\theta_{B \mid A}$.
Under this assumption, there is a close relationship between the parametric model structure and the underlying graph, and we show that the problem of identifying the seed set $D$ can be formulated as the problem of testing equality of lower dimensional conditional distributions induced by the structure of $G$. We further show that the associated test statistics are functions of the quantities pertaining to the lower dimensional marginal distributions. The key advantage is that inference on marginal distributions is significantly less challenging than inference on conditional distributions. Besides the computational gain, we argue that the proposed approach addresses the issue of exploiting information on the structure of dependence in an efficient and elegant way.
Decomposition of the global hypothesis of equality of two Markov distributions
A major appeal of decomposable graphs in graphical modeling is that they allow for a clique-grained decomposition of the statistical model. Let $C_1, \dots, C_k$ be a sequence of cliques of $G$ satisfying a running intersection property (see Section 1 of Supplementary material), and let $S_2, \dots, S_k$ be an associated sequence of (possibly non-unique) separators. Then, if the distribution of $X_V$ is Markov relative to $G$, its joint distribution decomposes as $f(x_V) = \prod_{i=1}^{k} f(x_{C_i \setminus S_i} \mid x_{S_i})$, where $S_1 = \emptyset$ and $S_i = C_i \cap (C_1 \cup \dots \cup C_{i-1})$, $i = 2, \dots, k$.
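The bookkeeping behind such a decomposition can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the paper: the cliques are supplied already ordered, the separator of the $i$-th clique is taken to be its intersection with the union of the earlier cliques, and the helper name `separators_and_residuals` is hypothetical.

```python
def separators_and_residuals(cliques):
    """Compute separators and residuals for an ordered clique sequence.

    For the i-th clique C_i, the separator is S_i = C_i & (C_1 | ... | C_{i-1})
    and the residual is C_i minus S_i. A ValueError is raised when the order
    violates the running intersection property, i.e. when some S_i is not
    contained in a single earlier clique.
    """
    separators, residuals = [], []
    seen = set()  # union of all earlier cliques
    for i, clique in enumerate(cliques):
        clique = set(clique)
        sep = clique & seen
        if i > 0 and not any(sep <= set(earlier) for earlier in cliques[:i]):
            raise ValueError(f"running intersection property fails at clique {i}")
        separators.append(sep)
        residuals.append(clique - sep)
        seen |= clique
    return separators, residuals


# Hypothetical example: three cliques of a decomposable graph on nodes 1..5.
cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
seps, res = separators_and_residuals(cliques)
print(seps)  # [set(), {2, 3}, {4}]
print(res)   # [{1, 2, 3}, {4}, {5}]
```

The residuals partition the node set, so each variable appears on the left of the conditioning bar in exactly one factor of the decomposition.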
The graphical seed set
Before we show how the result of the previous section can be used to make inference about the seed set, we need to introduce the concept of the graphical seed set. Namely, by employing a clique-grained decomposition, we are not always able to identify the minimal seed set; in those cases we can identify a superset of it, which we denote by $D_G$. The relation between the two sets, which depends on both $D$ and $G$, is the subject of this section.
Definition 2 (Graphical seed set). Let $D$ be a minimal seed set for $\theta^{(1)}$ and $\theta^{(2)}$, two graphical
Simulation study 1
To study the finite sample behavior of , we considered a randomly generated graph consisting of 100 nodes grouped in 37 cliques (the largest clique containing 15 nodes). The code to reproduce all numerical experiments, as well as real data analysis featured in Section 5, is available at https://github.com/veradjordjilovic/Seed-set. A plot of the graph is shown in Fig. 5 in Supplementary Material. The minimal seed set was set to . In the chosen graph, the graphical seed set does not
Biological validation
Genes and gene products cluster into functionally connected pathways, i.e. networks of biological interactions that describe their basic dynamics [12]. A large literature has developed around the problem of detecting statistically significant dysregulations of pathways in different experimental conditions [9], [11], [21], but translating detected dysregulations into claims about their origin is a challenging task. Chromosomal rearrangements offer a possible explanation. Chromosome
Discussion
The two-sample testing problem we consider is closely related to the problem of variable selection in logistic regression. When the predictor is a $p$-dimensional random vector and the response is a class label (1 or 2), the minimal seed set coincides with the Markov blanket of the response.
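The connection can be made explicit through Bayes' rule. In the sketch below, $Y \in \{1, 2\}$ denotes the class label, $f^{(k)}$ the density of the predictor $X_V$ in class $k$, and $D$ a seed set; the prior class probabilities $\pi_k = P(Y = k)$ are notation assumed here for illustration.

```latex
\log \frac{P(Y = 1 \mid X_V = x_V)}{P(Y = 2 \mid X_V = x_V)}
  = \log \frac{\pi_1}{\pi_2} + \log \frac{f^{(1)}(x_V)}{f^{(2)}(x_V)}
  = \log \frac{\pi_1}{\pi_2} + \log \frac{f^{(1)}(x_D)}{f^{(2)}(x_D)} .
```

The log-odds thus depend on $x_V$ only through $x_D$, so the label is conditionally independent of the variables outside $D$ given those in $D$; when $D$ is minimal, this is the property that characterizes a Markov blanket of the response.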
Modularity of graphical models is usually considered with regard to density factorization or parameter estimation. Theorem 1 mirrors this property in the hypothesis testing setting within the framework of strong
CRediT authorship contribution statement
Vera Djordjilović: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft, Writing – review & editing. Monica Chiogna: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision, Project administration.
Acknowledgments
We thank the Editor, Associate Editor and the Referees for useful comments and suggestions on an earlier version of the manuscript, which led to this improved version. Insight and expertise that greatly contributed to the development of this paper and the underlying research were generously provided by a number of colleagues, most notably Chiara Romualdi, Maria Sofia Massa and Elisa Salviato.
References (25)
- et al., Diseases as network perturbations, Curr. Opin. Biotechnol. (2010)
- An Introduction to Multivariate Statistical Analysis (2003)
- Information and Exponential Families in Statistical Theory (2014)
- et al., Graphical models for skew-normal variates, Scand. J. Stat. (2003)
- et al., Gene expression profiles of B-lineage adult acute lymphocytic leukemia reveal genetic patterns that identify lineage derivation and distinct mechanisms of transformation, Clin. Cancer Res. (2005)
- et al., Hyper Markov laws in the statistical analysis of decomposable graphical models, Ann. Statist. (1993)
- et al., A common platform for graphical models in R: The gRbase package, J. Stat. Softw. (2005)
- et al., Decomposition of maximum likelihood in mixed graphical interaction models, Biometrika (1989)
- et al., A global test for groups of genes: testing association with a clinical outcome, Bioinformatics (2004)
- et al., A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation, PLoS Comput. Biol. (2009)
- GlobalANCOVA: exploration and assessment of gene group effects, Bioinformatics
- KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res.