Searching for a source of difference in graphical models

https://doi.org/10.1016/j.jmva.2022.104973

Abstract

We look at a two-sample problem within the framework of decomposable graphical models. When the global hypothesis of equality of two distributions is rejected, the interest is usually in localizing the source of difference. Motivated by the idea that diseases can be seen as system perturbations, and by the need to distinguish between the origin of perturbation and components affected by the perturbation, we introduce the concept of a minimal seed set, and its graphical counterpart, a graphical seed set. They intuitively consist of variables driving the difference between the two conditions. We propose a simple testing procedure, linear in the number of nodes, to estimate the graphical seed set from data. We illustrate our approach in the context of gene set analysis, where we show that it is possible to zoom in on the origin of perturbation in a gene network.

Introduction

The present work is motivated by the problem of identifying the origin of perturbation in gene regulatory networks. In biological networks, diseases can be modeled as perturbations that affect certain targets, which, once perturbed, propagate the perturbation through network connections [6]. In practice, we often collect and compare observations from healthy individuals and observations from patients after the disease-related perturbation has already taken place. On the basis of this comparison, it is of interest to identify the site of original perturbation, i.e., the source of difference, and distinguish it from the elements of the network that were affected through the process of network propagation.

Let $\mathcal{F}=\{P_\theta;\ \theta\in\Theta\}$, $\Theta\subseteq\mathbb{R}^d$, be a family, parametrized by $\theta$, of probability distributions for the random vector $X_V$, indexed by a set $V$, $|V|=p$, with support $\mathcal{X}_V$. In what follows, to unburden the notation and when no ambiguity can arise, we adopt the notation of [5] and, allowing for a slight abuse of notation, we write $\theta$ instead of $P_\theta$ to denote individual distributions belonging to $\mathcal{F}$. For $A,B\subseteq V$, we will further write $\theta_A$ to denote (the parameters of) the marginal distribution of variables in $A$ and, similarly, $\theta_{A\mid B}$ to denote a collection of conditional distributions $\{\theta_{A\mid X_B=y},\ y\in\mathcal{X}_B\}$ indexed by $y$, where $X_B$, $B\subseteq V$, is a subvector of $X_V$ and $\mathcal{X}_B$ is the associated support. Different experimental conditions will be distinguished by use of superscripts.

Consider a random vector $X_V\sim P_\theta$. Within the context of two-sample problems, the interest is often in testing the null hypothesis of equality of distributions $H_0:\theta^{(1)}=\theta^{(2)}$. If that hypothesis is rejected, one usually aims at localizing the source of difference.

A common approach to tackle the question in genomics applications is to focus on the $p$ univariate marginal distributions, see for instance [16] for a particularly popular method choice. Marginally speaking, a variable $X_v$, $v\in V$, can be considered relevant to the aim at hand if its marginal distribution is different in $P_{\theta^{(1)}}$ and $P_{\theta^{(2)}}$.

The (index) set of the relevant variables is then taken to be $R=\{v\in V:\theta_v^{(1)}\neq\theta_v^{(2)}\}$. Whether a variable belongs to $R$ depends solely on its marginal distribution.

Although simple and computationally feasible, the marginal approach might fail to point to the true source of difference whenever an interplay between variables plays a role in differentiating the two distributions [10]. In that case, we propose to privilege a conditional perspective and exploit an approach which takes into account the entire $p$-dimensional joint distribution and flags a variable as relevant only if the difference in its marginal distribution cannot be explained by the remaining variables. We define the set of conditionally relevant variables $D$ as follows.

Definition 1 Seed Set

Consider $\theta^{(1)},\theta^{(2)}\in\mathcal{F}$. We call a set $D\subseteq V$ a seed set if the collections of conditional laws $\theta^{(1)}_{V\setminus D\mid D}$ and $\theta^{(2)}_{V\setminus D\mid D}$ coincide. Furthermore, we say that $D$ is a minimal seed set if no proper subset of it is itself a seed set.

To facilitate the understanding of the above definition, it is helpful to consider that, by employing the factorization $p(x;\theta)=p(x_D;\theta_D)\,p(x_{\bar{D}}\mid x_D;\theta_{\bar{D}\mid D})$, where $\bar{D}=V\setminus D$, the likelihood ratio $p(x;\theta^{(1)})/p(x;\theta^{(2)})$ simplifies to $p(x_D;\theta_D^{(1)})/p(x_D;\theta_D^{(2)})$. The likelihood ratio thus depends only on variables in $D$. When comparing the two distributions, the variables outside of $D$ are either irrelevant or redundant and $D$ can be seen as the minimal subset of variables explaining the difference between the two distributions. It should be stressed that there is no relation between $R$ and $D$; in general neither $R\subseteq D$ nor $D\subseteq R$ holds.
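The distinction between R and D can be checked by brute force on a toy example. The sketch below is only an illustration under stated assumptions: two binary variables with strictly positive probabilities, indexed from 0, and hypothetical helper names (`joint`, `is_seed_set`, `minimal_seed_sets`) that do not come from the paper or its code repository. It enumerates all subsets of V, so it is feasible only for very small p and is not the testing procedure proposed in this work.

```python
from itertools import combinations, product

def joint(p0, p1_given_0):
    """Joint pmf of binary (X0, X1) from P(X0=1) and P(X1=1 | X0=x0)."""
    pmf = {}
    for x0, x1 in product((0, 1), repeat=2):
        a = p0 if x0 == 1 else 1.0 - p0
        b = p1_given_0[x0] if x1 == 1 else 1.0 - p1_given_0[x0]
        pmf[(x0, x1)] = a * b
    return pmf

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def marginally_relevant(pmf1, pmf2, p, tol=1e-9):
    """The set R: variables whose univariate marginals differ."""
    relevant = []
    for v in range(p):
        m1, m2 = marginal(pmf1, (v,)), marginal(pmf2, (v,))
        if any(abs(m1[k] - m2[k]) > tol for k in m1):
            relevant.append(v)
    return relevant

def is_seed_set(pmf1, pmf2, D, tol=1e-9):
    """True if the law of X_{V\\D} given X_D is the same under both pmfs."""
    m1, m2 = marginal(pmf1, D), marginal(pmf2, D)
    for x in pmf1:
        key = tuple(x[i] for i in D)
        if abs(pmf1[x] / m1[key] - pmf2[x] / m2[key]) > tol:
            return False
    return True

def minimal_seed_sets(pmf1, pmf2, p):
    """All seed sets that contain no smaller seed set (exhaustive search)."""
    found = []
    for size in range(p + 1):
        for D in combinations(range(p), size):
            contains_smaller = any(set(m) <= set(D) for m in found)
            if not contains_smaller and is_seed_set(pmf1, pmf2, D):
                found.append(D)
    return found

# Case A: only the law of X0 changes; X1 | X0 is shared by both conditions.
shared = {0: 0.2, 1: 0.8}
case_a = (joint(0.3, shared), joint(0.6, shared))
R_a = marginally_relevant(case_a[0], case_a[1], 2)
D_a = minimal_seed_sets(case_a[0], case_a[1], 2)

# Case B: both univariate marginals are unchanged, but the dependence flips.
case_b = (joint(0.5, {0: 0.2, 1: 0.8}), joint(0.5, {0: 0.8, 1: 0.2}))
R_b = marginally_relevant(case_b[0], case_b[1], 2)
D_b = minimal_seed_sets(case_b[0], case_b[1], 2)

# In case A the likelihood ratio is a function of x0 alone.
ratios = {x: case_a[0][x] / case_a[1][x] for x in case_a[0]}

print(R_a, D_a)  # [0, 1] [(0,)]
print(R_b, D_b)  # [] [(0, 1)]
```

In case A the perturbation of X0 propagates to the marginal of X1, so R contains both variables while the minimal seed set contains only X0. In case B no marginal changes, so R is empty, yet both variables are needed in D.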

In practice, to identify the seed set, $D$ needs to be estimated from data. One could perform a number of tests of equality of conditional distributions, but when $p$ is large, this testing problem becomes extremely challenging, and represents an open area of research, see for instance [25] and references therein. In this paper, we assume that the dependence structure among the $p$ variables in the joint distribution can be well represented by an undirected graph. We then address the problem of identifying $D$ within the framework of graphical models, where we exploit the structural modularity of decomposable graphical models [5], [8]. To this aim, we assume that $\mathcal{F}$ is a strong meta Markov model with respect to a given undirected decomposable graph $G=(V,E)$, where $E\subseteq V\times V$ is a set of edges. Let us denote by $M(G)$ a family of distributions satisfying the global Markov property relative to $G$. According to the definition introduced by [5], $\mathcal{F}\subseteq M(G)$ is a strong meta Markov model if for any decomposition $(A,B)$ of $G$, parameters $\theta_A$ and $\theta_{B\mid A}$ are variation independent in $\mathcal{F}$ [2, p.26]. In other words, all possible values of $\theta_A$ are logically compatible with all possible values of $\theta_{B\mid A}$.

Under this assumption, there is a close relationship between the parametric model structure and the underlying graph, and we show that the problem of identifying $D$ can be formulated as the problem of testing equality of lower dimensional conditional distributions induced by the structure of $G$. We further show that the associated test statistics are functions of the quantities pertaining to the lower dimensional marginal distributions. The key advantage is that inference on marginal distributions is significantly less challenging than inference on conditional distributions. Besides the computational gain, we argue that the proposed approach addresses the issue of exploiting information on the structure of dependence in an efficient and elegant way.


Decomposition of the global hypothesis of equality of two Markov distributions

A major appeal of decomposable graphs in graphical modeling is that they allow for a clique-grained decomposition of the statistical model. Let $C_1,\ldots,C_k$ be a sequence of cliques of $G$ satisfying a running intersection property (see Section 1 of Supplementary material), and let $S_2,\ldots,S_k$ be an associated sequence of (possibly non-unique) separators. Then, if the distribution of $X_V$ is Markov relative to $G$, its joint distribution decomposes as $p(x_V)=p(x_{C_1})\prod_{j=2}^{k}p(x_{R_j}\mid x_{S_j})$, where $R_j=C_j\setminus S_j$, $j\in\{2,\ldots,k\}$.
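The decomposition above can be verified numerically on a small example. The following sketch assumes a chain of three binary variables X0 - X1 - X2 (0-based indices), with cliques {0, 1} and {1, 2} and separator {1}, and strictly positive probabilities. The function names are hypothetical and chosen for illustration only. It uses the fact that, since each clique is the disjoint union of its residual and its separator, p(x_Rj | x_Sj) equals p(x_Cj) / p(x_Sj).

```python
from itertools import product

def bern(prob_one, x):
    """P(X = x) for a binary variable with P(X = 1) = prob_one."""
    return prob_one if x == 1 else 1.0 - prob_one

def chain_joint(p0, p1_given_0, p2_given_1):
    """Joint pmf of a binary chain X0 - X1 - X2, Markov relative to that graph."""
    return {
        (x0, x1, x2): bern(p0, x0) * bern(p1_given_0[x0], x1) * bern(p2_given_1[x1], x2)
        for x0, x1, x2 in product((0, 1), repeat=3)
    }

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def clique_factorization(pmf, cliques, separators):
    """Rebuild the joint as p(x_C1) * prod_{j>=2} p(x_Cj) / p(x_Sj)."""
    clique_marg = [marginal(pmf, c) for c in cliques]
    sep_marg = [marginal(pmf, s) for s in separators]
    rebuilt = {}
    for x in pmf:
        value = clique_marg[0][tuple(x[i] for i in cliques[0])]
        for c, s, cm, sm in zip(cliques[1:], separators, clique_marg[1:], sep_marg):
            value *= cm[tuple(x[i] for i in c)] / sm[tuple(x[i] for i in s)]
        rebuilt[x] = value
    return rebuilt

pmf = chain_joint(0.3, {0: 0.2, 1: 0.7}, {0: 0.4, 1: 0.9})
rebuilt = clique_factorization(pmf, cliques=[(0, 1), (1, 2)], separators=[(1,)])
max_error = max(abs(pmf[x] - rebuilt[x]) for x in pmf)
print(max_error)  # zero up to floating-point rounding
```

Only the clique and separator marginals enter the reconstruction, which is the property that lets inference be carried out on lower dimensional marginal distributions.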

The graphical seed set

Before we show how the result of the previous section can be used to make inference about the seed set, we need to introduce the concept of the graphical seed set. By employing a clique-grained decomposition, we are not always able to identify the minimal seed set; in those cases we can identify a superset of it, which we denote by $D_G$. The relation between the two sets, which depends on both $D$ and $G$, is the subject of this section.

Definition 2 Graphical Seed Set

Let $D$ be a minimal seed set for $\theta^{(1)}$ and $\theta^{(2)}$, two graphical

Simulation study 1

To study the finite sample behavior of $\hat{D}_G$, we considered a randomly generated graph $G$ consisting of 100 nodes grouped in 37 cliques (the largest clique containing 15 nodes). The code to reproduce all numerical experiments, as well as real data analysis featured in Section 5, is available at https://github.com/veradjordjilovic/Seed-set. A plot of the graph is shown in Fig. 5 in Supplementary Material. The minimal seed set was set to $D=\{2,5\}$. In the chosen graph, the graphical seed set does not

Biological validation

Genes and gene products cluster into functionally connected pathways, i.e. networks of biological interactions that describe their basic dynamics [12]. A large literature has developed around the problem of detecting statistically significant dysregulations of pathways in different experimental conditions [9], [11], [21], but translating detected dysregulations into claims about their origin is a challenging task. Chromosomal rearrangements offer a possible explanation. Chromosome

Discussion

The two-sample testing problem we consider is closely related to the problem of variable selection in logistic regression. When the predictor is a $p$-dimensional random vector $X$ and the output is a class label (1 or 2), the minimal seed set coincides with the Markov blanket of the response.
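One way to see this connection is through Bayes' rule. The derivation below is a sketch that assumes a class label $Y$ with prior probabilities $\pi_1,\pi_2>0$; this notation is introduced here only for illustration. The second equality uses the simplification of the likelihood ratio given after Definition 1.

```latex
\[
\frac{P(Y=1\mid X=x)}{P(Y=2\mid X=x)}
  = \frac{\pi_1\,p(x;\theta^{(1)})}{\pi_2\,p(x;\theta^{(2)})}
  = \frac{\pi_1\,p(x_D;\theta_D^{(1)})}{\pi_2\,p(x_D;\theta_D^{(2)})}.
\]
```

The posterior odds therefore depend on $x$ only through $x_D$, so that $Y$ is conditionally independent of $X_{V\setminus D}$ given $X_D$, and the minimality of $D$ corresponds to the minimality required of a Markov blanket.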

Modularity of graphical models is usually considered with regard to density factorization or parameter estimation. Theorem 1 mirrors this property in the hypothesis testing setting within the framework of strong

CRediT authorship contribution statement

Vera Djordjilović: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft, Writing – review & editing. Monica Chiogna: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision, Project administration.

Acknowledgments

We thank the Editor, Associate Editor and the Referees for useful comments and suggestions on an earlier version of the manuscript, which led to this improved version. Insight and expertise that greatly contributed to the development of this paper and the research behind it were generously provided by a number of colleagues, most notably Chiara Romualdi, Maria Sofia Massa and Elisa Salviato.

References (25)

  • Del Sol A. et al. Diseases as network perturbations. Curr. Opin. Biotechnol. (2010)
  • Anderson T.W. An Introduction to Multivariate Statistical Analysis. (2003)
  • Barndorff-Nielsen O. Information and Exponential Families in Statistical Theory. (2014)
  • Capitanio A. et al. Graphical models for skew-normal variates. Scand. J. Stat. (2003)
  • Chiaretti S. et al. Gene expression profiles of B-lineage adult acute lymphocytic leukemia reveal genetic patterns that identify lineage derivation and distinct mechanisms of transformation. Clin. Cancer Res. (2005)
  • Dawid A. et al. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. (1993)
  • Dethlefsen C. et al. A common platform for graphical models in R: The gRbase package. J. Stat. Softw. (2005)
  • Frydenberg M. et al. Decomposition of maximum likelihood in mixed graphical interaction models. Biometrika (1989)
  • Goeman J.J. et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics (2004)
  • Hudson N.J. et al. A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Comput. Biol. (2009)
  • Hummel M. et al. GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics (2008)
  • Kanehisa M. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. (2000)