Searching for a source of difference in graphical models

doi:10.1016/j.jmva.2022.104973

Journal of Multivariate Analysis

Volume 190, July 2022, 104973

Abstract

We look at a two-sample problem within the framework of decomposable graphical models. When the global hypothesis of equality of two distributions is rejected, the interest is usually in localizing the source of difference. Motivated by the idea that diseases can be seen as system perturbations, and by the need to distinguish between the origin of perturbation and components affected by the perturbation, we introduce the concept of a minimal seed set, and its graphical counterpart, a graphical seed set. They intuitively consist of variables driving the difference between the two conditions. We propose a simple testing procedure, linear in the number of nodes, to estimate the graphical seed set from data. We illustrate our approach in the context of gene set analysis, where we show that it is possible to zoom in on the origin of perturbation in a gene network.

Introduction

The present work is motivated by the problem of identifying the origin of perturbation in gene regulatory networks. In biological networks, diseases can be modeled as perturbations that affect certain targets, which, once perturbed, propagate the perturbation through network connections [6]. In practice, we often collect and compare observations from healthy individuals and observations from patients after the disease-related perturbation has already taken place. On the basis of this comparison, it is of interest to identify the site of the original perturbation, i.e., the source of difference, and distinguish it from the elements of the network that were affected through the process of network propagation.

Let $F = \{P_{θ}; θ \in Θ\}$, $Θ \subset \mathbb{R}^{d}$, be a family, parametrized by $θ$, of probability distributions for the random vector $X_{V}$, indexed by a set $V$, $| V | = p$, with support $\mathcal{X}_{V}$. In what follows, to unburden the notation and when no ambiguity can arise, we adopt the notation of [5] and, allowing for a slight abuse of notation, we write $θ$ instead of $P_{θ}$ to denote individual distributions belonging to $F$. For $A, B \subseteq V$, we will further write $θ_{A}$ to denote (the parameters of) the marginal distribution of variables in $A$ and, similarly, $θ_{A ∣ B}$ to denote a collection of conditional distributions $\{θ_{A ∣ X_{B} = y}, y \in \mathcal{X}_{B}\}$ indexed by $y$, where $X_{B}$, $B \subseteq V$, is a subvector of $X_{V}$ and $\mathcal{X}_{B}$ is the associated support. Different experimental conditions will be distinguished by use of superscripts.

Consider a random vector $X_{V} \sim P_{θ}$. Within the context of two-sample problems, the interest is often in testing the null hypothesis of equality of distributions $H_{0} : θ^{(1)} = θ^{(2)}$. If that hypothesis is rejected, one usually aims at localizing the source of difference.

A common approach to tackle the question in genomics applications is to focus on the $p$ univariate marginal distributions; see, for instance, [16] for a particularly popular method. Marginally speaking, a variable $X_{v}$, $v \in V$, can be considered relevant to the aim at hand if its marginal distribution is different in $P_{θ^{(1)}}$ and $P_{θ^{(2)}}$.

The (index) set of the relevant variables is then taken to be $R = \{v \in V : θ_{v}^{(1)} \neq θ_{v}^{(2)}\}$. Whether a variable belongs to $R$ depends solely on its marginal distribution.

Although simple and computationally feasible, the marginal approach might fail to point to the true source of difference whenever an interplay between variables plays a role in differentiating the two distributions [10]. In that case, we propose to privilege a conditional perspective and exploit an approach which takes into account the entire $p$-dimensional joint distribution and flags a variable as relevant only if the difference in its marginal distribution cannot be explained by the remaining variables. We define the set of conditionally relevant variables $D$ as follows.

Definition 1 Seed Set

Consider $θ^{(1)}, θ^{(2)} \in F$. We call the set $D \subseteq V$ a seed set if the collections of conditional laws $θ_{V ∖ D ∣ D}^{(1)}$ and $θ_{V ∖ D ∣ D}^{(2)}$ coincide. Furthermore, we say that $D$ is a minimal seed set if no proper subset of it is itself a seed set.
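To make Definition 1 concrete, the following is a minimal brute-force sketch for small discrete distributions. It is only an illustration, not the testing procedure proposed in the paper: it assumes both joint pmfs are known exactly and strictly positive on a common support, and all function names and the numbers in the toy example are hypothetical. The toy example is a two-variable chain in which only the marginal of the first variable is perturbed while the conditional law of the second given the first is left unchanged.

```python
from itertools import combinations, product

def marginal(p, idx):
    """Marginal pmf of the coordinates in idx, from a joint pmf {tuple: prob}."""
    out = {}
    for x, pr in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def is_seed_set(p1, p2, D, tol=1e-12):
    """True if the conditional laws of X_{V \\ D} given X_D coincide in p1 and p2.

    Assumes strictly positive pmfs on the same support (illustrative assumption).
    """
    D = tuple(sorted(D))
    m1, m2 = marginal(p1, D), marginal(p2, D)
    for x in p1:
        xd = tuple(x[i] for i in D)
        # p1(x)/p1(x_D) == p2(x)/p2(x_D), checked by cross-multiplication
        if abs(p1[x] * m2[xd] - p2[x] * m1[xd]) > tol:
            return False
    return True

def minimal_seed_sets(p1, p2, n):
    """All minimal seed sets, found by enumerating subsets by increasing size."""
    found = []
    for size in range(n + 1):
        for D in combinations(range(n), size):
            if any(set(F) <= set(D) for F in found):
                continue  # D contains a smaller seed set, so it is not minimal
            if is_seed_set(p1, p2, D):
                found.append(D)
    return found

def marginally_relevant(p1, p2, n, tol=1e-12):
    """The set R of variables whose univariate marginals differ."""
    R = []
    for v in range(n):
        m1, m2 = marginal(p1, (v,)), marginal(p2, (v,))
        if any(abs(m1[k] - m2[k]) > tol for k in m1):
            R.append(v)
    return R

def chain_joint(a, c):
    """Joint pmf of binary (X0, X1) with P(X0=1)=a and P(X1=1 | X0=x0)=c[x0]."""
    p = {}
    for x0, x1 in product((0, 1), repeat=2):
        p0 = a if x0 else 1 - a
        p1 = c[x0] if x1 else 1 - c[x0]
        p[(x0, x1)] = p0 * p1
    return p

# Hypothetical toy example: only the marginal of X0 is perturbed.
p1 = chain_joint(0.3, (0.2, 0.7))
p2 = chain_joint(0.6, (0.2, 0.7))
seeds = minimal_seed_sets(p1, p2, 2)
R = marginally_relevant(p1, p2, 2)
print(seeds)  # [(0,)]
print(R)      # [0, 1]
```

In this toy example the perturbation of the first variable propagates to the marginal of the second, so both variables are marginally relevant, while the minimal seed set contains only the first. The enumeration is exponential in the number of variables and is meant purely to illustrate the definition.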

To facilitate the understanding of the above definition, it is helpful to consider that, by employing the factorization $p (x; θ) = p (x_{D}; θ_{D}) p (x_{\bar{D}} ∣ x_{D}; θ_{\bar{D} ∣ D})$, where $\bar{D} = V ∖ D$, the likelihood ratio $p (x; θ^{(1)}) / p (x; θ^{(2)})$ simplifies to $p (x_{D}; θ_{D}^{(1)}) / p (x_{D}; θ_{D}^{(2)})$ whenever $D$ is a seed set. The likelihood ratio thus depends only on variables in $D$. When comparing the two distributions, the variables outside of $D$ are either irrelevant or redundant, and a minimal $D$ can be seen as a smallest subset of variables explaining the difference between the two distributions. It should be stressed that there is no relation between $R$ and $D$; in general, neither $R \subseteq D$ nor $D \subseteq R$.
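Written out step by step, and assuming all densities involved are strictly positive (an assumption made here only for the illustration), the cancellation reads:

$$\frac{p (x; θ^{(1)})}{p (x; θ^{(2)})} = \frac{p (x_{D}; θ_{D}^{(1)}) \, p (x_{\bar{D}} ∣ x_{D}; θ_{\bar{D} ∣ D}^{(1)})}{p (x_{D}; θ_{D}^{(2)}) \, p (x_{\bar{D}} ∣ x_{D}; θ_{\bar{D} ∣ D}^{(2)})} = \frac{p (x_{D}; θ_{D}^{(1)})}{p (x_{D}; θ_{D}^{(2)})}.$$

The first equality applies the factorization to both numerator and denominator. The second uses Definition 1: when $D$ is a seed set, $θ_{\bar{D} ∣ D}^{(1)} = θ_{\bar{D} ∣ D}^{(2)}$, so the two conditional factors are identical and cancel.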

In practice, to identify the seed set, $D$ needs to be estimated from data. One could perform a number of tests of equality of conditional distributions, but when $p$ is large, this testing problem becomes extremely challenging and represents an open area of research; see, for instance, [25] and references therein. In this paper, we assume that the dependence structure among the $p$ variables in the joint distribution can be well represented by an undirected graph. We then address the problem of identifying $D$ within the framework of graphical models, where we exploit the structural modularity of decomposable graphical models [5], [8]. To this aim, we assume that $F$ is a strong meta Markov model with respect to a given undirected decomposable graph $G = (V, E)$, where $E \subseteq V \times V$ is a set of edges. Let us denote by $M (G)$ a family of distributions satisfying the global Markov property relative to $G$. According to the definition introduced by [5], $F \subseteq M (G)$ is a strong meta Markov model if, for any decomposition $(A, B)$ of $G$, parameters $θ_{A}$ and $θ_{B ∣ A}$ are variation independent in $F$ [2, p.26]. In other words, all possible values of $θ_{A}$ are logically compatible with all possible values of $θ_{B ∣ A}$.

Under this assumption, there is a close relationship between the parametric model structure and the underlying graph, and we show that the problem of identifying $D$ can be formulated as the problem of testing equality of lower dimensional conditional distributions induced by the structure of $G$. We further show that the associated test statistics are functions of the quantities pertaining to the lower dimensional marginal distributions. The key advantage is that inference on marginal distributions is significantly less challenging than inference on conditional distributions. Besides the computational gain, we argue that the proposed approach addresses the issue of exploiting information on the structure of dependence in an efficient and elegant way.

Section snippets

Decomposition of the global hypothesis of equality of two Markov distributions

A major appeal of decomposable graphs in graphical modeling is that they allow for a clique-grained decomposition of the statistical model. Let $C_{1}, \dots, C_{k}$ be a sequence of cliques of $G$ satisfying the running intersection property (see Section 1 of Supplementary material), and let $S_{2}, \dots, S_{k}$ be an associated sequence of (possibly non-unique) separators. Then, if the distribution of $X_{V}$ is Markov relative to $G$, its joint distribution decomposes as $p (x_{V}) = p (x_{C_{1}}) \prod_{j = 2}^{k} p (x_{R_{j}} ∣ x_{S_{j}})$, where $R_{j} = C_{j} ∖ S_{j}$, $j \in \{2, \dots, k\}$.
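As a small numerical check of this decomposition, consider the chain graph on three binary variables with cliques $C_{1} = \{0, 1\}$, $C_{2} = \{1, 2\}$, separator $S_{2} = \{1\}$ and $R_{2} = \{2\}$. The sketch below builds a joint pmf from two arbitrary positive clique potentials (hypothetical values chosen only for illustration), so that it is Markov relative to the chain, and then compares it with the product $p (x_{C_{1}}) p (x_{R_{2}} ∣ x_{S_{2}})$ computed from its own marginals.

```python
from itertools import product

def marginal(p, idx):
    """Marginal pmf of the coordinates in idx, from a joint pmf {tuple: prob}."""
    out = {}
    for x, pr in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

# Hypothetical positive potentials on the cliques {0, 1} and {1, 2}.
psi01 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
psi12 = {(0, 0): 1.5, (0, 1): 0.7, (1, 0): 2.0, (1, 1): 1.0}

# Joint pmf proportional to the product of clique potentials.
raw = {x: psi01[(x[0], x[1])] * psi12[(x[1], x[2])]
       for x in product((0, 1), repeat=3)}
Z = sum(raw.values())
p = {x: v / Z for x, v in raw.items()}

# Clique and separator marginals computed from the joint.
pC1 = marginal(p, (0, 1))
pC2 = marginal(p, (1, 2))
pS2 = marginal(p, (1,))

def decomposed(x):
    """p(x_{C1}) * p(x_{R2} | x_{S2}), with the conditional as p(x_{C2}) / p(x_{S2})."""
    cond = pC2[(x[1], x[2])] / pS2[(x[1],)]
    return pC1[(x[0], x[1])] * cond

# Largest absolute discrepancy over all eight configurations.
max_err = max(abs(p[x] - decomposed(x)) for x in p)
print(max_err < 1e-12)  # True
```

The joint is specified through potentials rather than through the conditional factors themselves, so the agreement is a consequence of the Markov property and not of the way the example was constructed.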

The graphical seed set

Before we show how the result of the previous section can be used to make inference about the seed set, we need to introduce the concept of the graphical seed set. Namely, by employing a clique-grained decomposition, we are not always able to identify the minimal seed set; in those cases, we can identify a superset of it, which we denote by $D_{G}$. The relation between the two sets, which depends on both $D$ and $G$, is the subject of this section.

Definition 2 Graphical Seed Set

Let $D$ be a minimal seed set for $θ^{(1)}$ and $θ^{(2)}$ , two graphical

Simulation study 1

To study the finite sample behavior of ${\hat{D}}_{G}$, we considered a randomly generated graph $G$ consisting of 100 nodes grouped in 37 cliques (the largest clique containing 15 nodes). The code to reproduce all numerical experiments, as well as the real data analysis featured in Section 5, is available at https://github.com/veradjordjilovic/Seed-set. A plot of the graph is shown in Fig. 5 in Supplementary Material. The minimal seed set was set to $D = \{2, 5\}$. In the chosen graph, the graphical seed set does not

Biological validation

Genes and gene products cluster into functionally connected pathways, i.e. networks of biological interactions that describe their basic dynamics [12]. A large literature has developed around the problem of detecting statistically significant dysregulations of pathways in different experimental conditions [9], [11], [21], but translating detected dysregulations into claims about their origin is a challenging task. Chromosomal rearrangements offer a possible explanation. Chromosome

Discussion

The two-sample testing problem we consider is closely related to the problem of variable selection in logistic regression. When the predictor is a $p$-dimensional random vector $X$ and the output is a class label (1 or 2), the minimal seed set coincides with the Markov blanket of the response.
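A brief sketch of why this holds, under assumptions introduced here only for illustration: let $Y \in \{1, 2\}$ denote the class label with prior probabilities $π_{k} = P (Y = k) > 0$, let $X ∣ Y = k$ have density $p (x; θ^{(k)})$, and let $D$ be a seed set. Bayes' rule, combined with the simplification of the likelihood ratio for a seed set, gives

$$\frac{P (Y = 1 ∣ X = x)}{P (Y = 2 ∣ X = x)} = \frac{π_{1} \, p (x; θ^{(1)})}{π_{2} \, p (x; θ^{(2)})} = \frac{π_{1}}{π_{2}} \cdot \frac{p (x_{D}; θ_{D}^{(1)})}{p (x_{D}; θ_{D}^{(2)})}.$$

The posterior odds of the class label therefore depend on $x$ only through $x_{D}$, that is, $Y$ is conditionally independent of $X_{V ∖ D}$ given $X_{D}$. Taking $D$ minimal corresponds to taking the smallest set of predictors with this property.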

Modularity of graphical models is usually considered with regard to density factorization or parameter estimation. Theorem 1 mirrors this property in the hypothesis testing setting within the framework of strong

CRediT authorship contribution statement

Vera Djordjilović: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft, Writing – review & editing. Monica Chiogna: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision, Project administration.

Acknowledgments

We thank the Editor, Associate Editor and the Referees for useful comments and suggestions on an earlier version of the manuscript, which led to this improved version. Insight and expertise that greatly contributed to the development of this paper and the research behind it were generously provided by a number of colleagues, most notably Chiara Romualdi, Maria Sofia Massa and Elisa Salviato.

References (25)

Del Sol A. et al., Diseases as network perturbations, Curr. Opin. Biotechnol. (2010)

Anderson T.W., An Introduction to Multivariate Statistical Analysis (2003)

Barndorff-Nielsen O., Information and Exponential Families in Statistical Theory (2014)

Capitanio A. et al., Graphical models for skew-normal variates, Scand. J. Stat. (2003)

Chiaretti S. et al., Gene expression profiles of B-lineage adult acute lymphocytic leukemia reveal genetic patterns that identify lineage derivation and distinct mechanisms of transformation, Clin. Cancer Res. (2005)

Dawid A. et al., Hyper Markov laws in the statistical analysis of decomposable graphical models, Ann. Statist. (1993)

Dethlefsen C. et al., A common platform for graphical models in R: The gRbase package, J. Stat. Softw. (2005)

Frydenberg M. et al., Decomposition of maximum likelihood in mixed graphical interaction models, Biometrika (1989)

Goeman J.J. et al., A global test for groups of genes: testing association with a clinical outcome, Bioinformatics (2004)

Hudson N.J. et al., A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation, PLoS Comput. Biol. (2009)

Hummel M. et al., GlobalANCOVA: exploration and assessment of gene group effects, Bioinformatics (2008)

Kanehisa M. et al., KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res. (2000)
