Clustering ensemble methods look for consensus solutions from a set of base clustering algorithms, thus trying to combine into a single partition the information present in many different ones. Several authors have shown that these methods tend to reveal more robust and stable cluster structures than the individual clusterings in the ensemble (Fred 2001; Fred and Jain 2002; Strehl and Ghosh 2003). Leveraging an ensemble of clusterings is considerably more difficult than combining an ensemble of classifiers, due to the label correspondence problem: how to put in correspondence the cluster labels produced by different clustering algorithms? This problem is made more serious if clusterings with different numbers of clusters are allowed in the ensemble.

A possible solution to sidestep the cluster label correspondence problem has been proposed in the Evidence Accumulation Clustering (EAC) framework (Fred and Jain 2005). The core idea rests on the assumption that similar data points are very likely to be grouped together by some clustering algorithm and, conversely, that data points which co-occur very often in the same cluster should be regarded as very similar. Hence, it is reasonable to summarize a clustering ensemble in terms of a pair-wise similarity matrix, called the co-association matrix, where each entry counts the number of clusterings in the ensemble in which a given pair of data points is placed in the same cluster. This new mapping can then be used as input for any similarity-based clustering algorithm. In Fred and Jain (2005), agglomerative hierarchical algorithms are used to extract the consensus partition (e.g. Single Link, Average Link, or Ward’s Link). In Fred and Jain (2006), an extension is proposed, entitled Multi-Criteria Evidence Accumulation Clustering (Multi-EAC), which filters the cluster combination process using a cluster stability criterion. Instead of using all the information of the different partitions, it is assumed that, since algorithms can perform differently in different regions of the space, only certain clusters should be considered.

The way the co-association matrix is exploited in the literature is very naïve. Indeed, standard approaches based on EAC simply run a generic pairwise clustering algorithm with the co-association matrix as input. The underlying clustering criteria of such ad hoc algorithms, however, do not take advantage of the statistical interpretation of the computed similarities, which is an intrinsic part of the EAC framework. Also, the direct application of a clustering algorithm to the co-association matrix typically induces a hard partition of the data. Although having crisp partitions as a baseline for the accumulation of evidence of data organization is reasonable, this assumption is too restrictive in the phase of producing a consensus clustering. Indeed, the consensus partition is a solution that tries to accommodate the different clusterings in the ensemble, and by allowing soft assignments of data points to clusters we can preserve some information about their intrinsic variability and capture the level of uncertainty of the overall label assignments, which would not be detected in the case of hard consensus partitions. The variability in the clustering solutions of the ensemble might depend not only on the different algorithms and parametrizations adopted to build the ensemble, but also on the presence of clusters that naturally overlap in the data. This is the case for many important applications such as clustering micro-array gene expression data, text categorization, perceptual grouping, labelling of visual scenes and medical diagnosis. In these cases, having a consensus solution in terms of a soft partition also allows overlapping clusters characterizing the data to be detected. It is worth mentioning that the importance of dealing with overlapping clusters has been recognized long ago (Jardine and Sibson 1968) and, in the machine learning community, there has been a renewed interest around this problem (Banerjee et al. 2005a; Heller and Ghahramani 2007). As an alternative, the consensus extraction could be obtained by running fuzzy k-medoids (Mei and Chen 2010) on the co-association matrix as if it were a standard similarity matrix, or fuzzy k-means (Bezdek 1981) by interpreting each row of the co-association matrix as a feature vector. However, such solutions would not take into account the underlying probabilistic meaning of the co-association matrix and lack any formal statistical support.

In this paper, we propose a consensus clustering approach which is based on the EAC paradigm. Our solution fully exploits the nature of the co-association matrix and does not lead to crisp partitions, as opposed to the standard approaches in the literature. Indeed, it consists of a model in which data points are probabilistically assigned to clusters. Moreover, each entry of the co-association matrix, which is derived from the ensemble, is regarded as a realization of a Binomial random variable, parametrized by the unknown cluster assignments, that counts the number of times two specific data points are expected to be clustered together. A consensus clustering is then obtained by means of a maximum likelihood estimation of the unknown probabilistic cluster assignments. We further show that this approach is equivalent to minimizing the Kullback-Leibler (KL) divergence between the observed co-occurrence frequencies derived from the co-association matrix and the co-occurrence probabilities parametrizing the Binomial random variables. By replacing the KL-divergence with any Bregman divergence, we come up with a more general formulation for consensus clustering. In particular, we consider, as an additional example, the case where the squared Euclidean distance is used as divergence. We also propose an optimization algorithm to solve the minimization problem derived from our formulation, which works for any double-convex Bregman divergence, and a comprehensive set of experiments shows the effectiveness of our new consensus clustering approach.

The remainder of the paper is organized as follows. In Sect. 2 we provide definitions and notations that will be used across the manuscript. In Sects. 3 and 4, we describe the proposed formulation for consensus clustering and the corresponding optimization problem. In Sect. 5, we present an optimization algorithm that can be used to find a consensus solution. Section 6 briefly reviews related work, and Sect. 7 reports experimental results. Finally, Sect. 8 presents some concluding remarks. A preliminary version of this paper appeared in Rota Bulò et al. (2010).

2 Notation and definitions

Sets are denoted by upper-case calligraphic letters (e.g., \(\mathcal {O}\), \(\mathcal {E}\), …) except for \(\mathbb{R}\) and \(\mathbb{R}_{+}\), which represent as usual the sets of real numbers and non-negative real numbers, respectively. The cardinality of a finite set is written as |⋅|. We denote vectors with lower-case boldface letters (e.g., x, y, …) and matrices with upper-case boldface letters (e.g., X, Y, …). The ith component of a vector x is denoted by \(x_i\) and the (i,j)th component of a matrix Y is written \(y_{ij}\). Transposition is denoted by the symbol ⊤. The \(\ell_p\)-norm of a vector x is written \(\Vert \mathbf {x}\Vert_p\), and we implicitly assume the \(\ell_2\) (or Euclidean) norm when p is omitted. We denote by e n an n-dimensional column vector of all 1’s and by \(\mathbf {e}_{n}^{(j)}\) the jth column of the n-dimensional identity matrix. The trace of a matrix \(\mathbf {M}\in\mathbb{R}^{n\times n}\) is given by \(\operatorname {Tr}(\mathbf {M})=\sum_{i=1}^{n}m_{ii}\). The domain of a function f is denoted by dom(f), and \([\![P]\!]\) is the indicator function giving 1 if the predicate P is true and 0 otherwise.

A probability distribution over a finite set {1,…,K} is an element of the standard simplex \(\varDelta _K\), which is defined as

$$\varDelta _K = \bigl\{\boldsymbol {x}\in\mathbb{R}_+^K:\Vert \mathbf {x} \Vert_1=1\bigr\}. $$

The support σ(x) of a probability distribution \(\mathbf {x}\in \varDelta _K\) is the set of indices corresponding to the positive components of x, i.e.,

$$\sigma(\mathbf {x})= \bigl\{i\in\{1,\dots,K\}:x_i>0 \bigr\}. $$

The entropy of a probability distribution \(\mathbf {x}\in \varDelta _K\) is given by

$$H(\mathbf {x})=-\sum_{j=1}^Kx_j \log(x_j) $$

and the Kullback-Leibler divergence between two distributions \(\mathbf {x},\mathbf {y}\in \varDelta _K\) is given by

$$D_{\mathit{KL}}(\mathbf {x}\Vert \mathbf {y})=\sum_{j=1}^K x_j\log \biggl(\frac{x_j}{y_j} \biggr), $$

where we assume log0≡−∞ and 0log0≡0.

Given a continuously-differentiable, real-valued and strictly convex function \(\phi:\varDelta _{K}\!\rightarrow\!\mathbb {R}\), we denote by

$$B_\phi(\mathbf {x} \| \mathbf {y})=\phi(\mathbf {x})-\phi(\mathbf {y})-(\mathbf {x}-\mathbf {y})^\top \nabla \phi(\mathbf {y}) $$

the Bregman divergence associated with ϕ for points \(\mathbf {x},\mathbf {y}\in \varDelta _K\), where ∇ is the gradient operator. By construction, the Bregman divergence is convex in its first argument. If convexity also holds for the second one, then we say that the Bregman divergence is double-convex.

The Kullback-Leibler divergence is a special case of double-convex Bregman divergence, which is obtained by considering ϕ(x)=−H(x). In Table 1 we report other examples of double-convex Bregman divergences.
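To make this connection concrete, the following sketch (Python with NumPy; the function names are illustrative and not taken from any particular library) implements a generic Bregman divergence from a convex generator ϕ and its gradient, and checks numerically that the choice ϕ(x)=−H(x) recovers the KL-divergence.

```python
import numpy as np

def kl_div(x, y):
    """Kullback-Leibler divergence D_KL(x || y) between two points of the simplex."""
    return float(np.sum(x * np.log(x / y)))

def bregman_div(x, y, phi, grad_phi):
    """Generic Bregman divergence B_phi(x || y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return float(phi(x) - phi(y) - (x - y) @ grad_phi(y))

# Generator phi(x) = -H(x) (negative entropy) and its gradient.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
grad_neg_entropy = lambda x: np.log(x) + 1.0

x = np.array([0.7, 0.2, 0.1])
y = np.array([0.5, 0.3, 0.2])
# The two values coincide up to numerical precision.
print(kl_div(x, y), bregman_div(x, y, neg_entropy, grad_neg_entropy))
```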

Table 1 Examples of double-convex Bregman divergences

3 A probabilistic model for consensus clustering

Consensus clustering is an unsupervised learning approach that summarizes an ensemble of partitions obtained from a set of base clustering algorithms into a single consensus partition. In this section, we introduce a novel model for consensus clustering, which collapses the information gathered from the clustering ensemble into a single partition, in which data points are assigned to clusters in a probabilistic sense.

Let \(\mathcal {O}=\{1,\dots,n\}\) be the indices of a set of data points to be clustered and let \(\mathcal {E}=\{p_{u}\}_{u=1}^{N}\) be a clustering ensemble, i.e., a set of N clusterings obtained by different algorithms with possibly different parametrizations and/or initializations and/or sub-sampled versions of the data set. Each clustering \(p_{u} \in \mathcal {E}\) is a function \(p_{u}:\mathcal {O}_{u} \rightarrow \{1,\ldots,K_{u}\}\) assigning a cluster out of the K u available ones to data points in \(\mathcal {O}_{u}\subseteq \mathcal {O}\), where \(\mathcal {O}_{u}\) and K u can differ across the clusterings indexed by u. We put forward data sub-sampling as the most general framework for the following reasons: it favours the diversity of the clustering ensemble and it models distributed clustering scenarios in which each local clustering algorithm has access to only part of the data.

Since each clustering in the ensemble may stem from a sub-sampled version of the original dataset, some pairs of data points may not appear in all clusterings. Let Ω ij ⊆{1,…,N} denote the set of indices of clusterings in the ensemble where both data points i and j appear, i.e., \((u\in \varOmega_{ij}) \Leftrightarrow (\{i,j\}\subseteq \mathcal {O}_{u} )\), and let N ij =|Ω ij | denote its cardinality. Clearly, Ω ij =Ω ji and consequently N ij =N ji for all pairs (i,j) of data points. The ensemble of clusterings is summarized in the co-association matrix \(\mathbf {C}=[c_{ij}]\in\{0,\dots,N\}^{n\times n}\), where

$$c_{ij} = \sum_{u\in \varOmega _{ij}} [\![p_u(i)=p_u(j)]\!] $$

is the number of times i and j are co-clustered in the ensemble \(\mathcal{E}\); of course, \(0\leq c_{ij}\leq N_{ij}\leq N\) and \(c_{ij}=c_{ji}\).

In the standard EAC literature the co-association matrix holds the fraction of times two data points are co-clustered, while in our definition it holds the number of times this event occurs. The reason for this choice is that we allow subsampling in the ensemble construction. Consequently, the number of times two data points appear together in a clustering of the ensemble is not constant over all possible pairs. This renders the observed fraction of times co-clustering occurs statistically more significant for some pairs of data points and less so for others. By considering C in absolute terms and by keeping track of the quantities N ij we can capture this information. As an example, suppose the ensemble consists of N=100 partitions and that, due to subsampling, a pair of samples (i,j) co-appears in N ij =80 of them and is co-clustered 70 times; then N=100, N ij =80 and c ij =70.
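As an illustration of how C and the N ij ’s can be accumulated under sub-sampling, the following sketch (Python with NumPy; the input format, a list of dictionaries mapping data-point indices to cluster labels, is an assumption made only for this example) builds both matrices from a toy ensemble.

```python
import numpy as np

def build_coassociation(ensemble, n):
    """Accumulate co-clustering counts c_ij and co-appearance counts N_ij.

    ensemble: list of dicts, each mapping a data-point index in {0,...,n-1} to a
              cluster label; points missing from a dict were not sub-sampled
              in that clustering.
    Returns the symmetric n x n integer matrices C and N_pair.
    """
    C = np.zeros((n, n), dtype=int)
    N_pair = np.zeros((n, n), dtype=int)
    for labels in ensemble:
        idx = np.array(sorted(labels))                    # points present in this clustering
        lab = np.array([labels[i] for i in idx])
        N_pair[np.ix_(idx, idx)] += 1                     # every present pair co-appears once
        C[np.ix_(idx, idx)] += (lab[:, None] == lab[None, :]).astype(int)
    np.fill_diagonal(C, 0)
    np.fill_diagonal(N_pair, 0)
    return C, N_pair

# Toy ensemble over n = 4 points; point 3 is missing from the second clustering.
ensemble = [{0: 0, 1: 0, 2: 1, 3: 1}, {0: 0, 1: 1, 2: 1}]
C, N_pair = build_coassociation(ensemble, n=4)
print(C)       # e.g. c_01 = 1, c_23 = 1
print(N_pair)  # N_ij = 2 for pairs present in both clusterings, 1 otherwise
```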

Our model assumes that each data point has an unknown probability of being assigned to each cluster. We denote by \(\mathbf {y}_i=(y_{1i},\dots,y_{Ki})^\top\in \varDelta _K\) the probability distribution over the set of K clusters {1,…,K} which characterizes data point \(i\in \mathcal {O}\), i.e., \(y_{ki} = \mathbb{P}[ i \in \mathcal {C} _{k} ]\), where \(\mathcal {C} _{k}\) denotes the subset of \(\mathcal {O}\) that constitutes the k-th cluster. The model parameter K should not be understood as the desired number of clusters but rather as a maximum number of clusters. Without prior knowledge, K might coincide with the number of data points, i.e., K=n. Finally, we store all the y i ’s in a K×n matrix \(\mathbf {Y}=[\mathbf {y}_{1},\dots,\mathbf {y}_{n}]\in \varDelta _{K}^{n}\). The probability that data points i and j are co-clustered is thus

$$\sum_{k=1}^K \mathbb{P}[i \in \mathcal {C} _k, j \in \mathcal {C} _k] = \sum _{k=1}^K \mathbb{P}[i \in \mathcal {C} _k] \mathbb{P}[j \in \mathcal {C} _k] = \sum_{k=1}^K y_{ki} y_{kj} = \mathbf {y}_i^\top \mathbf {y}_j. $$

Let C ij , i<j, be a binomial random variable representing the number of times that data points i and j are co-clustered; from the assumptions above we have that \(C_{ij} \sim \mbox{Binomial} ( N_{ij} , \mathbf {y}_{i}^{\top} \mathbf {y}_{j} )\), that is

$$\mathbb{P} [ C_{ij} = c|{\mathbf {y}_i,\mathbf {y}_j} ] = \binom{N_{ij}}{c} \bigl( {\mathbf {y}_i^\top \mathbf {y}_j} \bigr)^c \bigl( 1- {\mathbf {y}_i^\top \mathbf {y}_j} \bigr)^{N_{ij} - c}. $$

Each element c ij , i<j, of the co-association matrix C is interpreted as a sample of the random variable C ij and, due to the symmetry of C, entries c ij and c ji are considered as the same sample. The model considers the different C ij ’s to be independent. This simplification is essential in practice because, by decoupling the pairwise, or higher order, correlations present in the consensus, the likelihood becomes more tractable. Consequently, the probability of observing C, given the cluster probabilities Y, is given by

$$\mathbb{P}[\mathbf {C}\mid \mathbf {Y}] = \prod_{ \{i,j\}\in \mathcal {P} } \binom{N_{ij}}{c_{ij}} \bigl(\mathbf {y}_i^\top \mathbf {y}_j\bigr)^{c_{ij}} \bigl(1- \mathbf {y}_i^\top \mathbf {y}_j\bigr)^{N_{ij}-c_{ij}} $$

where \(\mathcal {P} =\{ \{i,j\}\subseteq \mathcal {O}: i\neq j\}\) is the set of all distinct pairs of data points. Since we consider the observations c ij and c ji as being identical due to the symmetry of C, the product is taken over the set of distinct unordered pairs of data points.

We can now estimate the unknown cluster assignments by maximizing the log-likelihood \(\log\mathbb{P}[\mathbf {C}|\mathbf {Y}]\) with respect to Y, which is given by

$$\log\mathbb{P}[\mathbf {C}|\mathbf {Y}] = \sum_{ \{i,j\}\in \mathcal {P} } \log \binom{N_{ij}}{c_{ij}}+c_{ij}\log \bigl(\mathbf {y}_i^\top \mathbf {y}_j\bigr)+(N_{ij}-c_{ij})\log \bigl(1-\mathbf {y}_i ^\top \mathbf {y}_j\bigr). $$

This yields the following maximization problem, where terms not depending on Y have been dropped:

$$ \mathbf {Y}^*\in \mathop {\mathrm {arg\,max}}\limits_{\mathbf {Y}\in \varDelta _K^n} \biggl\{\sum_{\{i,j\}\in \mathcal {P} }c_{ij} \log \bigl( \mathbf {y}_i^\top \mathbf {y}_j \bigr)+(N_{ij}-c_{ij}) \log \bigl( 1- \mathbf {y}_i^\top \mathbf {y}_j \bigr) \biggr\}. $$
(1)

Matrix Y ∗, the solution of problem (1), provides probabilistic cluster assignments for the data points, which constitute the solution to the consensus clustering problem according to our model.
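For concreteness, the objective maximized in (1) can be evaluated directly from C, the N ij ’s and a column-stochastic Y; a minimal sketch follows (Python with NumPy; names are illustrative, and a small clamping constant is used only to keep the logarithms finite).

```python
import numpy as np

def loglik_objective(Y, C, N_pair, eps=1e-12):
    """Objective of (1): sum over pairs i<j of
    c_ij*log(y_i^T y_j) + (N_ij - c_ij)*log(1 - y_i^T y_j)."""
    A = Y.T @ Y                        # A[i, j] = y_i^T y_j
    A = np.clip(A, eps, 1.0 - eps)     # guard against log(0)
    iu = np.triu_indices_from(A, k=1)  # distinct unordered pairs i < j
    return float(np.sum(C[iu] * np.log(A[iu])
                        + (N_pair[iu] - C[iu]) * np.log(1.0 - A[iu])))

# A random column-stochastic Y (K clusters, n points) and synthetic counts.
rng = np.random.default_rng(0)
K, n = 3, 5
Y = rng.random((K, n)); Y /= Y.sum(axis=0, keepdims=True)
N_pair = np.full((n, n), 10); np.fill_diagonal(N_pair, 0)
C = np.triu(rng.integers(0, 11, size=(n, n)), k=1); C = C + C.T
print(loglik_objective(Y, C, N_pair))
```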

In Table 2 we summarize the notation introduced in this section.

Table 2 Summary of notation

4 A class of alternative formulations

The formulation introduced in the previous section for consensus clustering can be seen as a special instance of a more general setting, which will be described in this section.

Let \(\psi:\mathbb {R}\rightarrow\mathbb {R}^{2}\) be a function mapping a scalar to a 2-dimensional vector defined as ψ(x)=(x,1−x) and let \(d_{\phi}:\mathbb {R}\times\mathbb {R}\rightarrow\mathbb {R}\) be given as follows:

$$d_\phi(x_1,x_2)=B_\phi\bigl( \psi(x_1)\Vert\psi(x_2)\bigr). $$

Consider now the following class of formulations for consensus clustering, which is parametrized by a continuously-differentiable and strictly convex function \(\phi:\varDelta _{2}\rightarrow\mathbb {R}\):

$$ \mathbf {Y}^*\in \mathop {\mathrm {arg\,min}}_{\mathbf {Y}\in \varDelta _K^n}f(\mathbf {Y}), $$
(2)

where

$$ f(\mathbf {Y})=\sum_{\{i,j\}\in \mathcal {P} }N_{ij} \,d_{\phi} \biggl(\frac{c_{ij}}{N_{ij}},\mathbf {y}_i^\top \mathbf {y}_j \biggr). $$
(3)

Intuitively, the solution Y ∗ to (2) is a probabilistic cluster assignment yielding a minimum Bregman divergence between the observed co-occurrence statistics of each pair of data points and the estimated ones. Moreover, each term of f(Y) is weighted by N ij in order to account for the statistical significance of the observations.

The formulation in (2) encompasses the one introduced in the previous section as a special case. Indeed, by considering the parametrization ϕ(x)=−H(x), we have that B ϕ D KL , i.e., the Bregman divergence coincides with the KL-divergence, and by simple algebra the equivalence between (2) and (1) can be derived. For a formal proof we refer to Proposition 1 in Appendix.

Different algorithms for consensus clustering can be derived by adopting different Bregman divergences in (2), i.e., by changing the way errors between observed frequencies and estimated probabilities of co-occurrence are penalized. This is close in spirit to the work of Banerjee et al. (2005b), where a similar approach has been adopted in the context of partitional data clustering. In addition to the formulation corresponding to the KL-divergence, in this paper we also study the case where a squared ℓ2 penalization is considered in (3), i.e., when \(\phi(\mathbf {x})=\Vert \mathbf {x}\Vert^2\) and d ϕ becomes the squared Euclidean distance. This yields the following optimization problem:

$$ \mathbf {Y}^*\in \mathop {\mathrm {arg\,min}}_{\mathbf {Y}\in \varDelta _K^n} \biggl\{\sum_{\{i,j\}\in \mathcal {P} }N_{ij} \biggl( \frac{c_{ij}}{N_{ij}}-\mathbf {y}_i^\top \mathbf {y}_j \biggr)^2 \biggr\}. $$
(4)
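A minimal sketch of the general objective (3), instantiated both with the squared Euclidean divergence of (4) and with the KL-based divergence of Sect. 3 (Python with NumPy; helper names are illustrative):

```python
import numpy as np

def f_bregman(Y, C, N_pair, d_phi):
    """Objective (3): sum over pairs i<j of N_ij * d_phi(c_ij / N_ij, y_i^T y_j)."""
    A = Y.T @ Y
    iu = np.triu_indices_from(A, k=1)
    freq = C[iu] / np.maximum(N_pair[iu], 1)   # observed co-occurrence frequencies
    return float(np.sum(N_pair[iu] * d_phi(freq, A[iu])))

# Squared Euclidean divergence, as in (4).
sq_l2 = lambda x1, x2: (x1 - x2) ** 2

# KL-based divergence on psi(x) = (x, 1-x), recovering (up to the constant
# binomial terms and the sign) the log-likelihood objective of Sect. 3.
def kl_pair(x1, x2, eps=1e-12):
    x1 = np.clip(x1, eps, 1 - eps); x2 = np.clip(x2, eps, 1 - eps)
    return x1 * np.log(x1 / x2) + (1 - x1) * np.log((1 - x1) / (1 - x2))
```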

In the next section we will cover the algorithmic aspects of the computation of probabilistic assignments, which represent our solution to the consensus clustering problem.

5 Optimization algorithm

In this section, we describe an efficient optimization procedure that finds a local solution to (2) and works for any double-convex Bregman divergence. This procedure falls in the class of primal line-search methods because it iteratively finds a feasible descent direction, i.e., one satisfying the constraints and guaranteeing a local decrease of the objective.

This section is organized into four parts. The first part is devoted to the problem of finding a feasible descent direction, while the second part addresses the problem of searching for a better solution along that direction. In the third part, we summarize the optimization algorithm and provide some additional techniques to reduce its computational complexity. Finally, in the last part we show how our algorithm can be adapted to efficiently cluster large-scale datasets.

5.1 Computation of a search direction

Given a non-optimal feasible solution \(\mathbf {Y}\in \varDelta _{K}^{n}\) of (2), we can look for a better solution along a direction \(\mathbf {D}\in\mathbb {R}^{K\times n}\) by finding a value of ϵ such that f(Z ϵ )<f(Y), where Z ϵ =Y+ϵ D. The search direction D is said to be feasible and descending at Y if the two following conditions hold for all sufficiently small positive values of ϵ: \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\) and f(Z ϵ )<f(Y).

Our algorithm considers search directions at Y that are everywhere zero except for two entries lying on the same column. Specifically, it selects directions belonging to the following set:

$$\mathcal {D} (\mathbf {Y})= \bigl\{\bigl(\mathbf {e}_K^u-\mathbf {e}_K^v\bigr) \bigl(\mathbf {e}_n^j \bigr)^\top : j\in \mathcal {O}, u\in\{1,\dots,K\}, v\in\sigma(\mathbf {y}_j), u \neq v \bigr\}. $$

Here, the condition imposing vσ(y j ) guarantees that every direction in \(\mathcal {D} (\mathbf {Y})\) is feasible at Y (see Proposition 2 in Appendix). From this set, we greedily select the direction leading to the steepest descent, i.e., we look for a solution to the following optimization problem:

$$ \mathbf {D}^*\in \mathop {\mathrm {arg\,min}}_{\mathbf {D}\in \mathcal {D} (\mathbf {Y})} \biggl\{ \lim_{\epsilon\rightarrow 0} \frac{d}{d\epsilon}f(\mathbf {Y}+\epsilon \mathbf {D}) \biggr\}. $$
(5)

By exploiting the definition of \(\mathcal {D} (\mathbf {Y})\) the solution to (5) can be written as \(\mathbf {D}^{*}=(\mathbf {e}_{K}^{U}-\mathbf {e}_{K}^{V})(\mathbf {e}_{n}^{J})^{\top}\), where the indices U,V and J are determined as follows. Let U j , V j be given by

$$ U_j\in \mathop {\mathrm {arg\,min}}_{k\in\{1\dots K\}}\bigl[g_j(\mathbf {Y}) \bigr]_k \quad\text{and}\quad V_j\in \mathop {\mathrm {arg\,max}}_{k\in\sigma(\mathbf {y}_j)} \bigl[g_j(\mathbf {Y})\bigr]_k, $$
(6)

for all \(j\in \mathcal {O} \), where g j (Y) is the partial derivative of f with respect to y j , which is given by

$$ g_j(\mathbf {Y})=\frac{\partial}{\partial \mathbf {y}_j}f(\mathbf {Y})=\sum _{i\in \mathcal {P} _j}N_{ji}\mathbf {y}_i\frac{\partial d_\phi}{\partial x_2} \biggl(\frac{c_{ji}}{N_{ji}}, \mathbf {y}_j^\top \mathbf {y}_i \biggr). $$
(7)

Here \(\mathcal {P} _{j}=\{i\in \mathcal {O} :\{i,j\}\in \mathcal {P} \}\). Then, by Proposition 3 in Appendix, J can be computed as

$$ J\in \mathop {\mathrm {arg\,min}}_{j\in \mathcal {O}} \bigl\{\bigl[g_j(\mathbf {Y}) \bigr]_{U_j}-\bigl[g_j(\mathbf {Y})\bigr]_{V_j} \bigr\}, $$
(8)

while U=U J and V=V J .
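The direction selection of (6)–(8) can be sketched as follows for the squared Euclidean divergence, for which ∂d ϕ /∂x 2(a,b)=−2(a−b) when d ϕ (a,b)=(a−b)2 as in (4) (Python with NumPy; all names are illustrative):

```python
import numpy as np

def gradients(Y, C, N_pair):
    """g_j(Y) from (7) for the squared Euclidean divergence:
    g_j = sum_i N_ji * y_i * (-2) * (c_ji / N_ji - y_j^T y_i)."""
    A = Y.T @ Y                                  # A[j, i] = y_j^T y_i
    freq = C / np.maximum(N_pair, 1)             # c_ji / N_ji (zero where N_ji = 0)
    W = -2.0 * N_pair * (freq - A)               # pairwise weights
    np.fill_diagonal(W, 0.0)                     # exclude i = j
    return Y @ W.T                               # K x n matrix; column j is g_j(Y)

def select_direction(Y, G, tol=1e-12):
    """Indices (U, V, J) of the steepest feasible direction in D(Y), as in (6)-(8)."""
    n = Y.shape[1]
    U_j = G.argmin(axis=0)                       # minimizer over all k, per column
    G_support = np.where(Y > tol, G, -np.inf)    # restrict V_j to the support of y_j
    V_j = G_support.argmax(axis=0)
    scores = G[U_j, np.arange(n)] - G[V_j, np.arange(n)]
    J = int(scores.argmin())                     # steepest descent, as in (8)
    return int(U_j[J]), int(V_j[J]), J
```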

The search direction D at Y obtained from (5) is clearly feasible since it belongs to \(\mathcal {D} (\mathbf {Y})\) but it is also always descending, unless Y satisfies the Karush-Kuhn-Tucker (KKT) conditions, i.e., the first-order necessary conditions for local optimality, for the minimization problem in (2). This result is formally proven in Proposition 4 in Appendix.

5.2 Computation of an optimal step size

Once a feasible descending direction \(\mathbf {D}^{*}=(\mathbf {e}_{K}^{U}-\mathbf {e}_{K}^{V})(\mathbf {e}_{n}^{J})^{\top}\) is computed from (5), we have to find an optimal step size ϵ that allows us to achieve a decrease in the objective value. The optimal step is given as a solution to the following one dimensional optimization problem,

$$ \epsilon^*\in \mathop {\mathrm {arg\,min}}_{0\leq\epsilon\leq y_{\mathit{VJ}}}f(\mathbf {Z}_\epsilon), $$
(9)

where Z ϵ =Y+ϵ D and the feasible interval for ϵ follows from the constraint that \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\). This problem is convex thanks to the assumption of double-convexity imposed on the Bregman divergence (see Proposition 5 in Appendix).

Let ρ(ϵ′) denote the first order derivative of f with respect to ϵ evaluated at ϵ′, i.e.,

$$\rho\bigl(\epsilon'\bigr)= \lim_{\epsilon\rightarrow \epsilon'} \frac{d}{d\epsilon} f(\mathbf {Z}_\epsilon) = \bigl[g_J(\mathbf {Z}_{\epsilon'})\bigr]_U-\bigl[g_J(\mathbf {Z}_{\epsilon'})\bigr]_V. $$

By the convexity of (9) and Kachurovskii’s theorem (Kachurovskii 1960) we have that ρ is non-decreasing in the interval 0≤ϵy VJ . Moreover, ρ(0)<0 since D is a descending direction as stated by Proposition 4. Otherwise, we would have that Y satisfies the KKT conditions for local optimality.

In order to compute the optimal step size ϵ in (9) we distinguish two cases:

  • if ρ(y VJ )≤0 then ϵ ∗=y VJ , since f(Z ϵ ) is non-increasing over the feasible set of (9);

  • if ρ(y VJ )>0 then ϵ ∗ is a zero of ρ that can be found in general using a dichotomic search which preserves the opposite signs of ρ at the endpoints of the search interval.

Specifically, if the second case holds, the optimal step size ϵ ∗ can be found by iteratively updating the search interval as follows:

$$ \begin{aligned} \bigl(\ell^{(0)}, r^{(0)} \bigr) &= ( 0, y_{\mathit{VJ}} ) \\ \bigl(\ell^{(t+1)}, r^{(t+1)} \bigr)&= \begin{cases} (\ell^{(t)}, m^{(t)})& \text{if }\rho (m^{(t)} )>0,\\[3pt] (m^{(t)}, r^{(t)} ) & \text{if }\rho (m^{(t)} )< 0,\\[3pt] (m^{(t)}, m^{(t)} ) & \text{if }\rho (m^{(t)} )=0, \end{cases} \end{aligned} $$
(10)

for all t>0, where m (t) denotes the center of segment [ (t),r (t)], i.e., m (t)=( (t)+r (t))/2. Since an approximation of ϵ is sufficient, the dichotomic search is carried out until the interval size is below a given threshold. If δ is this threshold, the number of iterations required is log2(y VJ /δ) at worst.
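A minimal sketch of this step-size search (plain Python; the callable rho stands for the derivative ρ(ε) and must be supplied by the caller):

```python
def optimal_step(rho, eps_max, delta=1e-6):
    """Dichotomic search for the optimal step of (9)-(10).

    rho:     callable returning the derivative of f(Z_eps) at a given eps
    eps_max: upper bound y_VJ of the feasible interval
    delta:   tolerance on the length of the search interval
    """
    if rho(eps_max) <= 0.0:           # f(Z_eps) is non-increasing on the whole interval
        return eps_max
    lo, hi = 0.0, eps_max             # rho(lo) < 0 < rho(hi): bisect for the zero of rho
    while hi - lo > delta:
        mid = 0.5 * (lo + hi)
        r = rho(mid)
        if r > 0.0:
            hi = mid
        elif r < 0.0:
            lo = mid
        else:
            return mid
    return 0.5 * (lo + hi)
```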

In some cases (9) has a closed form solution. This of course depends on the nature of the Bregman divergence adopted. For instance, if we consider the squared ℓ2 distance as a divergence (i.e., \(\phi(\mathbf {x})=\Vert \mathbf {x}\Vert^2\)), then f(Z ϵ ) becomes a quadratic polynomial in the variable ϵ which can be trivially minimized in closed-form.

5.3 Algorithm

The proposed consensus clustering method is summarized in Algorithm 1. The input arguments consist of the ensemble of clusterings \(\mathcal {E}\), the parameter ϕ for the Bregman divergence, and an initial guess Y (0) for the cluster assignments (cluster assignments are uniformly distributed in the absence of prior knowledge).

Algorithm 1

Probabilistic Consensus Clustering (PCC)

At an abstract level, the algorithm iteratively finds a feasible descending direction D ∗ at the current solution Y (t), computes the optimal step ϵ ∗ and performs an update of the solution as Y (t+1)=Y (t)+ϵ ∗ D ∗. This procedure is iterated until a stopping criterion is met.

In order to obtain a time complexity per-iteration that is linear in the number of variables, we exploit the extreme sparseness of the search direction D ∗ for the update of the matrix \(\mathbf {Y}^{(t)\top} \mathbf {Y}^{(t)}\) (denoted by A (t) in the pseudocode) and for the update of the gradient vectors \(g_{i}^{(t)}\). Each iteration, indeed, depends on these two fundamental quantities. Specifically, the computation of A (t+1) can be obtained in O(n) by simply changing the Jth row and the Jth column of A (t) (it follows from the update formula at line 10). By exploiting A (t+1), the gradient vectors can be computed in O(Kn). In fact, we obtain \(g_{i}^{(t+1)}\) for all \(i\in \mathcal {O}\setminus\{J\}\) by performing a constant time operation on each entry of \(g_{i}^{(t)}\) (lines 12–14) and we compute \(g_{J}^{(t+1)}\) (line 15) in O(Kn) as well. Having A (t) and the gradient vectors computed allows us to find the search direction D ∗ at line 8 in O(nK), since it suffices to access each element of the gradient vectors only once to determine J, U and V. Moreover, the computation of the optimal step size at line 9 can be carried out in O(nlog2(1/δ)), if a dichotomic search is employed, and in constant time in cases where a closed-form solution exists (e.g., if \(\phi(\mathbf {x})=\Vert \mathbf {x}\Vert^2\)). Finally, the update of the solution at line 11 can be carried out in constant time thanks to the sparsity of D ∗. The time complexity of each iteration is thus given by O(nmax(K,log2(1/δ))).
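The O(n) bookkeeping for A (t) can be sketched as follows (Python with NumPy; this is an illustration of the update, not the paper's exact pseudocode). After this update, the solution itself is modified in constant time as Y[U,J] += ϵ and Y[V,J] −= ϵ.

```python
import numpy as np

def update_A_inplace(A, Y, U, V, J, eps):
    """Update A = Y^T Y in O(n) after the move y_J <- y_J + eps * (e_U - e_V).

    Only the J-th row and column of A change. Y must still contain the *old*
    column y_J when this function is called.
    """
    old_diag = A[J, J] + 2.0 * eps * (Y[U, J] - Y[V, J]) + 2.0 * eps ** 2
    delta = eps * (Y[U, :] - Y[V, :])   # change of y_J^T y_i for every i
    A[J, :] += delta
    A[:, J] += delta
    A[J, J] = old_diag                  # fix the doubly-updated diagonal entry
    return A
```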

The most costly part of the algorithm is the initialization (lines 2–5), which has O(n 2 K) time complexity. Hence, the overall complexity of the algorithm is O(n 2 K+mn×max(K,log2(1/δ))), where m is the number of required iterations, which is difficult to know in advance. As a rule of thumb, we need m∈Ω(nK) iterations to converge, because every entry of Y should be modified at least once. In that case the complexity is dominated by the iterations only.

Finally, the stopping criterion ideally should test whether D ∗ is a descending direction. Indeed, if this does not hold then we know that Y (t) satisfies the KKT conditions (it follows from Proposition 4 in Appendix) and we can stop. In practice, we simply check whether the quantity \([g_J(\mathbf {Y}^{(t)})]_{V}-[g_J(\mathbf {Y}^{(t)})]_{U}\) is below a given threshold τ and we stop if this happens. Indeed, if that quantity is precisely zero, then Y (t) satisfies the KKT conditions. Additionally, we put an upper bound on the number of iterations.

5.4 A note on scalability

In applications where the number of data points to cluster is very large, the computation of the whole co-association matrix becomes infeasible. In these cases one resorts to sparsifying the co-association matrix by keeping a number of entries that scales linearly with the number of data points.

Our algorithm can be easily adapted to deal with sparse co-association matrices. Assume that \(\mathcal {P} \) contains only a sparse set of observable data point pairs. Let ℓ be the expected average number of entries of \(\mathcal {P} _{i}\), i.e., \(\ell=\sum_{i\in \mathcal {O} }|\mathcal {P} _{i}|/n\), and assume that the input quantities c ij and N ij are given only for the pairs \(\{i,j\}\in \mathcal {P} \). Since we need to know the value of \(\mathbf {y}_{i}^{\top} \mathbf {y}_{j}\) again only for pairs of data points in \(\mathcal {P} \), the computation of A (0) is not fully required and only the entries indexed by \(\mathcal {P} \) should be computed. This reduces to O(Kℓn) the complexity of line 2 of Algorithm 1, where ℓ≪n. The same complexity characterizes the initialization of the gradient at lines 3–5. The subsequent updates of matrix A (t) at line 10 and of the gradient at lines 12–15 require only O(ℓ) and O(Kℓ) operations, respectively. By adopting a priority queue (e.g., heap based), the computation of the optimal direction in terms of U, V and J at line 8 requires only an overall complexity of O(Klog2(n)) per iteration. This can be achieved by initially storing in the priority queue the best values of U and V for all \(i\in \mathcal {O} \) and by updating the priorities based on the sparse changes in the gradient values. The optimal step at line 9 can be computed in O(log2(1/δ)), where δ is the tolerance for the dichotomic search. Finally, the update of Y remains constant-time. The overall per-iteration complexity becomes O(max(log2(1/δ),Klog2(n))). As for the number of iterations, the considerations made in Sect. 5.3 still hold.

6 Related work

Several consensus methods have been proposed in the literature (Fred 2001; Strehl and Ghosh 2003; Fred and Jain 2005; Topchy et al. 2004; Dimitriadou et al. 2002; Ayad and Kamel 2008; Fern and Brodley 2004). Some of these methods are based on the similarity between data points induced by the clustering ensemble, others are based on estimates of similarity between partitions, and others cast the problem as a categorical clustering problem. All these methods tend to reveal a more robust and stable clustering solution than the individual clusterings used as input for the problem. A very recent survey can be found in Ghosh et al. (2011).

Strehl and Ghosh (2003) formulated the clustering ensemble problem as an optimization problem based on the maximal average mutual information between the optimal combined clustering and the clustering ensemble, and presented three graph-theoretic algorithms to solve it. The first one, entitled Cluster-based Similarity Partitioning Algorithm (CSPA), uses a graph partitioning algorithm, METIS (Karypis and Kumar 1998), for extracting a consensus partition from the co-association matrix. The second and third algorithms, Hyper Graph Partitioning Algorithm (HGPA) and Meta CLustering Algorithm (MCLA), respectively, are based on hyper-graphs, where vertices correspond to data points and the hyper-edges, which allow the connection of several vertices, correspond to clusters of the clustering ensemble. HGPA obtains the consensus solution using a hyper-graph partitioning algorithm, HMETIS (Karypis et al. 1997), while MCLA uses a heuristic that clusters the clusters themselves.

Fern and Brodley (2004) reduce the problem to graph partitioning. The proposed model, entitled Hybrid Bipartite Graph Formulation (HBGF), uses as vertices both the instances and the clusters of the ensemble, retaining all of the information provided by the clustering ensemble and allowing the similarity among instances and among clusters to be considered. The partitioning of this bipartite graph is produced using the multi-way spectral graph partitioning algorithm proposed by Ng et al. (2001), which optimizes the normalized cut criterion (Shi and Malik 2000), or, as an alternative, the graph partitioning algorithm METIS (Karypis and Kumar 1998).

These approaches were later extended by Punera and Ghosh (2007, 2008) to allow soft base clusterings in the clustering ensemble, showing that the additional information in the ensemble is useful; the proposed models are soft versions of CSPA, MCLA, and HBGF. Additionally, they proposed using information-theoretic K-means (Dhillon et al. 2003), an algorithm very similar to K-means that differs only in the distance measure (the KL-divergence), to cluster in the feature space obtained by concatenating all the posteriors of the ensemble.

Topchy et al. (2003, 2004, 2005) proposed two different formulations, both derived from similarities between the partitions in the ensemble rather than similarities between data points, unlike co-association based approaches. The first one is a multinomial mixture model (MM) over the labels of the clustering ensemble, in which each partition is treated as a feature with categorical attributes. The second one is based on the notion of median partition and is entitled Quadratic Mutual Information Algorithm (QMI). The median partition is defined as the partition that best summarizes the partitions of the ensemble.

Wang et al. (2009, 2011) extended this idea, introducing a Bayesian version of the multinomial mixture model, the Bayesian cluster ensembles (BCE). Although the posterior distribution cannot be calculated in closed form, it is approximated using variational inference and Gibbs sampling, in a procedure very similar to that of latent Dirichlet allocation (LDA) models (Griffiths and Steyvers 2004; Steyvers and Griffiths 2007), but applied to a different input feature space, namely the labels of the ensemble. In Wang et al. (2010), a nonparametric version of BCE was proposed.

Ayad and Kamel (2008), following Dimitriadou et al. (2002), proposed the idea of cumulative voting as a solution to the problem of aligning cluster labels. Each clustering of the ensemble is transformed into a probabilistic representation with respect to a common reference clustering. Three voting schemes are presented: Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV).

Lourenço et al. (2011) modelled the problem of consensus extraction taking pairwise information as input space and using a generative aspect model for dyadic data. The consensus solution is found by solving a maximum likelihood estimation problem using the Expectation-Maximization (EM) algorithm.

Our framework is also related to Non-negative Matrix Factorization (Paatero and Tapper 1994; Lee and Seung 2000), which is the problem of approximately factorizing a given matrix M into two entrywise non-negative matrices F and G, so that M≈FG. Indeed, our formulation can be regarded as a kind of matrix factorization of the co-association matrix in terms of the matrix \(\mathbf {Y}^\top \mathbf {Y}\) under the constraint that Y is column stochastic. This particular setting has been considered, for the ℓ2 norm, in Arora et al. (2011) and in Nepusz et al. (2008).

7 Experimental results

In this section we evaluate our formulation using synthetic datasets and real-world datasets from the UCI Machine Learning Repository and the UCI KDD Archive. We performed four different series of experiments: (i) we study the convergence properties of our algorithm on synthetically generated co-association matrices, (ii) we compare the consensus clusterings obtained on different datasets with the known, crisp, ground truth partitions using standard evaluation criteria and we compare against other consensus clustering approaches, (iii) we perform an experiment on a large-scale dataset with incomplete partitions in the ensemble, (iv) we perform a qualitative analysis of a real-world dataset by deriving additional information from the probabilistic output of our algorithm.

We evaluate the performance of our Probabilistic Consensus Clustering (PCC) algorithm with the KL-divergence (PCC-KL) and with the squared ℓ2 divergence (PCC-ℓ2). From the quantitative perspective, we compare the performance of PCC-ℓ2 and PCC-KL against other state-of-the-art consensus algorithms: the classical EAC algorithm using as extraction criteria the hierarchical agglomerative single-link (EAC-SL) and average-link (EAC-AL) algorithms; Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh 2003); Hybrid Bipartite Graph Formulation (HBGF) (Fern and Brodley 2004); Mixture Model (MM) (Topchy et al. 2004, 2005); Quadratic Mutual Information Algorithm (QMI) (Topchy et al. 2003, 2005).

In order to evaluate the quality of a consensus clustering result against a hard, ground truth partition, we convert our probabilistic assignments into hard assignments according to a maximum likelihood criterion. We then compare two hard clusterings \(\mathcal{P}=\{\mathcal {P} _{1},\dots,\mathcal {P} _{k}\}\) and \(\mathcal {Q} =\{\mathcal {Q} _{1},\dots,\mathcal {Q} _{k}\}\) using the \(\mathcal{H}\) criterion based on cluster matching (Meila 2003) and the Adjusted-Rand index (Jain and Dubes 1988), which is based on counting pairs. Note that we assume without loss of generality that \(\mathcal {P} \) and \(\mathcal {Q} \) have the same number of elements, since we can add empty clusters where needed. The \(\mathcal{H}\) criterion (Meila 2003) gives the accuracy of the partitions and is obtained by finding the optimal one-to-one matching between the clusters in \(\mathcal {P} \) and the ground truth labels in \(\mathcal {Q} \):

$$ \mathcal{H}(\mathcal {P} ,\mathcal {Q} )= \frac{1}{n} \max_{\mathbf {v}} \sum _{j=1}^k |\mathcal {P} _{j}\cap \mathcal {Q} _{v_j}|, $$
(11)

where the vector v of the maximization runs over all possible permutations of the vector (1,…,k).
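Since the maximization in (11) is a maximum-weight bipartite matching on the contingency table between the two clusterings, it can be solved exactly with the Hungarian algorithm; a sketch using SciPy (function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def h_criterion(labels_p, labels_q, k):
    """Accuracy H(P, Q) of (11): best one-to-one matching of the k clusters."""
    n = len(labels_p)
    contingency = np.zeros((k, k), dtype=int)        # |P_j intersect Q_l|
    for a, b in zip(labels_p, labels_q):
        contingency[a, b] += 1
    row, col = linear_sum_assignment(-contingency)   # maximize the matched overlap
    return contingency[row, col].sum() / n
```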

When we have a soft ground truth partition given in terms of a probabilistic assignment \(\mathbf {Z}\in \varDelta _{k}^{n}\), we evaluate the divergence between a soft consensus partition \(\mathbf {Y}\in \varDelta _{k}^{n}\) and Z in terms of the Jensen-Shannon (JS) divergence. In more detail, let D JS (⋅∥⋅) denote the JS-divergence between two distributions given as points of Δ k . Then the divergence between Z and Y is given by

$$\mathcal{J}(\mathbf {Y},\mathbf {Z})=\frac{1}{n}\min_{\mathbf {P}}\sum _{i=1}^n D_{\mathit{JS}}(\mathbf {z}_i\Vert \mathbf {P}\mathbf {y}_i), $$

where the matrix P in the minimization runs over all possible k×k permutation matrices. Similarly to the case of hard partitions, we assume without loss of generality that Z and Y have the same number of rows, since we can add zero rows to fill the gap if needed. Note that \(0\leq \mathcal{J}(\mathbf {Y},\mathbf {Z})\leq 1\) holds for any \(\mathbf {Y},\mathbf {Z}\in \varDelta _{K}^{n}\).
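For the small values of k used here, the minimization over permutation matrices can be carried out by enumeration; a sketch follows (Python with NumPy, using base-2 logarithms so that each JS term, and hence \(\mathcal{J}\), lies in [0,1]; names are illustrative):

```python
import numpy as np
from itertools import permutations

def js_div(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 logarithms (value in [0, 1])."""
    p = np.clip(p, eps, 1.0); q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def j_criterion(Y, Z):
    """min over row permutations P of (1/n) sum_i JS(z_i || P y_i); Y and Z are k x n."""
    k, n = Y.shape
    best = np.inf
    for perm in permutations(range(k)):              # feasible only for small k
        val = np.mean([js_div(Z[:, i], Y[list(perm), i]) for i in range(n)])
        best = min(best, val)
    return best
```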

For the qualitative experiments, we analyse the probabilistic assignments Y in order to exploit the information about the cluster uncertainty. For this analysis, we remove probability values lower than a predefined threshold δ and then renormalize each column to sum to one. We measure the normalized similarity between two clusters i and j as the expected number of common elements over the expected cardinality of the ith cluster:

$$m_{ij}=\dfrac{\mathbb {E}_{\mathbf {Y}}[|\mathcal {P} _{i} \cap \mathcal {Q} _{j}|]}{\mathbb {E}_{\mathbf {Y}}[|\mathcal {P} _{i}|]}. $$

Given a set of data points \(\{\mathbf {x}_{i}\}_{i=1}^{n}\), we define the centroid of class k according to Y as

$$\boldsymbol {\mu}_{k}=\dfrac{\sum_j y_{kj}\vec{x}_{j}}{\mathbb {E}_{\mathbf {Y}}[|\mathcal {P} _{k}|]}. $$

Given the weighted matrix M=[m ij ], which is usually sparse, and the centroids, we can visualize the obtained clusters and their relationships simply by drawing them in the plane with a weighted graph. The structures found in these graphs, like paths or cliques, highly depend on the type of data and the geometry of the consensus set, leading to a different and interesting way of interpreting the consensus results.
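One plausible way of computing these quantities, under the additional assumption, made only for this sketch, that the expected overlap factorizes as \(\mathbb {E}_{\mathbf {Y}}[|\mathcal {P} _{i}\cap \mathcal {Q} _{j}|]=\sum_{l} y_{il}y_{jl}\), is the following (Python with NumPy; names are illustrative):

```python
import numpy as np

def cluster_relations(Y, X, thresh=0.05):
    """Normalized cluster-similarity matrix M and class centroids from Y.

    Y: K x n probabilistic assignments (columns on the simplex); X: n x d data matrix.
    """
    Yt = np.where(Y >= thresh, Y, 0.0)                          # drop small probabilities ...
    Yt = Yt / np.maximum(Yt.sum(axis=0, keepdims=True), 1e-12)  # ... and renormalize the columns
    expected_size = Yt.sum(axis=1)                              # E_Y[|P_k|]
    overlap = Yt @ Yt.T                                         # E_Y[|P_i intersect Q_j|] (factorized)
    M = overlap / np.maximum(expected_size[:, None], 1e-12)
    centroids = (Yt @ X) / np.maximum(expected_size[:, None], 1e-12)
    return M, centroids
```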

7.1 Simulated data

We first study the proposed formulation using a synthetic experiment with soft partitions as ground truth. A soft partition Y is determined by generating 4 isotropic, bivariate, planar Gaussian distributions, each consisting of 200 data points, with mean vectors randomly selected in the four orthants, and by computing for each point the normalized probability of having been generated by one of the 4 Gaussians. Given a soft partition Y we artificially generated an ensemble by randomly sampling N=1000 hard partitions with cluster assignments determined by Y and we constructed the corresponding co-association matrices. Figure 1(a) illustrates one example of such a dataset, where there is some overlap between the components, and Fig. 1(b) shows the corresponding co-association matrix. We generated 10 different datasets according to the aforementioned procedure.
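A sketch of this generation procedure (Python with NumPy; the exact means and spread of the Gaussians are illustrative choices, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per, K, N = 200, 4, 1000

# Four isotropic bivariate Gaussians with mean vectors in the four orthants.
means = np.array([[3, 3], [-3, 3], [-3, -3], [3, -3]]) * rng.uniform(0.5, 1.5, size=(4, 1))
X = np.vstack([rng.normal(mu, 1.5, size=(n_per, 2)) for mu in means])
n = X.shape[0]

# Soft ground truth Y*: normalized Gaussian responsibilities (K x n).
d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
resp = np.exp(-0.5 * d2 / 1.5 ** 2)
Ystar = (resp / resp.sum(axis=1, keepdims=True)).T

# Ensemble of N hard partitions sampled from Y*, accumulated into C.
C = np.zeros((n, n), dtype=int)
for _ in range(N):
    labels = np.array([rng.choice(K, p=Ystar[:, i]) for i in range(n)])
    C += (labels[:, None] == labels[None, :]).astype(int)
np.fill_diagonal(C, 0)
# No sub-sampling is used here, so N_ij = N for every pair.
```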

Fig. 1

Experiment with a 4-component bivariate Gaussian mixture: (a) an example of a dataset and (b) the corresponding synthetically generated co-association matrix

For each dataset we run our PCC-KL and PCC-ℓ2 algorithms with the purpose of recovering the ground truth soft partition Y ∗. Although the optimal number of clusters is K=4, we run our algorithms with a larger value of K=8. This is not a problem as our formulation can automatically tune itself to select a smaller number of clusters. Indeed, we can see from Fig. 2(a) the estimated cluster assignments Y corresponding to the dataset in Fig. 1, where only 4 components have significant probabilities, thus confirming our previous claim. We evaluated the divergence between the ground truth soft partition and the one recovered by our algorithms on each of the 10 datasets using the \(\mathcal {J}\)-criterion introduced at the beginning of Sect. 7. Both our algorithms obtained an average divergence of 0.0012 and a standard deviation of ±0.00005, which indicate a good recovery of the ground truth probabilistic cluster assignments.

Fig. 2

4-component bivariate Gaussian mixture: Y, the estimated cluster assignments, and Y ∗, the ground truth cluster assignments

7.2 UCI and synthetic data

We followed the usual strategy of producing clustering ensembles and combining them using the co-association matrix. Two different types of ensembles were created: (1) using k-means with random initialization and a random number of clusters (Lourenço et al. 2010), splitting natural clusters into micro-blocks; (2) combining multiple algorithms (agglomerative hierarchical algorithms: single, average, ward, centroid link; k-means Jain and Dubes 1988; spectral clustering Ng et al. 2001) with different numbers of clusters, inducing block-wise co-association matrices.

Table 3 summarizes the main characteristics of the UCI and synthetic datasets used in the evaluation, and the parameters used for generating ensemble (2). Figure 3 illustrates the synthetic datasets used in the evaluation: (a) rings; (b) image-1.

Fig. 3

Sketch of the Synthetic Data Sets

Table 3 Benchmark datasets

We summarize the performance of both algorithms over several runs, accounting for possibly different solutions due to initialization, in terms of the \(\mathcal{H}\) criterion (Tables 4 and 5) and the Adjusted-Rand index (Tables 6 and 7). For both validation indices we present the average performance (avg), the standard deviation (std), the maximum value (max), and the minimum value (min), highlighting in bold the best results for each dataset.

Table 4 Results from experiments conducted with an ensemble of type 1 evaluated with criterion \(\mathcal{H}\). See Sect. 7.2 for details
Table 5 Results from experiments conducted with an ensemble of type 2 evaluated with criterion \(\mathcal{H}\). See Sect. 7.2 for details
Table 6 Results from experiments conducted with an ensemble of type 1 evaluated with the Adjusted-Rand index. See Sect. 7.2 for details
Table 7 Results from experiments conducted with an ensemble of type 2 evaluated with the Adjusted-Rand index. See Sect. 7.2 for details

The performance of PCC-KL and PCC-ℓ2 depends on the type of ensemble. On ensemble (1), PCC-KL and PCC-ℓ2 generally perform worse than EAC and CSPA (both on the Adjusted-Rand index and CI), which seem very well suited to this kind of ensemble. Nevertheless, on the UCI datasets both obtain promising results: PCC-ℓ2 is the best on 1 (of 9) datasets, and PCC-KL is the best on 1 (of 9) and very close to the best consensus in several other situations. On ensemble (2), PCC-KL obtains the best results on almost all datasets, 7 (of 9).

It is also very important to notice that the standard deviation of the proposed methods is very low, being very close to zero on almost every dataset.

Figure 4 shows examples of the obtained co-association matrices, reordered according to the VAT algorithm (Bezdek and Hathaway 2002) to highlight the clustering structure. The color scheme ranges from black (c ij =0) to white (c ij =N ij ), corresponding to the magnitude of the similarity. As can be seen, the co-association matrix of ensemble (1) does not have such an evident blockwise structure, since it was produced by splitting natural clusters into smaller ones, inducing micro-blocks in the co-association matrix; for ensemble (2), the co-association matrices have a much more blockwise form, since they were generated by combining several algorithms with numbers of clusters ranging from small to large. The results show that blockwise matrices are very well suited to the proposed model, even in cases where there is much overlap.

Fig. 4

Example of co-association matrices obtained with ensemble (1) and (2)—reordered using VAT (Bezdek and Hathaway 2002)

7.3 A large-scale experiment

In order to show that our algorithm can also be used on large-scale datasets, we propose here an experiment on the KDD Cup 1999 dataset. From the available datasets we analysed a subset of “kddcup.data10percent”, consisting of 120,000 data points characterized by 41 attributes distributed in 3 classes. The preprocessing consisted of standardizing the numerical features and discretizing the categorical ones, arriving at a 39-dimensional feature space. We produced an ensemble consisting of 100 K-means partitions obtained on random subsets of the dataset (sampling rate 50 %) with random initializations and random numbers of clusters (2≤K≤10). Since the ensemble is composed of incomplete partitions, the consensus clustering phase becomes more challenging.

In order to cope with the large number of data points, which renders the construction of the full co-association matrix impossible from both a space and a computational time perspective, we run the sparsified version of our algorithm described in Sect. 5.4. Specifically, we created \(\mathcal {P} \) by sampling 0.25 ‰ of the available data point pairs (around 8 billion). Our algorithms (PCC-ℓ2 and PCC-KL) were run with a maximum number of nK iterations. Our non-parallelized C implementations of PCC-ℓ2 and PCC-KL took on average 13.8 s and 16.7 s, respectively, to deliver a solution on a dual-core 64-bit Pentium at 2.8 GHz with 4 GB of RAM (only one core was effectively used). We were able to compare our algorithm only against CSPA, which nevertheless obtained competitive results in the previous set of experiments. All other approaches could not be run, either due to the large size of the dataset or because of their inability to handle incomplete partitions in the ensemble.

We report in Fig. 5 the results obtained in terms of accuracy (\(\mathcal{H}\)-criterion) by the algorithms at varying values of the parameter K. At the optimal number of clusters, K=3, all approaches achieve their best score, but our approach outperforms CSPA both when the KL-divergence and when the squared ℓ2-divergence are used, the former being slightly better than the latter. Moreover, it turns out that our approach can automatically tune the optimal number of clusters, thus being more robust to overestimations of the parameter K. Indeed, we remark that the parameter K for our approach is intended as a maximum number of clusters rather than the desired number of clusters that the algorithm must deliver. The partitions in the ensemble achieved on average an accuracy of 79 %. Clearly, due to the presence of incomplete partitions, this score has been computed by considering only the data points that have effectively been used in each partition.

Fig. 5

Results obtained on a large-scale dataset. We report the accuracy obtained by the proposed algorithms (PCC-ℓ2 and PCC-KL) and by a competing algorithm (CSPA) at varying values of the parameter K, the optimal one being K=3

Our consensus solution provides a considerable improvement of this score, confirming the effectiveness of our algorithm also in the presence of incomplete partitions in the ensemble.

In Fig. 6 we report the results in terms of accuracy when varying the percentage of observed entries in the co-association matrix. The performance remains essentially constant for percentages larger than 0.08 ‰. By further reducing the number of observed entries, we experience a performance drop, as one would expect.

Fig. 6

Results obtained on a large-scale dataset. We report the accuracy obtained by the proposed algorithms (PCC-ℓ2 and PCC-KL) when varying the percentage of observed entries

7.4 Visualizing probabilistic relations

It was observed in Färber et al. (2010) and Cui et al. (2010) that we can discover new structures in data using previously known information. A case study we consider here is the PenDigits dataset (Frank and Asuncion 2012). The PenDigits dataset contains handwritten digits produced by different persons. Each digit is stored as a sequence of 8 (x,y) positions, collected at different time intervals during the execution of each single digit. A manual analysis in Cui et al. (2010) highlights that the same digit can be written in different ways, but this information is not contained in the ground truth, which simply collects all instances of the same digit in one class. These observations become apparent if we build a consensus matrix for each digit taken in isolation and then visualize the obtained classes. Each digit has different ways of being written, but some of them are not completely different, so when building the consensus matrix we can see that these classes overlap. As an example we consider the digit ‘2’. In Fig. 7(a) we can see that the co-association matrix contains 5 blocks, two of which, the second and the third, are highly overlapping. The resulting matrix Y (Fig. 7(b)) reflects the overlap by assigning uncertainty to the two overlapping classes. The uncertainty is not symmetric: the third class actually seems to be a subclass of the second. In Fig. 8 we can see the five centroids of the classes and their pairwise similarities. The centroids are ordered so that the upper image is the first class centroid and the others follow in clockwise order. Each centroid visualizes eight points and the order in which they appear. The visualization of the centroids gives an explanation of the similarities/diversities that are numerically encoded on the edges; in particular, the dependence of class three on classes one and two is clear.

Fig. 7

On the left, a co-association matrix is built on a subsample of the digit ‘2’ for the PenDigit dataset. Five blocks are present but they are not all clearly separated. On the right, resulting matrix Y obtained by running PCC-KL on the co-association matrix. The overlap of the second and third blocks is captured by the uncertainties in the second and third lines

Fig. 8

Graph of the directed, weighted relations among the centroids found in class ‘2’ of the PenDigit dataset. The first class is the upper image; the other classes are ordered clockwise. Each centroid is made of 8 ordered points. The five centroids correspond to the five blocks of Fig. 7(a) and the five rows of Fig. 7(b). Each edge ij is weighted with m ij (see the introduction of Sect. 7). The third class shows a strong dependence, 0.48, on the second and the first ones, which reflects the visual similarity of the respective centroids

8 Conclusions

In this paper, we introduced a new probabilistic consensus clustering formulation based on the EAC paradigm. Each entry of the co-association matrix, derived from the ensemble, is regarded as a realization of a Binomial random variable, parametrized by the unknown class assignments. We showed that maximizing the log-likelihood under this model is equivalent to minimizing the KL divergence between the co-association relative frequencies and the co-occurrence probabilities parametrizing the Binomial random variables. This formulation can be seen as a special case of a more general setting obtained by replacing the KL divergence with any Bregman divergence. We proposed an algorithm to find a consensus clustering solution according to our model, which works with any double-convex Bregman divergence. We also showed how the algorithm can be adapted to deal with large-scale datasets. Experiments on synthetic and real-world datasets have demonstrated the effectiveness of our approach with ensembles composed of heterogeneous partitions obtained from multiple algorithms (agglomerative hierarchical algorithms, k-means, spectral clustering) with varying numbers of clusters. Additionally, we have shown that our algorithm is able to deal with large-scale datasets and can be successfully applied also in the case of ensembles with incomplete partitions. On different datasets and ensembles, we outperformed the competing state-of-the-art algorithms and showed outstanding results on the large-scale experiment. The qualitative analysis of the probabilistic consensus solutions provided some evidence that the proposed formulation can discover new structures in data. For the PenDigits dataset, we showed visual relationships between overlapping clusters representing the same digit, using the centroids of each cluster and the similarities between clusters obtained from the probabilities of the consensus solution.