1 Clustering based on nonparametric density estimation

1.1 Introduction

The traditional approach to clustering is based on a suitably specified concept of dissimilarity between objects. Alternatively, the model-based approach, which has received much attention in recent years, introduces a finite mixture of density functions belonging to a given parametric class. A third, less common, strand has roots going back at least to Wishart (1969), who argued that “natural” groupings should be associated with the modes of the underlying density function and put forward a procedure, called “Mode analysis”, which selects clusters as disjoint sets of points with density above a certain level. A similar idea has been discussed by Hartigan (1975, Chap. 11), where clusters are defined as the connected components of the level sets of the density function, to be estimated by a nonparametric method. Varying the level which identifies the contour sets gives rise to the cluster tree, a hierarchical structure where each leaf corresponds to a mode of the density function. At that time, however, these principles were difficult to translate into viable procedures for routine use, owing to the computational effort involved. It is only recently that this approach could be reconsidered, leading to a few operational proposals.

This approach to cluster analysis claims some desirable properties over both distance- and model-based clustering. Firstly, the flexibility of nonparametric density estimators allows the detection of groups of arbitrary shape. In contrast, the shape of model-based clusters depends on the choice of the mixture components, and distance-based methods are known to favour specific clustering structures. Secondly, the lack of an explicit definition of cluster in the distance-based formulation is reflected in the heuristic criteria adopted for selecting their number. In the density-based approach, clusters correspond to the regions around the modes of the data distribution; hence their number is conceptually well defined and, in turn, operationally estimable (although this may not be a trivial task).

In spite of its generally appealing properties, density-based clustering cannot, of course, be a panacea for all possible problems, once subject-matter considerations and other problem-specific issues are taken into account. As observations are allocated to clusters according to the domains of attraction of the modes, the resulting groups may be neither well separated nor markedly cohesive, which may not be desirable in some applications.

Moreover, an awkward aspect has seriously limited the application of this approach. While finding high-density connected sets is both conceptually and computationally trivial in the unidimensional case, there is no obvious method for representing them in multidimensional spaces. This explains why the specialized literature has mainly focused on methods to recognize the connected components associated with density contours. Cuevas et al. (2000, 2001) and, more recently, Rinaldo and Wasserman (2010) represent clusters by focusing on a single level of the estimated density and approximate the connected regions by the union of closed balls centered at the high-density data points. Stuetzle (2003) and Stuetzle and Nugent (2010) identify the cluster tree by means of graph theory, and then prune it to remove spurious modes. Ooi (2002) detects dense regions by recursive partitioning of the sample space. Related methods building on the idea of associating groups with high-density connected sets have been proposed by the machine learning community (e.g. Kriegel et al. 2011). A key difference with respect to the techniques described above is that the notion of density adopted there often does not correspond to a probability distribution.

We shall work with a further density-based clustering procedure, proposed by Azzalini and Torelli (2007), and referred to as PdfCluster. The authors exploit spatial tessellation and related concepts to generalize the notion of contiguity among points in several dimensions and determine the connected level sets.

The computational complexity of PdfCluster depends only mildly on the sample size, but it grows exponentially with the data dimensionality, thus making the application of the procedure hardly feasible for large dimensions.

The present work aims at improving the PdfCluster method by developing a viable solution to the problem of finding connected sets in higher-dimensional spaces. We propose a generalization of the ideas underlying the univariate procedure, shifting the formulation from a space of arbitrary dimension to one dimension only, thus easing both computation and visualization.

After summarizing PdfCluster, we present our method to cluster multidimensional data, as a natural extension of the univariate procedure. Then, we discuss some inherent aspects related to the curse of dimensionality and its effects on computation and density estimation. Some numerical illustrations follow.

1.2 Description of the PdfCluster procedure

Let \(\mathcal{X}=(x_{1},\ldots,x_{n})\), \(x_{i} \in\mathbb{R}^{d}\), be the observations to be clustered, assumed to be a simple random sample from a d-dimensional continuous random variable with unknown density function f, which is only assumed to be differentiable and bounded. For each constant k, 0 ≤ k ≤ max f, a section of the density function at level k identifies the level set:

$$R(k) = \bigl\{x: x \in\mathbb{R}^d, f(x)\geq k\bigr\}. $$

When f is unimodal, R(k) is a connected set. Otherwise, it may or may not be connected. If disconnected, it is formed by two or more connected components, corresponding to the regions around the modes of f which are intersected by the section at level k. Figure 1 illustrates this notion with a simple bivariate example.

Fig. 1 A sample simulated from three subpopulations (left plot). A section at level k cuts the trimodal density function (right plot) and identifies two connected regions. The associated level sets are indicated on the left panel (dashed lines)

In order to describe the clustering structure, Azzalini and Torelli (2007) introduce the concept of mode function. For any given choice of the level k, R(k) is associated with both the number of its connected components m and the probability \(p=\int_{R(k)} f(x)\,\mathrm{d}x\). As p is monotone with respect to k, a step function m(p) may be defined, giving the correspondence between p and the number of connected components of R(k), as p varies in (0,1). With the convention that m(0)=m(1)=0, it may be shown that the total number of increments of m(⋅), counted with their multiplicity, equals the number of modes of f. The mode function is closely related to Hartigan’s cluster tree (Hartigan 1975). An illustration of the two concepts is given in Fig. 2.
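As a purely illustrative sketch, the empirical mode function can be traced by scanning a grid of levels k. Here fhat denotes the vector of density estimates at the observed points, p is approximated by the proportion of observations belonging to the sample level set (one simple plug-in approximation, assumed here for illustration), and n_components is a hypothetical placeholder for a routine counting the connected components of that set, e.g. by one of the devices discussed below.

```r
## Illustrative sketch of the empirical mode function; 'n_components' is a
## hypothetical routine counting the connected components of S(k).
mode_function <- function(fhat, n_components, n_grid = 50) {
  k_grid <- seq(0, max(fhat), length.out = n_grid)
  p <- sapply(k_grid, function(k) mean(fhat >= k))  # plug-in approximation of p
  m <- sapply(k_grid, n_components)                 # components of S(k)
  data.frame(p = p, m = m)                          # step function m(p)
}
```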

Fig. 2 Empirical mode function (left) and cluster tree (right) for the points on the left side of Fig. 1

The empirical analogue of R(k) is defined as \(\hat{R}(k)= \{x: x \in \mathbb{R}^{d}, \hat{f}(x)\geq k\}\), where \(\hat{f}\) is a nonparametric estimate of f, assumed to be continuous and bounded.

In multidimensional spaces, there is no obvious method to establish whether a region is connected. As interest focuses on the allocation of the observations in \(\mathcal{X}\) only, a first step to simplify the problem is to restrict attention to the sample level set

$$S(k)=\bigl\{x_i: x_i \in\mathcal{X}, \hat{f}(x_i) \geq k\bigr\}, $$

instead of \(\hat{R}(k)\). Hence, the problem of identifying connected sets is shifted from a continuous multidimensional space to a discrete set. Graph theory comes to our aid, because the detection of the connected components of S(k) may be converted into the detection of the connected components of a graph \(\mathcal{G}\) whose vertices are the elements of S(k). The key issue then becomes how to suitably define the edges of this graph.

In the case of PdfCluster, the task is pursued by making use of the Delaunay triangulation, the graph associated with a spatial tessellation known as the Voronoi diagram. Given \(\mathcal{X}\), the Voronoi diagram is a partition of \(\mathbb{R}^{d}\) into n regions \(\mathcal{V}(x_{1}), \ldots, \mathcal{V}(x_{n})\) such that each region \(\mathcal{V}(x_{i})\) is the set of all points of \(\mathbb{R}^{d}\) closer to \(x_i\) than to any other point of \(\mathcal{X}\), i=1,…,n, according to some measure of distance. From the Voronoi tessellation, the Delaunay graph is obtained by connecting with an edge the pair \((x_i, x_j)\) whenever the corresponding Voronoi regions share a portion of their boundary facets. See the left panel of Fig. 3. From a computational point of view, the Delaunay triangulation may be obtained directly, without building the Voronoi diagram.
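For concreteness, a Delaunay graph of this kind can be built in R along the following lines; this is only an illustrative sketch based on the geometry and igraph packages, not the implementation used by the pdfCluster package.

```r
## Illustrative sketch: Delaunay graph of the sample, built from the simplices
## returned by geometry::delaunayn() (not the actual PdfCluster code).
library(geometry)
library(igraph)

delaunay_graph <- function(dat) {
  # dat: n x d data matrix; each row of 'simplices' indexes d+1 mutually
  # connected observations
  simplices <- delaunayn(dat)
  pairs <- lapply(seq_len(nrow(simplices)),
                  function(i) t(combn(simplices[i, ], 2)))
  edges <- unique(t(apply(do.call(rbind, pairs), 1, sort)))  # drop duplicates
  graph_from_edgelist(edges, directed = FALSE)
}

## Example usage on a small bivariate sample
# dat <- matrix(rnorm(200), ncol = 2)
# G <- delaunay_graph(dat)
```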

Fig. 3 The left plot displays the Voronoi tessellation (dashed lines) for the points on the left side of Fig. 1, with the Delaunay triangulation superimposed (solid lines). In the right plot, the edges connecting pairs of observations with density below the threshold k, represented by the plane in the right plot of Fig. 1, have been removed from the Delaunay triangulation

For a given section k of the density estimate \(\hat{f}\), a subgraph is obtained by removing from the Delaunay triangulation of \(\mathcal{X}\) the sample points not belonging to S(k), together with the edges having at least one vertex among them. The connected components of this subgraph are identified as unions of connected pairs sharing at least one vertex; see, e.g., Even (2011) for a description of some dedicated algorithms. Connected points share one facet of the original Delaunay triangulation, i.e. they belong to adjacent Voronoi regions. As the union of these points represents the connected components of S(k), the corresponding Delaunay facets approximate the connected components of \(\hat{R}(k)\). An illustration is given in the right panel of Fig. 3.

This operation, replicated over a grid of possible values of k, and the associated values of p, gives rise to the cluster tree and to the empirical mode function. In this way, the so-called cluster cores are determined, formed only by the data lying in the regions around the detected modes. The remaining low-density points are allocated to the cluster core for which the log-likelihood ratio between the two highest densities, conditional on the group, is maximum. In this work, a minor modification of the classification procedure originally proposed by Azzalini and Torelli (2007) has been considered, in order to account for the lower precision of the estimates at the low-density points, by weighting the log-likelihood ratios inversely with their standard errors.
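A minimal sketch of the cluster-core step, assuming a graph G on the observations (for instance the Delaunay graph sketched above) and the vector fhat of density estimates at the observed points, might look as follows; the subsequent allocation of the low-density points is omitted.

```r
## Illustrative sketch: connected components of the level-set subgraph at
## level k, giving the cluster cores (low-density points are left unallocated).
library(igraph)

cluster_cores <- function(G, fhat, k) {
  high <- which(fhat >= k)                  # sample level set S(k)
  G_k  <- induced_subgraph(G, vids = high)  # drop low-density vertices and
                                            # every edge touching them
  comp <- components(G_k)$membership        # connected components of S(k)
  labels <- rep(NA_integer_, length(fhat))
  labels[high] <- comp                      # NA marks points outside S(k)
  labels
}
```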

The use of the Delaunay triangulation enjoys some appealing features which make it especially suitable for the aim of detecting connected components of the density level sets. Firstly, each pair of edge-connected observations shares one facet of the original Delaunay graph. The union of these facets forms a polytope in \(\mathbb{R}^{d}\), that is, a connected set. It follows that, for large n, the Delaunay triangulation approximates the unobserved connected components of \(\hat{R}(k)\). Moreover, as the Voronoi diagram partitions \(\mathbb{R}^{d}\) into n regions on the basis of a measure of distance, some connections with traditional distance-based clustering methods can be established: (a) clusters detected by the single-linkage algorithm, for instance, correspond to the connected sets associated with the minimum spanning tree (Gower and Ross 1969), which is a subgraph of the Delaunay triangulation; (b) an efficient way of approximating complete-linkage clustering is to first compute the Voronoi diagram (Krznaric and Levcopoulos 1998); (c) the k-means clustering method produces a partition of the observations according to the Voronoi diagram generated by the k centroids (Du et al. 1999).

The computational complexity of the Delaunay triangulation is competitive with standard methods based on distance measures when d≤3, because at most \(O(n \log n)\) operations are required. However, it grows as \(O(n^{\lfloor d/2\rfloor})\) for larger d. The triangulation, and the related PdfCluster procedure, therefore become burdensome above, say, 6 dimensions. Hence the need to derive an alternative to the Delaunay triangulation that is computationally feasible in fairly large dimensions.

2 On finding connected components

2.1 Description of the proposed method

The simplicity of the unidimensional procedure for finding the connected sets associated with \(\hat{R}(k)\) suggests a suitable generalization to the multidimensional setting. Subsets of the real line are connected if and only if they are intervals. Hence, any given pair \((x_1,x_2)\in S(k)\) is connected if the whole segment with end points \(x_1\) and \(x_2\) is included in S(k). In order to decide whether this condition is satisfied, the reasoning may proceed from two different standpoints. Since \(\mathbb{R}\) is totally ordered, it is possible to move in a sequential manner by first evaluating the observation \(x_3\) contiguous to \(x_1\) in the direction of \(x_2\). If \(x_3 \in S(k)\), then the next observation contiguous to \(x_3\) is evaluated. If all the adjacent observations in the interval \([x_1,x_2]\) belong to S(k), then the pair \((x_1,x_2)\) is connected. If, instead, the interval \([x_1,x_2]\) is broken by a lower-density observation, a new connected set arises. Note that the PdfCluster approach suitably generalizes this first idea by considering the notion of contiguity induced by the Voronoi diagram.
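In one dimension this sequential scheme reduces to a few lines of code; the following sketch (purely illustrative) labels the maximal runs of contiguous observations belonging to S(k).

```r
## Illustrative sketch of the unidimensional case: two points of S(k) are in
## the same connected set when no lower-density observation lies between them.
connected_sets_1d <- function(x, fhat, k) {
  ord  <- order(x)                        # total order of the real line
  keep <- fhat[ord] >= k                  # membership in S(k), in sorted order
  # a new connected set starts whenever a kept point follows a discarded one
  run_id <- cumsum(c(keep[1], diff(keep) == 1))
  labels <- rep(NA, length(x))
  labels[ord] <- ifelse(keep, run_id, NA)
  labels                                  # NA marks points below level k
}
```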

An equivalent way of evaluating the connection between \(x_1\) and \(x_2\) in \(\mathbb{R}\) is to observe the density function along the interval \([x_1,x_2]\). As both \(x_1\) and \(x_2\) belong to S(k), any “valley” of the density is evidence of a lower-density region along the interval \([x_1,x_2]\), which then turns out to be disconnected. See Fig. 4 for an illustration.

Fig. 4 The section at level k identifies two connected sets associated with the density contours. A first way of seeing that \(x_1\) and \(x_2\) are disconnected is to look for lower-density observations \(x_3\) adjacent to \(x_1\) or \(x_2\). Alternatively, it is possible to look for any valley in \(\hat{f}\) along the segment \([x_1,x_2]\)

In the present paper, we consider a natural generalization of the latter formulation to the multidimensional setting. Hence, any two observations in \(\mathcal{X}\subset\mathbb{R}^{d}\) are connected if the density function, evaluated along the segment joining them, has no local minimum, that is, it does not exhibit any valley; otherwise they are disconnected. The evaluation of this condition for all pairs of sample points allows us to produce a graph, with edges joining the connected pairs.
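In its crudest form, the connectivity check for a single pair can be sketched as follows, where fhat_fun stands for a generic function returning the density estimate at an arbitrary point (an assumption made here for illustration) and the segment is scanned on a finite grid; Sect. 2.2 replaces this all-or-nothing check with a quantitative measure.

```r
## Illustrative sketch: declare (x1, x2) connected if the estimated density
## along the segment joining them shows no interior local minimum (valley).
## 'fhat_fun' is a hypothetical density-evaluation function.
is_connected <- function(x1, x2, fhat_fun, n_grid = 100) {
  t_grid <- seq(0, 1, length.out = n_grid)
  phi <- sapply(t_grid, function(t) fhat_fun(t * x1 + (1 - t) * x2))
  d_phi <- diff(phi)
  # a valley occurs where phi decreases and then increases again
  has_valley <- any(d_phi[-length(d_phi)] < 0 & d_phi[-1] > 0)
  !has_valley
}
```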

A similar idea underlies the Gslclust procedure of Stuetzle and Nugent (2010), where a weighted graph is built with edge weights given by the minimum value of the density function along the segments joining pairs of observations. The weights of the edges give a summary indication of the behaviour of the density along these segments. In the following, we go beyond this idea by proposing a more comprehensive index.

At first sight, the procedure described above mishandles some non-convex situations. An example is provided by the “clump and banana” data, introduced by Stuetzle and Nugent (2010) and illustrated in Fig. 5. Consider two observations \(x_1\) and \(x_2\) lying in the upper and the lower half of the “banana” group, respectively. Since \(x_1\) and \(x_2\) belong to the same cluster, we expect an edge to connect them. However, an evident valley appears in the density function along the segment linking \(x_1\) and \(x_2\), so that the two observations do not turn out to be directly connected. In fact, \(x_1\) and \(x_2\) can still be allocated to the same cluster, because the two observations will turn out to be connected by a sequence of edges linking pairwise-connected points.

Fig. 5 The “clump and banana” data: two or three groups are exhibited, depending on the degree of resolution

Once the graph establishing the connectedness among pairs of points has been built, clustering may proceed by faithfully following the PdfCluster approach. For this reason, we will refer to the proposed variant as PdfCluster*. The main steps of the procedure are summarized in Table 1.

Table 1 Main steps of PdfCluster*

2.2 Quantifying the amplitude of the valleys

According to the procedure described so far, a graph is built by the pairwise evaluation of the connectivity of the sample data. Connectivity of a pair of observations, in principle, is established by the lack of valleys in the density function along the segment joining them. However, some allowance must be made for sampling errors, so as to disregard valleys of negligible extent. This can be accomplished by introducing a suitable measure of the extent of a valley, and disregarding those valleys whose measure does not exceed a given small threshold.

The density function may exhibit one or more valleys along the segment joining each pair of observations. Hence, we first need to define a measure of the indentation of a single valley and then combine the measures of the detected valleys into a summary index. The measure of indentation we choose is the minimum probability mass necessary to fill the valley. Essentially, we emulate the process of pouring water over each valley until it starts to overflow, and we then measure the volume of the water poured in.

To describe the process more formally, given the pair \((x_1,x_2)\), define

$$\varphi(t) = \hat{f}\bigl(tx_1 + (1-t)x_2\bigr), \quad t \in[0, 1], $$

to be the estimated density function evaluated along the segment joining \(x_1\) and \(x_2\).

To quantify the indentation of the valleys, we introduce a sequence of auxiliary functions \(\varphi_u(t)\), t∈[0,1], indexed by u and defined recursively as follows.

  • Initial step: set u=0 and \(\varphi_u(t)\equiv\varphi(t)\).

  • Recursive step:

Consider the set of points where \(\varphi_u\) attains its absolute minimum over [0,1]. If this set comprises only one point and this point is either 0 or 1, then exit from the recursive step. Also exit from the recursive step if the set of minimum points is an interval having either 0 or 1 as an end point.

Otherwise, increment u by 1. Denote the minimum point of \(\varphi_{u-1}\) by \(t_u\), with the further specification that, if it lies on an interval where \(\varphi_{u-1}\) is constant, we take \(t_u\) to be any point of that interval. Now consider \(t'_{u}\), defined as the closest local maximum point smaller than \(t_u\). If no such point exists, take \(t'_{u}\) to be the left end point of the interval [0,1], namely 0. Similarly, define \(t''_{u}\) as the closest local maximum point greater than \(t_u\) or, if it does not exist, set \(t''_{u}=1\). Let \(\delta_{u}=\min(\varphi(t'_{u}), \varphi(t''_{u}))\) and denote by \(H_u\) the smallest interval, bounded by solutions of \(\varphi(t)=\delta_u\), that contains \(t_u\). Then the u-th auxiliary function is defined as:

$$\varphi_{u}(t)=\left \{ \begin{array}{l@{\quad}l} \delta_u& \mbox{if } t \in H_{u},\\ \varphi_{u-1}(t)& \mbox{otherwise.}\\ \end{array} \right . $$

Now, go back to the beginning of the recursive step.  □

At the end of the recursion, a sequence of functions \(\varphi_0 \le \varphi_1 \le \cdots \le \varphi_U\) is defined, where U is the number of detected minima and \(\varphi_U\) has no minima by construction. If U>0, then for each pair \((\varphi_{u-1}, \varphi_u)\) there exists an interval \(H_u\) where the inequality holds strictly. Figure 6 illustrates the updating mechanism of \(\varphi_u\).

Fig. 6 An illustration of the update mechanism of the functions \(\varphi_u\). Each panel, indexed by u, displays \(\varphi_{u-1}\) (solid line) and the main elements intervening at step u to update it

Each function \(\varphi_u\) allows us to define the measure of indentation of the u-th detected valley of \(\varphi(t)\):

$$ V_u = \frac{\int_{t \in H_u}{ (\varphi_u(t)-\varphi(t) )\, \mathrm{d}t}}{\int_0^1{\varphi_U(t) \,\mathrm{d}t}}, \quad u \in\{ 1,\ldots, U\}. $$
(1)

These U measures are then combined in a comprehensive indicator:

$$ V = \max_{u} {V_u}. $$
(2)

Finally, an edge will be set between observations \(x_1\) and \(x_2\) if V ≤ λ, for some small threshold 0≤λ<1.

Figure 7 shows an illustration of these concepts for V>0 and V=0. In practice, V is approximated by evaluating φ and \(\varphi_u\) on a finite grid of points.
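The following sketch gives one possible grid-based approximation in this spirit; note that, for simplicity, it treats each interior local minimum of φ separately, filling it up to the lower of its two flanking peaks, rather than reproducing the exact recursion defining the \(\varphi_u\), so nested valleys may be handled slightly differently.

```r
## Illustrative grid-based approximation of the valley measure V of (1)-(2):
## each interior local minimum of phi is "filled with water" up to the lower
## of its two flanking peaks; the largest relative amount of water gives V.
valley_measure <- function(phi) {
  n  <- length(phi)
  dt <- 1 / (n - 1)                                      # equispaced grid on [0, 1]
  d  <- diff(phi)
  is_min <- c(FALSE, d[-(n - 1)] < 0 & d[-1] > 0, FALSE) # interior local minima
  is_max <- c(TRUE,  d[-(n - 1)] > 0 & d[-1] < 0, TRUE)  # peaks (end points included)
  filled <- phi                                          # grid version of phi_U
  V_u <- numeric(0)
  for (i in which(is_min)) {
    left  <- max(which(is_max[1:i]))                # nearest peak on the left
    right <- i - 1 + min(which(is_max[i:n]))        # nearest peak on the right
    delta <- min(phi[left], phi[right])             # fill level delta_u
    i_lo <- i; while (i_lo > 1 && phi[i_lo - 1] < delta) i_lo <- i_lo - 1
    i_hi <- i; while (i_hi < n && phi[i_hi + 1] < delta) i_hi <- i_hi + 1
    H <- i_lo:i_hi                                  # basin H_u around the minimum
    V_u <- c(V_u, sum(delta - phi[H]) * dt)         # water poured into the valley
    filled[H] <- pmax(filled[H], delta)
  }
  if (length(V_u) == 0) return(0)
  max(V_u) / (sum(filled) * dt)                     # V as in (2)
}

## Usage: with phi evaluated along the segment (cf. is_connected() above),
## an edge is set whenever valley_measure(phi) <= lambda.
```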

Fig. 7 Evaluation of the connectivity of the two pairs \((x_1,x_2)\) and \((x_3,x_4)\) of the data illustrated in Fig. 1. The top plot shows the segments linking the pairs of observations. The middle plot shows φ (marked line) and \(\varphi_U\) (thin line). The whole coloured area represents the denominator of (1), while the shaded area is its numerator. The valley is evidence of the disconnection of \((x_1,x_2)\). The bottom plot refers to the connected pair \((x_3,x_4)\), resulting in V=0

An alternative, possibly more natural, measure accounting for the presence of several valleys would sum up the contributions of all the valleys. However, that choice could result in large values of the index when the density estimate features several small wiggles, even if it is essentially constant. Indeed, some empirical experience has shown this choice to be less effective.

Note that the described way of proceeding is consistent with the measure of the prominence of the modes implicitly provided by the mode function: given two discontinuity points p′<p″ of m(p), with m(p) constant over the interval (p′,p″), the difference p″−p′ gives an indication of the prominence of the mode associated with level p′. These ideas have a connection with the excess mass approach of Müller and Sawitzki (1991), used to test for the presence of modes in a density function.

We also mention a close connection of the described criterion with the methods for mode hunting introduced by Minnotte (1997) in the unidimensional case and extended to the multivariate setting by Burman and Polonik (2009), which assess the significance of two distinct modes by testing for the presence of antimodes along the line connecting the two modal candidates.

3 Further aspects

It is clear that clustering is driven by the density estimate, which determines the associated connected regions. PdfCluster and its proposed variant are not linked to the choice of a particular density estimator. In practice, all numerical examples of Azzalini and Torelli (2007) make use of a kernel density estimate with normal kernel.

A common formulation for the multidimensional kernel estimator (see, e.g., Scott and Sain 2005) is given by the product kernel estimator:

$$ \hat{f}(x)= \sum_{i=1}^n \frac{1}{n h_1 \cdots h_d} \prod_{j=1}^d K \biggl( \frac{x^{(j)} - x_{i}^{(j)}}{h_{j}} \biggr), $$
(3)

where \(x^{(j)}\) is the j-th component of x, the univariate kernel K is usually taken to be a nonnegative function centered at zero and integrating to one, and a different smoothing parameter \(h_j\) is chosen for each component.
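As a point of reference, estimator (3) with a Gaussian kernel can be coded directly in a few lines; the sketch below is only illustrative (kernel estimators are of course available in several R packages).

```r
## Illustrative sketch of the product kernel estimator (3) with Gaussian kernel.
product_kde <- function(x, dat, h) {
  # x: evaluation point (length d); dat: n x d data matrix; h: bandwidths h_j
  Z <- sweep(sweep(dat, 2, x), 2, h, "/")   # argument of K, up to a sign
  K <- matrix(dnorm(Z), nrow = nrow(dat))   # Gaussian kernel (symmetric in sign)
  mean(apply(K, 1, prod)) / prod(h)
}

## Example: density estimate at every observed point
# fhat <- apply(dat, 1, product_kde, dat = dat, h = h)
```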

While the choice of the kernel function is known not to have a strong impact on the estimate, selecting the smoothing parameters is more critical. However, in the present context, the issue is less influential than one might expect. Bear in mind that density estimation is merely an intermediate step towards the subsequent detection of connected level sets, and a rough indication of the location and shape of the high-density regions may be adequate. Azzalini and Torelli (2007) report that the simple rule of adopting the asymptotically optimal bandwidth under the assumption of multivariate normality produces sensible results in most cases, even if this choice is clearly non-optimal in a context where a multimodal distribution is likely to underlie the data.

For high-dimensional data, however, clustering methods based on nonparametric density estimation face an increased difficulty in recovering the clusters: due to the curse of dimensionality, partitions are less robust to the choice of the smoothing parameters and typically exhibit spurious clusters. On this issue, Scott and Sain (2005) plainly assert that reliable results from the application of kernel methods can be obtained in as many as six dimensions. Although it is clear that density-based clustering methods are not designed for data with hundreds of dimensions, our numerical experience suggests that the dimensionality of the problems that can be handled may be enlarged to some extent. The sparsity of the data generally produces empty neighborhoods, particularly in the low-density regions. In order to control the increased variability, and the consequent proliferation of small spurious clusters, a first natural device is to oversmooth the density estimate, so that only the most prominent modes are likely to be identified. This is consistent with Friedman's (1997) assertion that the optimal amount of smoothing for kernel discrimination is much larger than the one for density estimation. Clearly, the need for caution should always be borne in mind, because oversmoothing has the adverse effect of averaging away features around the modes.

Another (not necessarily alternative) remedy consists of allowing different amounts of smoothing depending on the local characteristics of the data, so that larger values of the smoothing parameters are chosen where data are scarce, to account for the excess variability, and smaller values are chosen in high-density regions. This is in line with Stuetzle and Nugent (2010), who make use of a 1-nearest-neighbour estimator. We suggest, instead, the use of an adaptive kernel estimator

$$ \hat{f}(x)= \sum_{i=1}^n \frac{1}{n h_{i,1} \cdots h_{i,d}} \prod_{j=1}^d K \biggl( \frac{x^{(j)} - x_{i}^{(j)}}{h_{i,j}} \biggr) $$
(4)

where each bandwidth is chosen as a function of the observations. Abramson (1982) and Silverman (1986) propose the use of a square root law, with

$$ h_{i, j} \propto f^{-1/2}(x_i). $$
(5)

This is the choice we adopt, as will be clarified below.

4 Numerical examples

In this section, we provide some numerical evidence of the soundness of the proposed technique. The goal of the following empirical analysis is twofold. First, we aim at testing the ability of the proposed method to reconstruct an existing clustering structure. For this reason we use data with known class labels. As a second goal, we compare our method with other clustering procedures. The choice of using a known classification to evaluate clustering might be subject to some criticism, because the data could still be heterogeneous within the given classes, or the classes might not be discriminable with respect to the observed characteristics. However, this choice at least guarantees that clustering methods based on different cluster concepts are evaluated on equal terms.

4.1 Preliminary setting

Clustering methods

The performance of the proposed method is compared with the results obtained from some alternative clustering methods: a distance-based clustering method, namely K-means, a model-based clustering method (Mclust, Fraley and Raftery 2002, 2006) and a direct density-based clustering competitor (Gslclust, Stuetzle and Nugent 2010). It is worth emphasizing that, despite the recent interest in clustering based on nonparametric density estimation, we could not find other empirical comparisons within this class of methods in the existing literature. Moreover, we compare our method with the original PdfCluster approach, whose limitations we aim at overcoming.

Density estimation

In the following examples we use a kernel density estimator with Gaussian kernel and diagonal smoothing matrix. For low to moderate dimensions (up to 5 dimensions) we adopt the asymptotically optimal choice under the assumption of multivariate normality, namely

$$ h_j = \sigma_j \biggl( \frac{4}{(d+2)n} \biggr)^{1/(d+4)}, \quad j = 1, \ldots, d, $$
(6)

where the standard deviation \(\sigma_j\) of the j-th variable is replaced by an estimate. As suggested by Azzalini and Torelli (2007), the \(h_j\) are then multiplied by a shrinkage factor of 3/4, in order to mitigate the oversmoothing caused by computing the bandwidths under the assumption of multivariate normality.

For higher dimensions, we use the adaptive estimator (4). We follow the Silverman (1986) approach (5) except that bandwidths are allowed to vary for each dimension, to account for different scales. The proportionality constant is set to 1 and a pilot estimate of f is built as in (3), with bandwidths selected as in (6).
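A small sketch of these bandwidth choices, building on the product_kde() sketch of Sect. 3, may help fix ideas; note that the per-dimension rescaling of the adaptive bandwidths adopted below (fixed bandwidths multiplied by the normalized square-root factor) is an assumption made here for illustration and may differ in detail from the implementation in the pdfCluster package.

```r
## Illustrative sketch of the bandwidth choices: fixed bandwidths (6) shrunk by
## 3/4, and adaptive bandwidths following the square root law (5) computed from
## a pilot estimate obtained with product_kde() (see the sketch in Sect. 3).
normal_reference_h <- function(dat) {
  n <- nrow(dat); d <- ncol(dat)
  apply(dat, 2, sd) * (4 / ((d + 2) * n))^(1 / (d + 4))   # equation (6)
}

adaptive_h <- function(dat, shrink = 3/4) {
  h0    <- shrink * normal_reference_h(dat)               # fixed bandwidths
  pilot <- apply(dat, 1, product_kde, dat = dat, h = h0)  # pilot estimate of f
  # square root law (5); the normalization of w is an illustrative assumption
  w <- pilot^(-1/2) / mean(pilot^(-1/2))
  outer(w, h0)                                            # n x d matrix of h_{i,j}
}

adaptive_kde <- function(x, dat, H) {
  # adaptive estimator (4); H is the n x d matrix of bandwidths h_{i,j}
  Z <- sweep(dat, 2, x) / H
  K <- matrix(dnorm(Z), nrow = nrow(dat)) / H
  mean(apply(K, 1, prod))
}
```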

Results for Gslclust refer to the use of a kernel estimator with constant bandwidth across the data and the dimensions, selected by least-squares cross-validation after sphering the data. This choice is in line with the authors’ suggestions. As an alternative, the authors suggest the use of a nearest-neighbour estimator; results are not reported here because they are essentially equivalent to those obtained with the kernel method.

Threshold parameter setting

In order to run PdfCluster* we need to set a tolerance value for the threshold λ. The results that follow refer to the use of a threshold λ=10 %, unless otherwise indicated. Although this is not necessarily the setting that always produces the best partition, it has been used throughout the paper for uniformity.

Number of groups

In both the original and the extended version of PdfCluster, the number of groups is detected by the procedure itself, as the number of modes of the estimated density.

In contrast, a suitable criterion for selecting the number of groups has to be specified for the other clustering methods. Among the several criteria which can be used in a distance-based framework, we consider the GAP statistic (Tibshirani et al. 2000). The Bayesian information criterion (BIC), computed across mixtures with different numbers of subpopulations, is used to select the number of mixture components in Mclust. Other, possibly more effective, options exist, but the BIC is certainly the most widespread. As Gslclust associates the clusters with the modes of a density function, it should be able to select the number of groups automatically, similarly to PdfCluster*. However, the use of a highly variable estimate of the density function (as suggested by the authors) always gives rise to a large number of groups, and the tree of clusters needs to be pruned. Stuetzle and Nugent (2010) consider the excess mass criterion to measure the prominence of the detected modes. Based on the sorted sequence of the excess masses, they keep the modes exceeding the smallest clear jump in that sequence and prune the remaining ones. The excess mass criterion is therefore adopted in the following numerical examples to select the number of clusters detected by Gslclust.

Further details

For the actual implementation of the examples, the computing environment R (R Development Core Team 2013) has been adopted. In particular, the PdfCluster package (Azzalini et al. 2012) implements both the proposed and the original methods. All the R functions involved have been run with their default settings, unless otherwise specified.

4.2 Beetles data

The Flea-beetles data (Lubischew 1962) include 74 measurements of 6 body characteristics from three species of beetles: Ch. concinna, Ch. heptapotamica, and Ch. heikertingeri, which constitute the groups to be identified.

Table 2 shows the clustering results. While the groups identified by Mclust are characterized by a high degree of homogeneity, the BIC incorrectly estimates their number as 4. The indication given by both Gslclust and K-means is completely reversed: their associated criteria suggest the absence of groups in the data. Incidentally, if we force both methods to produce three groups, only the partition identified by K-means matches the actual partition well. If, instead, Gslclust is forced to prune the cluster tree at three modes, it produces a partition where two of the three original species are merged and there is a spurious group.

Table 2 Flea beetles data: comparison among groupings

PdfCluster* produces a partition which coincides perfectly with the three known classes and this result is robust over a wide range of values of λ≤30 %.

Similarly to Gslclust, the original PdfCluster method fails to recognize three groups and merges two of the three species. The same partition is obtained with PdfCluster* when λ is set slightly above 30 %.

4.3 Wine data

The Wine data set (Forina et al. 1986) includes 28 chemical characteristics of 178 wines grown in the same region in Italy but produced from three different cultivars (Barolo, Grignolino, Barbera), which we aim at reconstructing. Of the original 28 variables, only 13 are commonly selected, and this is also our choice.

Results are reported in Table 3. None of the considered clustering methods is able to produce a partition which recognizes the three cultivars well. The GAP statistic suggests not splitting the data into clusters when K-means is used. The excess mass criterion, instead, gives a clear indication of partitioning the data into three groups. Nevertheless, Gslclust is not able to distinguish between Barolo and Grignolino. This result makes sense in the light of some additional information about the data: the specimens were collected over a ten-year period and those of Barbera are generally more recent than those of the other two cultivars (Forina et al. 1986). Both Mclust and PdfCluster* claim the presence of six groups instead of three. The available information does not help to understand the reasons for separating the six clusters (for instance, we do not know the year of collection of each specimen, having information on the aggregated data only). However, a suitable aggregation of the groups detected by Mclust and PdfCluster* gives rise to the three observed cultivars, with a few misclassified points left. In order to get a partition very similar to the given classes, we need to increase λ or to oversmooth the density function, as suggested in Sect. 3.

Table 3 Wine data: comparison among clustering methods

Because of the computational burden required to run PdfCluster, the comparison of the original and extended versions of the method has been performed on a reduced data set, obtained by extracting the first three principal components. The two methods produce very similar partitions, and these partitions closely match the actual classification of the wines; see Table 4. For completeness, Table 5 also lists the results obtained by running the other considered clustering methods on the reduced data: while the density-based procedures perform as well as the two versions of PdfCluster, clustering based on distance is still unsatisfactory.

Table 4 Wine data: comparison between PdfCluster and PdfCluster* on the first three principal components of the data. Results refer to λ<5 % but are similar for larger λ
Table 5 Wine data: comparison among clustering methods on the first three principal components

4.4 Olive oil data

The Olive oil data set was originally presented by Forina et al. (1983) and subsequently analyzed by various authors to illustrate classification and clustering techniques. The data represent 8 chemical measurements on n=572 specimens of olive oil produced in 9 areas of Italy: North and South Apulia, Calabria, Sicily, Sardinia coast and inland, Umbria, East and West Liguria. These areas can be naturally merged into three geographical macro-areas: South, Sardinia island, Centre-North. The clustering algorithms have been applied to reconstruct the geographical origin of the specimens. Since the raw data are of compositional nature, with components totalling 10,000, an additive log-ratio transformation has been adopted, following Azzalini and Torelli (2007).

Table 6 shows that the use of the GAP statistic leads K-means to partition the data into 4 groups. Three of them mainly correspond to the three macro-areas; the fourth one, however, includes 170 olive oils coming from all the continental areas. Both Mclust and Gslclust produce a partition into groups which are characterized by a high degree of homogeneity. Except for a few units, all the misclassified observations are allocated to the right macro-area. PdfCluster* produces a partition well matching the division into macro-areas, and the result is stable over a large range of values of λ. Although the number of misclassified observations is higher than for the other two density-based methods, it should be noticed that, while the groups produced by PdfCluster* have a geographical interpretation, this is less evident for both Mclust and Gslclust where, for instance, more than one group is associated with the Ligurian areas.

Table 6 Olive oil data: comparison among methods

Also in this case, the performance of PdfCluster and its variant has been compared on a reduced version of the original data set, formed by the first five principal components. PdfCluster essentially identifies the three macro-areas, while its extended version tends to produce some spurious clusters on the reduced data. These partitions, not reported here, are similar to the ones produced by Gslclust and Mclust on the complete data, showing a poor homogeneity of the Ligurian oils. In order to get a partition into three groups, we need to set the λ parameter above 30 % (Table 7). In that setting, the partitions produced by the two versions of PdfCluster disagree on about 10 % of the observations, all of them from the Ligurian areas. In fact, a look at the Italian geography makes the observed behaviour of the clustering methods more understandable, because the geographical distance between Liguria and Sardinia is not much larger than the distance between Liguria and Umbria. Concerning the other methods, K-means cannot reconstruct the original clusters on the first five principal components; Mclust produces a fair representation of the actual regions, while Gslclust notably detects the three macro-areas almost perfectly; see Table 8 for details.

Table 7 Olive oil data: comparison between PdfCluster and PdfCluster* on the first five principal components of data. Results refer to a value of λ set slightly over 30 %
Table 8 Olive oil data: comparison among methods on the first five principal components of data

4.5 Climate data

As a final real-data application, we consider a large data set drawn from the National Climatic Data Center (NCDC), including several dozen monthly variables such as temperature, relative humidity, precipitation, wind speed and direction, etc. Some additional variables characterizing the stations have been included, e.g. latitude, longitude and ground height. Data refer to December 2012 and have been collected from about a thousand US land stations.

Preprocessing of the data has consisted of removing records and variables with too many missing values, as well as categorical variables. Suitable power and logarithmic transformations have been applied to reduce the strong skewness of the data. The final sample includes 490 observations and 22 variables.

Unlike the previous examples, we do not compare the results of the clustering methods with a known classification, since one does not explicitly exist. However, we may get a clue about the quality of the partitions by observing how the clusters are laid out over the geography of the US. Figure 8 displays the results of PdfCluster*, Gslclust, and K-means plotted on the US map according to the spatial coordinates of the observed stations. Results of model-based clustering are not reported because Mclust detects one group only. While the three partitions differ in the number of groups, we still notice an appreciable agreement and similar boundaries. The two macro-areas identified by K-means are approximately delimited by the 100th meridian, except for the west coast, which is allocated to the eastern group. PdfCluster* detects four clusters: a large macro-area corresponding to the Great Plains and the Rocky Mountains is mainly characterized by low temperatures, the strongest winds and an arid climate. A second group, essentially covering the Great Lakes region and the north-eastern coast, shows humid continental characteristics. A third cluster, moving north from the Gulf of Mexico, presents quite high temperatures and is especially characterized by a high number of thunderstorm days; this is in agreement with the formation of hurricanes in the area. The fourth cluster mainly features high temperatures, precipitation and fog, and covers both the south-eastern and the south-western coasts. Gslclust finds eight smaller and more homogeneous groups which roughly merge into the four clusters found by PdfCluster*. Interestingly, if we force the two methods to produce the same number of clusters by suitably tuning the respective parameters, we obtain very similar partitions.

Fig. 8 Climate data: partitions produced by PdfCluster* (left), Gslclust (centre), K-means (right)

It is worth noting that, considering the intrinsic limitations of nonparametric density-based clustering due to the effects of the curse of dimensionality, obtaining a sensible clustering with more than twenty variables and a few hundred observations is a satisfactory outcome.

The considered methods and the Delaunay-based version of PdfCluster have also been applied to the first five principal components of the data. Except for minor differences, the partitions detected (not reported here for brevity) are not very dissimilar from the results of PdfCluster* on the original data.

4.6 Simulated data

4.6.1 Simplex data

As a first synthetic example, we consider the sample of size 570 introduced by Stuetzle and Nugent (2010), consisting of seven spherical Gaussian clusters with the same standard deviation σ=0.25, centered at the vertices of the unit simplex in 7 dimensions.

Due to the spherical shape of the clusters, K-means and Mclust get a head start over Gslclust and both versions of PdfCluster, and the partitions they produce coincide with the original groups almost perfectly. However, Table 9 shows that the clustering methods based on nonparametric density estimation also perform accurately. The results reported in Table 10 refer to the use of the two versions of PdfCluster on the first three principal components. Because of the geometry of this data set, reducing the dimensionality by means of principal components hides some groups and, hence, worsens the performance of the clustering methods. However, the partitions produced by the two methods overlap almost perfectly, and the same occurs when a different number of principal components is used. Comparable results are obtained by running Mclust on the reduced data. Lastly, Gslclust merges two further clusters and the distance-based criterion identifies two groups only; see Table 11.

Table 9 Simplex data: comparison among partitions
Table 10 Simplex data: comparison between the two versions of PdfCluster. Results refer to the use of the first 3 principal components of the data and λ=5 %
Table 11 Simplex data: comparison among clustering methods on the first three principal components of the data

4.6.2 Unimodal data

The application of clustering methods to a set of data usually relies on a conjecture about the existence of interesting partitions of the data into groups. However, prior information of this sort is often not available. It is therefore of interest that a clustering method be able not only to detect homogeneous groups in the data, but also to recognize when the data show no evidence of the presence of groups. This motivates the choice of showing the behavior of the considered clustering procedures when the data do not exhibit any clustering structure. Samples of size n=100 have been generated from the following distributions:

  (a) a bidimensional t-distribution with skewness introduced according to Sahu et al. (2003) and parameters set as in Fig. 1c of De la Cruz (2008), so as to obtain level sets with non-convex shapes;

  (b) d independent negative exponential distributions with means equal to one and d=5,10.

The use of the excess mass criterion required by Gslclust does not allow a fully automatic selection of the number of groups, and the choice is partially subjective. In order to run the simulations, we have therefore pruned the cluster tree at the largest gap greater than one in the sorted sequence of the measures of the modes.

Table 12 lists the distribution of the number of clusters detected over the 100 samples. Results vary widely across the clustering methods. The GAP statistic does not perform well, mainly due to the use of a random uniform hypercube as the reference distribution for testing the existence of a single cluster. As Mclust is biased toward the identification of elliptical clusters, it fails in the considered examples, which instead present strong skewness. Of course, as the skewness of these examples can easily be detected by a visual exploration of the data, a more targeted parametric distribution could have been assumed for the mixture components. For this reason, we also ran simulations fitting finite mixtures of skew-normal variables, making use of the R package mixsmsn (Prates et al. 2012); their use gives rise to much better results, especially in 10 dimensions. On the other hand, the use of more flexible distributions, in general, alleviates the problem of identifying clusters of arbitrary shape, but may still not be sufficient. In addition, the multivariate nature of the data usually prevents one from making the optimal assumption about the distribution of the mixture components.

Table 12 Distribution of number of clusters in 100 samples of size n=100 from d-dimensional unimodal distributions

Considering that no prior information about the nature of the data is used, the clustering procedures based on mode recognition have performed satisfactorily.

5 Discussion

The aim of this work was to overcome the computational limitations of the Delaunay triangulation in finding connected sets when clustering is performed by the PdfCluster method. We have proposed a procedure which generalizes the ideas underlying the univariate approach for identifying connected sets, thus moving the focus from a space of arbitrary dimension to one dimension only and yielding benefits in both computation and visualization.

Although our experience suggests that PdfCluster* tends to identify more clusters than PdfCluster, it turns out to be a fair approximation of the original method, as a sensible setting of the parameter λ usually produces partitions which closely match the ones detected by the latter. The number of operations required to detect connected sets in PdfCluster grows exponentially with the dimension of the problem. The proposed procedure, on the other hand, has a computational complexity which is quadratic in the sample size, and is thus comparable with distance-based methods. Hence, the two PdfCluster methods may be thought of as two complementary versions of the same approach, the former working in small to moderate dimensions and possibly large sample sizes, the latter in virtually any dimension. The ability of PdfCluster* to work in high-dimensional spaces is paid for by a memory usage which increases quadratically with the sample size, as occurs with hierarchical distance-based methods.

The good performance should not distract from the caution required by some aspects of the proposed method. Firstly, the extension of PdfCluster requires the introduction of the additional parameter λ, whose selection is not automatic. As it is a tolerance on the existence of a valley in the density function, λ should take small values, and the empirical evidence has shown that sensible results are obtained when λ is set between 5 and 30 %. Usually λ=10 % is a reasonable compromise for moderate to high dimensions, while a lower λ is sufficient in low dimensions, when the density estimate is still quite accurate. When groups are well separated, the partitions have proven to be very stable over a wide range of values of the parameter. In the presence of shallow valleys separating groups, increasing the value of λ entails the aggregation of clusters. In this sense, selecting the value of λ plays a role analogous to pruning the cluster tree. While it is clear that such arbitrariness in setting the value of the parameter is not desirable, it should be borne in mind that some degree of subjectiveness is unavoidable in clustering real data. Because of the arbitrariness of the concept of cluster itself, cluster analysis is intrinsically an ill-posed problem. See again, for instance, the “clump and banana” data in Fig. 5. At first sight, the data exhibit two well-separated groups. However, a closer inspection would suggest the existence of three groups, two of them being the upper and lower halves of the “banana” cluster. This is also the indication of PdfCluster*, which produces three groups for λ<10 % and two groups for larger values. In some sense, we might say that the proposed method emulates the human eye, providing a different cluster resolution for different values of the threshold λ.

A second aspect to be cautious about is related to density estimation. As previously observed, the larger the data dimension, the more troublesome the use of nonparametric methods. Numerical evidence has shown that acceptable results may still be obtained in moderately high dimensions. However, this does not justify the uncritical application of the method, which should first be driven by some knowledge of the phenomenon under study. Furthermore, dimension reduction techniques and methods aimed at revealing features of multivariate data are often advisable. Some recent examples in the clustering framework are provided by Menardi and Torelli (2012) and Dazard and Rao (2010).