Exploring the expressiveness of abstract metabolic networks

Irene García; Bessem Chouaia; Mercè Llabrés; Marta Simeoni

doi:10.1371/journal.pone.0281047

Abstract

Metabolism is characterised by chemical reactions linked to each other, creating a complex network structure. The whole metabolic network is divided into pathways of chemical reactions, such that every pathway is a metabolic function. A simplified representation of metabolism, which we call an abstract metabolic network, is a graph in which metabolic pathways are nodes and there is an edge between two nodes if their corresponding pathways share one or more compounds. The abstract metabolic network of a given organism results in a small network that requires low computational power to be analysed and makes it a suitable model to perform a large-scale comparison of organisms’ metabolism. To explore the potentials and limits of such a basic representation, we considered a comprehensive set of KEGG organisms, represented through their abstract metabolic network. We performed pairwise comparisons using graph kernel methods and analyse the results through exploratory data analysis and machine learning techniques. The results show that abstract metabolic networks discriminate macro evolutionary events, indicating that they are expressive enough to capture key steps in metabolism evolution.

Citation: García I, Chouaia B, Llabrés M, Simeoni M (2023) Exploring the expressiveness of abstract metabolic networks. PLoS ONE 18(2): e0281047. https://doi.org/10.1371/journal.pone.0281047

Editor: Paul Gladstone Livingstone, Cardiff’s Metropolitan University: Cardiff Metropolitan University, UNITED KINGDOM

Received: September 26, 2022; Accepted: January 16, 2023; Published: February 9, 2023

Copyright: © 2023 García et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The original data underlying the results presented in the study are available from the KEGG database (https://www.kegg.jp/). Derived data and scripts are available at https://osf.io/3kmp6/?view_only=b04d419b7bea45208938302522738d66.

Funding: This work was partially supported by DAIS - Ca’ Foscari University of Venice within the IRIDE program; Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF) through its support to the project PGC2018-096956-B-C43; the grant PID2021-126114NB-C44 funded by MCIN/AEI/ 10.13039/501100011033 and by "ERDF A way of making Europe.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Metabolism is the organism’s machinery that breaks down complex organic molecules to produce the energy and building blocks needed for the organism’s normal functioning. It comprises all chemical and physical processes that occur within the cells of living organisms and that allow for maintaining life. These processes interact with one another, creating a complex network structure. Each process, in turn, is made up of a pathway of chemical or physical reactions, and all the pathways together form the whole metabolic network. Even though all free-living organisms perform essential metabolic functions (e.g., glycolysis, ATP synthesis, nucleic acid synthesis and degradation), the underlying pathways and chemical reactions that carry them out may differ. The variety of habitats led organisms to diversity and adaptation and, in the process, to evolve their metabolic pathways to carry out similar metabolic functions under different ecological niches [1]. Metabolism is shaped not only by the cell’s genome but also by environmental conditions (e.g., nutrient availability, temperature, pH). A particular case is the metabolism of organisms in a mutualistic relationship, where one organism produces what is needed by the other and vice versa [2–5]. Hence, metabolism is not a static and preserved network during the organism’s life history, but a dynamic network that has evolved to adapt to different conditions for the organism’s survival. Consequently, two pivotal questions that need to be addressed arise: how did metabolism and, by extension, metabolic pathways evolve along the tree of life to adapt to changing conditions and habitats? And, what is the minimal metabolic information that makes it possible to recover and reconstruct metabolisms’ evolution?

Literature on metabolomics in the last two decades ranges from the analysis of single pathways [6–9], to the comparative analysis of a set of pathways [10], together with the metabolic networks dynamics [11, 12]. Several approaches analyse topological features of metabolic networks, such as the networks connectivity, centrality, strongly connected components or node degree [13–15]. Moreover, various approaches to metabolic network reconstruction, analysis and comparison have been proposed; see [16–19] for surveys on different approaches and tools.

Metabolism reconstruction and comparison can be automated thanks to the knowledge stored in metabolic databases such as BioCyc [20], BioModels [21] and KEGG (Kyoto Encyclopedia of Genes and Genomes) [22–24]. However, metabolism reconstruction generally requires human intervention since data repositories can be incomplete, heterogeneous and incoherent. Furthermore, the comparison of metabolic networks is computationally challenging due to the huge number of chemical reactions involved in metabolism. A crucial choice that greatly influences metabolism reconstruction and comparison is how a metabolic network is represented. On one hand, the more faithful the representation is, the more reliable the reconstructed metabolism and the comparison results will be. On the other hand, a fine-grain representation demands high computational costs. The approaches in the literature usually look for a compromise between representation accuracy and computational costs.

In this paper, we are interested in exploring the potential of a simplified representation of the metabolism as a graph, where nodes are metabolic pathways, and there is an edge between two nodes if their corresponding pathways share one or more compounds. We call it an abstract metabolic network (AMN, for short) since it does not include any information about the single biochemical reactions within each pathway. AMNs are a minimal description of the metabolism that is attractive for two main reasons. The first one is that they are small size graphs compared to other graph-based representations of the metabolism. A mean size metabolic network contains about one thousand reactions, which means having one thousand nodes in the corresponding graph, while the mean number of metabolic pathways is around a hundred, which becomes a hundred nodes in the corresponding AMN. This makes AMNs suitable for large-scale metabolism analysis, as they require a much lower computational effort. The second reason AMNs are interesting is that they allow for metabolism comparison from a novel perspective that can yield new insights in the field. Since AMNs do not take into account the biochemical reactions information, metabolism comparison through AMNs means comparing the topological organisation of the metabolism into functional units, which is rather different than performing the comparison through a detailed reaction-based representation. The effectiveness of AMNs needs to be investigated. However, while it seems reasonable to employ AMNs to verify whether the topological organisation of metabolic pathways played a role in evolution, their use for fine-grain metabolism comparison appears unsuitable.

Similar representations of the metabolism have already been introduced in the literature. In [25], a metabolic network is represented at two distinct levels: a structural level, which corresponds to AMNs, and a functional level, which represents the functional role of each pathway through their reactions. In [26, 27], the authors define the so-called NIP (Network of Interacting Pathways), which are similar to AMNs. However, unlike AMNs, NIP also includes currency compounds (except water) when considering the shared compounds between two pathways. Moreover, NIP edges are weighted with the number of shared compounds, so more information is considered w.r.t AMNs.

The goal of this paper is to explore the expressiveness of AMNs by performing an automatic large-scale metabolism analysis to test whether AMNs are able to discern well-known taxonomic groups. A positive answer would mean that AMNs allow for a first meaningful classification that requires low computational effort. It would also emphasise that evolutionary pressures affected not only genes and protein sequences, but also the topological organisation of the functional units in the cell.

To accomplish our goal, we consider a comprehensive set of organisms whose metabolic data is stored in the KEGG database, an extensive, reliable and widely used resource explicitly designed to present data in a standardised way, a key feature in our approach. We build the AMNs of the KEGG organisms and compare them in pairs by using graph kernels (see e.g. [28] for a survey). Comparing graphs is a difficult task, and graph kernels can be seen as a means to work around the complexity issue since they allow for a workable, although not exact, comparison. They have been widely used in a variety of contexts, and some approaches in the literature apply them to represent and compare metabolic networks, see e.g. [29–31]. Here we explore four different kernels: Vertex Histogram (VH), Shortest Path (SP), Weisfeiler-Lehman (WL) and Pyramid Match (PM). Each of them bases the comparison on specific topological characteristics: VH considers nodes only, SP looks at the shortest paths, WL is based on subtrees comparison, and PM considers the whole graph structure. The comparison results are analysed through exploratory data analysis and machine learning methods. We performed three main tests: the first involved the entire dataset, made of 7141 KEGG organisms. The second focused on Eukaryotes’ subset and also explored each eukaryotic kingdom separately. The third experiment focused on Prokaryotes.

The results demonstrate that AMNs are expressive enough to recognise key steps in metabolism evolution. The four considered graph kernels show similar patterns at higher taxonomical level, i.e. kingdoms, thus suggesting that the signals from the metabolisms’ comparison are clear and robust. In particular, AMNs are able to distinguish domains and kingdoms, although there is no clear separation between Bacteria and Archaea. Within Eukaryotes, we observe a good separation among kingdoms and within kingdoms, further evolutionary signals can be observed.

In addition to these results, the analysis highlights a new paraphyletic clade that we define, for the purpose of this study, as the Protophotophytes. They are the photosynthetic eukaryotes (macro and micro) whose photosynthetic capabilities derive from acquiring photosynthetic bacteria during the primary endosymbiotic event. The clade includes plants, green algae, and red algae.

Methods

This section illustrates the details of our methodology to automatically derive AMNs from KEGG metabolic data, compare the metabolisms of different organisms using graph kernels, and analyse the comparison results. Our analytical workflow is composed of three main steps: representation, comparison and analysis. The overall view of the approach is described in Fig 1.

Download:

Fig 1. Overall view of the proposed methodology, composed by three steps: Representation, comparison and analysis.

https://doi.org/10.1371/journal.pone.0281047.g001

In the representation step, we retrieve from KEGG the metabolic pathways of the considered organisms and build their AMN. The latter is defined as a graph where each node represents a metabolic pathway, and each edge between two nodes indicates that they share one or more metabolites. Nodes are labelled with the KEGG’s pathway identifiers, and edges are not labelled. This defines a very high-level and abstract view of the metabolism that does not include any information about the biochemical reactions within each pathway.

In the comparison step, the obtained graphs are pairwise compared using graph kernels. Each kernel produces a similarity matrix containing the comparison results. We explored the use of four graph kernels: Vertex Histogram (VH), Shortest Path (SP), Weisfeiler-Lehman (WL) and Pyramid Match (PM). Each of these kernels bases the comparison on specific topological characteristics, but all of them include nodes labels among the features taken into consideration. This means that all kernels evaluate, among the various aspects, the similarity of the sets of pathways present in the two graphs.

In the analysis step, we employ exploratory data analysis techniques as well as machine learning methods to analyse the similarity matrices obtained by the kernels. Exploratory data analysis comprises visualisation techniques such as Multi-Dimensional Scaling (MDS) and simple heatmaps to enhance a first qualitative view of the similarity matrices, while clustering techniques, such as k-means and hierarchical clustering, are used to classify the considered organisms and compare the obtained clusters with the taxonomic groups mention. In some experiments, we also used a Support Vector Machine (SVM) as a supervised prediction method to further support the clustering results.

We proceed now by detailing the techniques adopted for the representation, comparison and analysis steps.

Metabolism representation: Building abstract metabolic networks from KEGG data

Our metabolism representation is based on the metabolic data stored in the KEGG database [22–24]. Data of the 7141 considered organisms was downloaded at the end of December 2021. The complete list of organisms can be found in the Supporting information, see S1 File.

We used KEGG as the sole source of metabolic data. Our approach strictly depends on the data representation and the knowledge available in KEGG, including data incompleteness, inconsistency and biases that could reflect negatively on our comparison method and the subsequent analyses. It is worth highlighting that biases exist in all metabolic databases and cannot be avoided while performing large-scale automated processes. Nonetheless, we can count on the fact that KEGG is a widely known database, whose content is constantly updated based on new knowledge.

The metabolic activity in KEGG is divided into various categories. Each category is then composed of various metabolic pathways. In order to consider the whole metabolic network, all categories and all the corresponding pathways are contemplated. The list of categories and corresponding pathways is available in the Supporting information S2 File.

KEGG associates to each metabolic function a unique reference pathway map that is not organism specific and contains integrated information from different organisms. A reference map can be expanded into organisms-specific pathways by combining the gene information in individual genomes. This representation of pathways plays an important role in avoiding incoherence in our metabolism representation and in automatising the whole process.

We represent the metabolic network of a given organism through a graph called abstract metabolic network (AMN), whose topology reflects the metabolic pathways stored in KEGG for that organism and their interconnections. Specifically, an AMN is an undirected graph in which each node identifies a metabolic pathway and an edge between two nodes indicates that the two corresponding pathways share one or more compounds among the ones defining their reactions network. Nodes are labelled with the KEGG pathway identifiers and edges are not labelled. Note that a shared compound C between two pathways may represent different situations:

C is produced by one pathway and consumed by the other;
C is a compound used in the shared part of two overlapping pathways;
C is a compound present in both pathways, even though they express unrelated functions or work in different environments or conditions.

Ubiquitous compounds such as H₂O, phosphate, ATP and ADP are not considered, since, though they are needed and assumed to be present everywhere, they are not “representative” compounds of any pathway and their consideration would only add noise to our representation.

Note that all AMNs share the same set of node labels, namely, the set of KEGG’s pathways identifiers. Moreover, in each AMN all labels have at most one occurrence, since there is at most one node for each pathway.

In this representation, all the reactions involved in the various pathway are omitted, and just the minimal information about the presence of pathways and their interconnections is taken into consideration. This clearly results in an abstract and coarse-grain view of the metabolism.

KEGG supplies two related representations for each pathway in its repository: a graphical representation (pathway map), showing the pathway content, and a textual one written in an XML format, a KGML file, where KGML stands for KEGG Markup Language. Such a file contains the information represented in the corresponding map.

KGML files only contain reactions information for pathways that include a chemical network; we call them chemical pathways. If a chemical pathway is present in the metabolism of an organism, its corresponding KGML file exists and contains one or more chemical reactions. The pathways based on physical mechanisms, for instance membrane pathways, do not contain information on reactions; we call them physical pathways. If a physical pathway is present in the metabolism of an organism, its corresponding KGML file exists but do not contain any chemical reactions.

To automatically reconstruct the metabolism of a specific organism as an AMN, it is necessary to download the KGML files of all pathways of the organism through the public KEGG’s APIs [32], and parse each file to extract the relevant information for our representation. The resulting graph is then stored locally. We implemented a specific command-line Java program that performs the entire job.

Comparison through graph kernels

Graph kernels can be intuitively understood as functions measuring the similarity of pairs of graphs. Given two graphs in input, a graph kernel calculates a real number that measures their similarity according to some graph’s features. More in general, given a set of n graphs in input, the kernel’s output is an n × n similarity matrix where each cell (i, j) contains the result of the pairwise comparison between graphs i and j. In our case the matrix is normalised, that is, all similarity values are real numbers in the interval [0, 1], where 0 and 1 indicate the minimum and maximum similarity values, respectively. The matrix is symmetric and the main diagonal contains only 1 values, each cell being the comparison of a graph with itself.

We used four different kernels to compare the abstract representation of metabolisms, all implemented by the GraKel Python library [33]. Since nodes in AMNs are labelled with the KEGG pathways identifier, we decided to exploit such information in the comparison. For this reason, the selected kernels include node labels among the features taken into consideration for the comparison.

A formal introduction on graph kernels can be found, e.g., in [28, 34–38]. Here follows a brief description of the four kernels employed in this paper, while in the Supporting Information we provide their detailed mathematical formalisation, see S3 File.

Vertex Histogram: This is the simplest kernel. It compares node-labelled graphs and it only considers the node labels of the graphs to compare. Given a set of node labels , the vertex label histogram of a graph counts, for each label , the number of nodes in the graph that are labelled by i. In the particular case of a labelled graph being the AMN of a given organism, every node has only one label and different nodes have different labels. Therefore, this kernel only considers the presence or absence of pathways (node labels) in the two organisms under comparison and sum up the number of common pathways. Since only nodes and their labels, and not edges, are considered, the way metabolites are exchanged between pathways does not affect the comparison result.
Shortest-Path: This kernel compares graphs in terms of their shortest paths. Intuitively, this means measuring how easy or difficult it is to connect, through shared compounds, couples of pathways in the two AMNs. It first computes the shortest-paths graph of every input graph, that is, a weighted graph with the same nodes as the original graph and with weighted edges corresponding to the shortest paths between each pair of nodes. The comparison is then performed by considering each pair of shortest paths in two given graphs and comparing their length and the labels of their endpoint nodes. The resulting scores are summed up to form the final similarity score.
Weisfeiler-Lehman (WL) subtree kernel: Roughly speaking, this kernel compares the subtrees rooted on each node of the two graphs under consideration, taking into account also their node labels: the more the subtrees structures and labels coincide, the higher is their similarity. The fundamental idea of the Weisfeiler-Lehman algorithm is to replace the label of each node by a multiset of labels consisting of the original label of the node and the sorted set of labels of its neighbours. The resultant multiset is then compressed into a new short label, which reflects the knowledge of the node and its neighbourhood. This relabelling process is then repeated for h iterations. By performing this procedure simultaneously on all input graphs, it follows that two nodes from different graphs will get identical new labels if and only if they have identical multiset labels. The kernel function, in this case, compares the node labels of the graphs resulting after each iteration and summarises the comparison with a real number. It can be shown that this is equivalent to comparing the number of shared subtrees between the two input graphs (the kernel considers all subtrees up to height h).
Pyramid Match: Intuitively, this kernel measures the similarity of the topological features (e.g. connectivity) of the nodes with the same label in the two compared graphs. Described at a high level, the Pyramid Match graph kernel first embeds the graph nodes into a vector space by considering the eigenvectors of the k largest in magnitude eigenvalues of the adjacency matrix of the graph. Thus, the nodes become points of a k dimensional hypercube. Next, to assess the similarity of two sets of nodes, the kernel maps these points into multi-resolution histograms and compares them with a weighted histogram intersection function. In the case of labelled graphs, the kernel restricts the matchings to occur only between nodes that share the same labels.

To conclude, we remark that the great advantage of graph kernels is that they provide a computationally feasible way to compare graphs. We exploit this characteristic throughout the paper to perform large-scale comparison of thousands of AMNs in a few minutes. However, we also want to point out that graph kernels do not provide insights on the agreements or disagreements of the input graphs, i.e, they do not allow for recovering the nodes and edges of the input graphs that are responsible for the obtained similarity result.

Data visualisation with heatmaps and multi-dimensional scaling

A heatmap allows for a visual rendering of the similarity matrix produced by graph kernel. Each cell (i, j) in the heatmap shows the similarity value between the i-th and j-th AMNs, colour-coded so that darker colours correspond to a high degree of similarity (from dark blue to yellow). In our case the heatmaps are symmetric and cells in the main diagonal always show the darkest colour, resulting from comparing an organism with itself. As it is a challenge to annotate rows and columns with the KEGG identifier of each organism, a detailed view of the data is difficult to obtain. We used coloured bars on rows and columns to indicate the various taxonomic groups of organisms to ease this difficulty.

A different technique is Multi-dimensional Scaling (MDS), a classic multivariate data analysis technique that allows for obtaining a low-dimensional representation of the observed similarities [39]. Given a normalised similarity matrix Q produced by a graph kernel, we implemented a metric MDS by transforming Q into a distance matrix with elements (1) where q_ij is the element in row i and column j in Q and q_ii = q_jj = 1.

The objective of the MDS is to represent the observed distances using a set of variables y₁, …, y_k, where k < n, such that the Euclidean distances between the coordinates of the elements for these variables are close to the original distances. In this way, the graphical representation in k dimensions will faithfully reproduce the observed structure. We executed the metric MDS using the R package [40], a language and environment for statistical computing.

Clustering techniques

We used clustering techniques on the similarity matrices produced by graph kernels to identify patterns or groups of similar organisms. In particular, we used the R package to perform partitioning clustering methods, such as k-means, and agglomerative hierarchical clustering [41].

K-means clustering divides the dataset into k subsets, where k is pre-specified. The algorithm finds the best k groups, the best being those where the within sum of square (WSS) is the smallest possible and where the sum of squares between clusters is the largest. Each cluster is represented by the mean of the data points belonging to the cluster itself. We specified the number of clusters both according to the phylogeny and using the so-called Elbow method [39]. The Elbow method consists in plotting the value of the WSS produced by different values of k. The bend (elbow) location in the plot is generally considered an indicator of the appropriate number of clusters.

Agglomerative hierarchical clustering is an alternative approach that does not require specifying the number of clusters to be generated. We calculated the distance between hierarchical clusters with complete linkage: d(C, C′) = max{d(a, b)|a ∈ C, b ∈ C′}.

The result of hierarchical clustering is a tree-based representation of the objects; the visualisation is shown through a graph known as a dendrogram.

Support Vector Machine

Support Vector Machine (SVM) is a well-known supervised learning method. The basic configuration works with two classes and considers linearly separable problems: data points that a hyperplane can separate. Considering data as n-dimensional labelled points, SVM searches the hyperplane separating points of different classes, maximising the distance between the boundary and data points. We use SVM as implemented in the Scikit learn library [42].

To evaluate the SVM prediction ability, we use cross-validation. Roughly speaking, given a dataset of known data (the training set), k-fold cross-validation divides the dataset into k partitions and, for k times, one partition at a time is used as a validation set, while the rest of the dataset remains as the training set. Prediction indexes are measured at each iteration and, in the end, the mean value of such measures will be the final prediction results. A modified version called stratified k-fold cross-validation allows for dealing with imbalanced datasets by preserving the imbalanced class distribution in each fold, thereby ensuring that each class is suitably represented. In this work we use stratified 5-fold cross validation.

Finally, we use balanced accuracy as a measure for prediction model evaluation, since it avoids inflated performance estimates on imbalanced datasets. It is defined as the fraction of the correct predictions, where each sample is weighted according to the inverse prevalence of its true class. Its values range between 0 and 1, with the best value 1. In our tests, however, we report the value as a percentage.

Phylogenetic tree comparison

For the phylogenetic reconstruction of the Pseudomonas genus, the 121 complete genomes present in KEGG were downloaded from the NCBI database. Open reading frames (ORFs) were extracted, and all the nontruncated proteins longer than 50 residues were kept for the following analysis. Orthologs present in a single copy in all the genomes were then selected using OrthoFinder (ver 2.3.3) [43]. Three hundred and thirty proteins were kept for phylogenetic analysis. The amino acid sequences of the coding sequences (CDS) belonging to each ortholog family were aligned using MUSCLE (ver 3.8.31) [44]. For each of the aligned CDS, an evolutionary model was predicted using PartitionFinder 2 (ver 2.1.1) [45]; all the protein sequences were then concatenated and their phylogenetic tree constructed with iQtree 2 (ver 2.1.2) [46] using a partitioned Maximum Likelihood model that takes into account the evolutionary model predicted for each of the CDS. The phylogenetic trees were tested with 1,000 ultrafast bootstraps [47].

The distance matrices generated during the previous analyses were used to generate cladograms using the hierarchical clustering and converted to a phylogenetic object to be compared to the phylogenetic tree using the R package ape (ver 5.5) [40, 48]. To make it easier to visually compare the phylogenetic tree and the clustering based on the generated distance matrices, the connecting lines were coloured. The colours were assigned based on the position of the clusters within the phylogenetic tree. In particular, blue was assigned to the deepest branching cluster followed, in order, by turquoise, green, yellow, orange and finally red, which was assigned to the most recent branching cluster.

Results

This section presents the results of comparing the metabolisms of different species, represented as AMNs, using graph kernels. The various experiments were carried out on the entire dataset as well as on different subsets of organisms. In particular, in the first experiment, we compared the metabolisms of the entire dataset, consisting of 7141 organisms, 687 Eukaryotes and 6454 Prokaryotes, stored in KEGG. In the second experiment, we considered all the Eukaryotes and later, we also focused on each kingdom separately: Animals, Plants and, by extension, Protophotophytes, Fungi and Protists. The third test focused on Prokaryotes.

The larger experiment, considering the whole dataset, ran in less than 86 seconds for VH and WL kernels, less than 7 minutes for SP and about 34 minutes for PM on a standard laptop, which reinforces the argument that AMNs and graph kernels are effective, from a computational point of view, to perform comparative analysis of large metabolomic datasets. The smaller experiments, i.e. Animals, Protophotophytes, Fungi and Protists, ran in a few seconds for all kernels.

First experiment: Whole dataset analysis

This experiment considers the metabolism of 7141 organisms stored in the KEGG database. The set comprises 370 Animals, 127 Protophotophytes, 138 Fungi, 52 Protists, 6115 Bacteria and 339 Archaea. The AMNs of all organisms were pairwise compared using the VH, SP, WL and PM graph kernels.

Fig 2 shows the MDS plot of the VH similarity matrix. Looking at the plot, it is evident that the VH graph kernel is able to separate Eukaryotes and Prokaryotes. Within Eukaryotes, the various kingdoms can also be distinguished and Animals are clearly separated from the rest of the Eukaryotes. Instead, within Prokaryotes, Bacteria and Archaea cannot be clearly distinguished. A pattern similar to the one shown in Fig 2 can also be observed in the similarity matrices of the SP, WL and PM graph kernels, whose heatmaps and MDS plots can be found in the Supporting Information, see S4 File. However, VH for this experiment shows the best separation between domains and between Animals and the rest of the Eukaryotes kingdoms, suggesting that the simple comparison performed by the kernel gives already a clear signal at this level.

Download:

Fig 2. MDS plot of the VH similarity matrix of the whole dataset at the kingdom level.

https://doi.org/10.1371/journal.pone.0281047.g002

We also applied k-means clustering to the MDS data. The MDS transformation allows for a faithful representation of the original data and, at the same time, reduces the noise of the data themselves. Table 1 shows the 6-means clustering results (6 being the different kingdoms) obtained from the MDS data. Notice that all kernels separate Eukaryotes and Prokaryotes. In particular, Eukaryotes fall into two clusters, one homogeneous cluster for Animals and the other for the remaining kingdoms (up to some outliers in Protists or Fungi). At the same time, Prokaryotes are divided into the remaining clusters without a clear separation between Bacteria and Archaea. Further details on the organisms belonging to each cluster can be found in the Supporting information, see S4 File.

Download:

Table 1. Whole dataset experiment, kingdom level: Clustering results for kernels VH, SP, WL and PM starting from the MDS data.

https://doi.org/10.1371/journal.pone.0281047.t001

However, at the kingdom level, only Animals are clearly separated, meaning that, within this wide dataset of organisms, their AMNs are significantly different w.r.t. to the others. It is also interesting to observe that the WL kernel is the only kernel able to cluster almost all Archaea within one group (cluster 1).

The SVM prediction analysis applied to the kernel data, see Table 2, improves the clustering results. Looking on the left side of the table (Discerning Eukaryotes/Prokaryotes), the high values of the balanced accuracy, above 98% for all kernels, confirms their ability in separating organisms based on the presence or absence of a nucleus. Furthermore, looking at column three of Table 2 (Discerning Kingdoms), it turns out that SVM analysis separated the different organisms also at the kingdom level, accurately assigning each organism to its kingdom, being the evaluation measure above 97% in all cases.

Download:

Table 2. Whole dataset test: Evaluation of the SVM results for discerning Eukaryotes/Prokaryotes and Kingdoms.

https://doi.org/10.1371/journal.pone.0281047.t002

Second experiment: Eukaryotes analysis

In this section, we present the analyses concerning the Eukaryotes domain. We first consider all the 687 Eukaryotes together and test whether AMNs, compared through the VH, SP, WL and PM graph kernels, allow for discerning the various kingdoms: Animals, Plants (more precisely Protophotophytes), Fungi and Protists. Subsequently, we consider each kingdom separately and evaluate the extent to which they can differentiate the organisms based on their respective phylum or class. We do not perform SVM prediction for these lower taxonomy level tests since some classes may contain a few organisms (sometimes just one), rendering the cross-validation process unusable.

Discerning the eukaryotic kingdoms.

In this experiment, we consider all the Eukaryotes, represented through their AMNs, and use the VH, SP, WL and PM graph kernels to test if the Eukaryotes kingdoms can be distinguished. Fig 3 shows the heatmaps obtained by the four kernels for this test. We remind that, in the heatmaps colours range from dark blue, meaning high similarity, to yellow, indicating low similarity. The coloured bars beside rows and columns show the different kingdoms: Animals in red, Protophotophytes in green, Fungi in orange and Protists in black.

Download:

Fig 3. Heatmaps of the similarity matrices of KEGG Eukaryotes at the kingdom level: (a) Vertex Histogram; (b) Shortest Path; (c) Weisfeiler-Lehman; (d) Pyramid Match.

https://doi.org/10.1371/journal.pone.0281047.g003

We observe a good separation among Animals, Protophotophytes, and Fungi in all heatmaps. Protists are apart, but they are generally closer to Fungi than to the rest of Eukaryotes. Also, in all heatmaps, Animals are well separated from the other kingdoms and show a further internal pattern that deserves further analysis. It is also worth noticing that all kernels highlight a pronounced difference between Animals and Protophotophytes. Animals are closer to Fungi than Protophotophytes, and, in turn, Protophotophytes are closer to Fungi than to Animals. The heatmaps view of the similarity matrices is also confirmed by the MDS plots that can be found in the Supporting Information, see S5 File.

Looking again at the heatmaps in Fig 3 we can also observe that the VH’s heatmap offers a more definite and compact view of the various taxonomic groups, while the other three kernels show higher intra- and intergroup variability. For instance, the internal pattern within the Animal kingdom is less evident in VH than in the other kernels. Such higher variability indicates a major ability in capturing the differences, which may also result in noise in identifying and separating the four kingdoms. Clearly, the different abilities of the four kernels are strictly related to the different features they involve in the comparison.

We applied a 4-means clustering analysis to the kernels results, four being the different kingdoms of the organisms considered in this test. Table 3 shows how the organisms of the four kingdoms are distributed into four clusters. Details on the organisms belonging to each cluster can be found in the Supporting Information, see S5 File.

Download:

Table 3. Eukaryotes test, kingdom level: Distribution of Eukaryotes into four groups for every graph kernel.

https://doi.org/10.1371/journal.pone.0281047.t003

Looking at Table 3 we can observe that Animals form homogeneous groups in all kernels, one cluster for VH, SP and PM (up to a few outliers), and two clusters for WL, meaning that for WL distinguishing the two subgroups of Animals is more relevant than distinguishing the other kingdoms. Morevover, in WL Animals are correctly split into Vertebrates and Invertebrates, up to a few outliers. Also, Protophotophytes form a big homogeneous group in all kernels, the outliers being Green and Red algae (i.e. unicellular organisms), which fall into another cluster in VH, WL and PM. Fungi are also grouped in one cluster by all kernels, with a few outliers (Microsporidians). However, Fungi always share their cluster with Protophotophytes (algae) and/or Protists. Accordingly, Protists are always mixed with Fungi and Protophotophytes (algae) and even split into two clusters in VH and PM, and three in SP.

Although all kernels give an interesting and meaningful classification, VH, SP and PM perform better at kingdom level: Animals, Protophotophytes and Fungi have their own cluster while Protists are split and mixed with Fungi and algae. However, WL has the advantage of keeping all Protists together in one group and offer the interesting viewpoint of evaluating more the Animals differences (Vertebrates vs. Invertebrates) than the kingdom boundaries.

The SVM prediction analysis also supports these results. In Table 4 we report the balanced accuracy obtained by each kernel: accuracy is above 95% in all cases.

Download:

Table 4. Eukaryotes test: Evaluation of SVM prediction results for discerning kingdoms.

https://doi.org/10.1371/journal.pone.0281047.t004

Discerning animals.

This experiment focuses on the Animals kingdom and tests whether AMNs, compared through the considered kernels methods, are useful to classify organisms at a lower taxonomic level, specifically at the Phylum level. Fig 4 shows the heatmap of the PM similarity matrix (on the left) and the corresponding 4-means clustering (on the right), k = 4 being the optimal number of clusters, as suggested by the Elbow method [39].

Download:

Fig 4. Heatmap of the PM similarity matrix of all KEGG Animals (left) and composition of clusters w.r.t. a total of four clusters for the PM graph kernel.

https://doi.org/10.1371/journal.pone.0281047.g004

Looking at the heatmap in Fig 4, it is evident that the PM kernel does not allow for distinguishing the various Phyla, highlighted in the matrix by the two coloured bars on rows and columns. However, it separates Vertebrates and Invertebrates, and, within Invertebrates, almost all Arthropods are discernible. Finally, it can be noticed that two organisms appear to be far apart from all other Animals (see the light lines in the bottom part of the rows and, by symmetry, right part of the columns). This could be due to (temporarily) incomplete data for those organisms in the KEGG database. The k-means results reflect the main pattern of the heatmap. There are two homogeneous clusters: cluster 1 containing all and only Vertebrates and cluster 4 containing 97 Arthropods (almost all KEGG Insects) out of 119. Cluster 3 includes the remaining Animals, up to the two outliers that fall into cluster 2.

The other kernels also confirm the results shown here for the PM kernel. We refer to the Supporting Information for the VH, SP and WL heatmaps, MDS and k-means results, see S6 File.

In order to investigate whether AMNs are able to classify at a lower rank of taxonomy, we performed a further test focusing on Vertebrates only. The heatmaps obtained by the four kernels can be found in the Supporting Information, see S7 File. While VH, SP and PM do not provide any meaningful classification, WL is able to distinguish two main groups, roughly corresponding to Fishes and Mammals.

It is noteworthy that, at this level of taxonomy, the four kernels have different classification abilities. Nonetheless, there is a trend that using more features, like WL and PM, allows for a relatively better ability to differentiate between the groups. But, as evidenced by the different performance of the two kernels, the features considered by the kernels also affect their classification ability.

Discerning protophotophytes.

This experiment considers all the KEGG “Plants” (more precisely Protophotophytes). Their AMNs are compared using the four kernels to see whether the internal structure of the kingdom can be recognised. Fig 5 shows the MDS plot (on the left) and 3-means clustering (on the right) resulting from the SP graph kernel. The MDS highlights that Algae (red and green dots) are well separated from the other Protophotophytes, which instead appear to be rather mixed together. The k-means clustering also confirms this pattern: cluster 2 contains all Algae, while the other plants fall into the other two clusters without a clear separation. The other kernels show similar results, as illustrated in the Supporting Information, see S8 File.

Download:

Fig 5. MDS plot of the SP similarity matrix of all KEGG Protophotophytes (left) and composition of clusters w.r.t. a total of three clusters for the SP graph kernel.

https://doi.org/10.1371/journal.pone.0281047.g005

Discerning Fungi.

This experiment considers the KEGG Fungi and examines whether their AMNs, compared through the kernel methods, can identify further differences inside the kingdom. Fig 6 shows the MDS plot (on the left side) obtained by the WL kernel and the Fungi classification at the phylum level (on the right side). We can observe that Microsporidians (red dots) are separated from the other taxa and that Ascomycetes (blue dots) form two separated groups.

Download:

Fig 6. MDS plot of the WL similarity matrix of the KEGG Fungi (left) and composition of clusters w.r.t. a total of three clusters for the WL graph kernel.

https://doi.org/10.1371/journal.pone.0281047.g006

On the right panel of Fig 6, the 3-means clustering confirms the MDS observations. Microsporidians form a homogeneous group (cluster 2) and Ascomycetes are divided in two different groups. Cluster 1 contains all Basidiomycetes and part of the Ascomycetes (namely Saccharomycetes, a class of Ascomycetes), while the rest of Ascomycetes fall in cluster 3. The other graph kernels show similar results, we refer to the Supporting Information for further details, see S9 File.

Discerning Protists.

In this last experiment on the Eukaryotes domain, we consider the KEGG Protists. Fig 7 shows the heatmap (left) and k-means classification (right) obtained by the PM kernel.

Download:

Fig 7. Heatmap of the PM similarity matrix of the KEGG Protists (left) and composition of clusters w.r.t. a total of four clusters for the PM graph kernel.

https://doi.org/10.1371/journal.pone.0281047.g007

We observe that the Protists similarity matrix does not contain a clear pattern. This is also reflected in the k-means clustering, where all Euglenozoa are grouped in a homogeneous cluster (cluster 1). This is true also for part of the Alveolates (cluster 4), but the remaining ones are divided into clusters 2 and 3. The other kernels show similar results, as illustrated in the Supporting Information, see S10 File.

Third experiment: Prokaryotes analysis

When looking at Prokaryotes in Fig 2, showing the MDS plot of the whole dataset of organisms, it is evident that no clear pattern separates Bacteria from Archaea. In addition, even within Bacteria, no clear pattern separating the different clades at specific taxonomic levels (i.e., Phylum, order) were observed, as reflects Fig 8, which shows the MDS plot of the SP graph kernel for a selected set of Bacteria (i.e. all the Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria and Tenericutes). Looking at Fig 8, only Tenericutes appear to be slightly recognizable w.r.t the other clades. Similar plots were obtained by the other graph kernels, the complete analysis can be found in the Supporting Information, see S11 File.

Download:

Fig 8. MDS plot of a selected set of bacteria at phylum level, obtained by the SP graph kernel.

https://doi.org/10.1371/journal.pone.0281047.g008

Pseudomonas analysis at species level.

In order to analyse the usefulness of AMNs at a lower rank of taxonomy, we considered the bacteria in the genus Pseudomonas. More precisely, we compare directly the gene-based phylogeny of the 120 Pseudomonas strain available in KEGG with the dendrogram resulting from applying the complete-linkage hierarchical clustering to the kernels similarity matrices. The best result is shown in Fig 9: the gene-based phylogeny (below) is compared with the WL kernel dendrogram (above).

Download:

Fig 9. Comparison between Pseudomonas phylogeny and complete-linkage hierarchical clustering obtained with the WL graph kernel.

The lines colour legend is as follows. The colours were assigned based on the position of the cluster within the phylogenetic tree. Blue was assigned to the deepest branching cluster followed in order by turquoise, green, yellow, orange and, finally, red which was assigned to the most recent branching cluster.

https://doi.org/10.1371/journal.pone.0281047.g009

Looking at Fig 9, we observe that different clusters are relatively well-matched between the phylogeny and the clustering derived from the WL graph kernel. The species cluster of P. aeruginosa (turquoise lines) is relatively well-preserved in the complete linkage dendrogram, and the same is for clusters comprising respectively P. stutzeri (yellow lines) and P. putida (red lines next to the yellow ones).

The VH, SP and PM kernels results can be found in the Supporting Information, see S12 File. The various kernels perform differently in this experiment and even for the best case shown here, the comparison did not result in a perfect match between the WL dendrogram and the phylogeny. It is also noteworthy that VH gives the worst result in this test. This supports the fact that, at lower taxonomical levels, looking only at nodes and node labels does not allow for a meaningful comparison.

Discussion

In this section, we discuss the results of the various experiments reported in the previous section from a biological and evolutionary viewpoint. We are aware that our analyses could be biased due to the use of KEGG. For instance, KEGG annotation and pathways reconstruction, and also how some pathways are included may be artefactual because of how KEGG assign genes/reactions to pathways. Nonetheless, the general agreement of all graph kernels on the results of the performed experiments suggests that the signal coming out of the metabolisms’ comparison is robust. In addition, we repeated the analyses after removing the pathways harbouring only one (analysis with threshold 1) or at maximum two reactions (analysis with threshold 2), see S13 File. Such changes affected the results of the experiments only in a marginal manner, and the clustering patterns were highly similar.This further confirms the approach’s robustness.

Graph kernels take as input two graphs and provide as output a real number measuring their similarity. They do not allow recovering the nodes and edges of the input graphs that are responsible for the similarity result. As such, the nature of the analyses does not allow us to identify the specific pathways within the AMNs that drive the clustering we observe. Nonetheless, it is still possible to link the clustering to macroevolutionary events. In this respect, as discussed below, it turns out that the clustering pattern is meaningful from the biological and evolutionary viewpoints, thus suggesting that both the AMN representation and the applied computational techniques are appropriate for comparing and analyzing the organisms’ metabolism. We proceed now by discussing the results obtained in every experiment reported in the previous section.