Elsevier

Theoretical Computer Science

Volume 852, 8 January 2021, Pages 138-156
Theoretical Computer Science

Space-efficient construction of compressed suffix trees

https://doi.org/10.1016/j.tcs.2020.11.024Get rights and content

Abstract

We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(nlogσ) time using just o(nlogσ) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 0<ϵ1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in O(n(logσ+ϵ1loglogn)) time using at most nlogσ(ϵ+o(1)) bits of working space on top of the input re-writable BWT and the output. For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e. o(nlogσ) bits) and Θ(nlogσ+n(loglogn)1+δ) time, for any constant δ>0. This improves the previous most space-efficient algorithms, which worked in O(n) bits and O(nlogn) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(nlogσ) time and using just o(nlogσ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as n bits on top of a packed representation of the input/output and process data as fast as 2.92 megabases per second.

Section snippets

Introduction and related work

The increasingly-growing production of large string collections—especially in domains such as biology, where new generation sequencing technologies can nowadays generate gigabytes of data in few hours—is lately generating much interest towards fast and space-efficient algorithms able to index this data. The Burrows-Wheeler Transform [1] and its extension to sets of strings [2], [3] is becoming the gold-standard in the field: even when not compressed, its size is asymptotically smaller than

Basic concepts

Let Σ={c1,c2,,cσ} be a finite ordered alphabet of size σ with #=c1<c2<<cσ, where < denotes the standard lexicographic order. Given a text T=t1t2tnΣ we denote by |T| its length n. We assume that the input text is terminated by the special symbol (terminator) #, which does not appear elsewhere in T. We use ϵ to denote the empty string. A factor (or substring) of T is written as T[i,j]=titj with 1ijn. When declaring an array A, we use the same notation A[1,n] to indicate that the array

Belazzougui's enumeration algorithm

In [24], Belazzougui showed that a BWT with rank and range distinct functionality (see Section 2) is sufficient to enumerate in small space a rich representation of the internal nodes of the suffix tree of a text T. For the purposes of this article, we assume that the BWT is represented using a wavelet tree (whereas Belazzougui's original result is more general), and thus that all queries take O(logσ) time.

Theorem 1

Belazzougui [24]

Given the Burrows-Wheeler Transform of a text T[1,σ]n represented with a wavelet tree,

Beller et al.'s algorithm

The second ingredient used in our solutions is the following result, due to Beller et al. (we slightly re-formulate their result to fit our purposes, read below for a description of the differences):

Theorem 2

Beller et al. [27]

Given the Burrows-Wheeler Transform of a text T represented with a wavelet tree, we can enumerate all pairs (i,LCP[i]) in O(nlogσ) time using 5n bits of working space on top of the BWT.

Theorem 2 represents the state of the art for computing the LCP array from the BWT. Also Beller et al.'s algorithm

Enumerating LCP values

In this section we prove our first main result: how to enumerate LCP pairs (i,LCP[i]) using succinct working space on top of a wavelet tree representing the BWT. Later we will use this procedure to build the LCP and PLCP arrays in small space on top of a plain representation of the BWT. We give our lemma in the general form of string collections, which will require adapting the algorithms seen in the previous sections to this more general setting. Our first observation is that Theorem 1,

Enumerating suffix tree intervals

In this section we show that the procedures described in Section 5 can be used to enumerate all suffix tree intervals—that is, the suffix array intervals of all right-maximal text substrings—taking as input the BWT of a text. Note that in this section we consider just simple texts rather than string collections as later we will use this procedure to build the compressed suffix tree of a text.

When σn/log2n, we can directly use Belazzougui's procedure (Theorem 1), which already solves the

Building the PLCP bitvector

The PLCP array is defined as PLCP[i]=LCP[ISA[i]], and can thus be used to retrieve LCP values as LCP[i]=PLCP[SA[i]] (note that this requires accessing the suffix array). Kasai et al. showed in [16] that PLCP is almost increasing: PLCP[i+1]PLCP[i]1. This allows representing it in small space as follows. Let plcp[1,2n] denote the bitvector having a bit set at each position PLCP[i]+2i, for i=1,,n (and 0 in all other positions). Since PLCP[i+1]PLCP[i]1, the quantity PLCP[i]+2i is different for

Building the suffix tree topology

In order to build the suffix tree topology we use a strategy analogous to the one proposed by Belazzougui [24]. The main observation is that, given a procedure that enumerates suffix tree intervals, for each interval [l,r] we can increment a counter Open[l] and a counter Close[r], where Open and Close are integer vectors of length n. Then, the BPS representation of the suffix tree topology can be built by scanning left-to right the two arrays and, for each i=1,,n, append Open[i] open

Merging BWTs in small space

In this section we use our space-efficient BWT-navigation strategies to tackle an additional problem: to merge the BWTs of two string collections. In [24], [32], Belazzougui et al. show that Theorem 1 can be adapted to merge the BWTs of two texts T1,T2 and obtain the BWT of the collection {T1,T2} in O(nk) time and nlogσ(1+1/k)+11n+o(n) bits of working space for any k1 [32, Thm. 7]. We show that our strategy enables a more space-efficient algorithm for the task of merging BWTs of collections.

Implementation and experimental evaluation

We implemented our LCP construction and BWT merge algorithms on DNA alphabet in https://github.com/nicolaprezza/bwt2lcp using the language C++. Due to the small alphabet size, it was actually sufficient to implement our extension of Belazzougui's enumeration algorithm (and not the strategy of Beller et al., which becomes competitive only on large alphabets). The repository features a new packed string on DNA alphabet ΣDNA={A,C,G,T,#} using 4 bits per character and able to compute the quintuple

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

GR is partially, and NP is totally, supported by the project MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.

References (44)

  • T. Beller et al.

    Space-efficient construction of the Burrows-Wheeler transform

  • J. Fuentes-Sepúlveda et al.

    Space-efficient computation of the Burrows-Wheeler transform

  • D. Kempa et al.

    String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure

  • G. Navarro et al.

    Optimal dynamic sequence representations

    SIAM J. Comput.

    (2014)
  • N. Prezza et al.

    Detecting mutations by eBWT

  • N. Prezza et al.

    SNPs detection by eBWT positional clustering

    Algorithms Mol. Biol.

    (2019)
  • V. Guerrini et al.

    Lightweight metagenomic classification via eBWT

  • K. Sadakane

    Succinct representations of lcp information and improvements in the compressed suffix arrays

  • T. Kasai et al.

    Linear-time longest-common-prefix computation in suffix arrays and its applications

  • R. Grossi et al.

    Compressed suffix arrays and suffix trees with applications to text indexing and string matching

    SIAM J. Comput.

    (2005)
  • K. Sadakane

    Compressed suffix trees with full functionality

    Theory Comput. Syst.

    (2007)
  • E. Ohlebusch et al.

    CST++

  • Cited by (0)

    View full text