Space-efficient construction of compressed suffix trees
Section snippets
Introduction and related work
The increasingly-growing production of large string collections—especially in domains such as biology, where new generation sequencing technologies can nowadays generate gigabytes of data in few hours—is lately generating much interest towards fast and space-efficient algorithms able to index this data. The Burrows-Wheeler Transform [1] and its extension to sets of strings [2], [3] is becoming the gold-standard in the field: even when not compressed, its size is asymptotically smaller than
Basic concepts
Let be a finite ordered alphabet of size σ with , where < denotes the standard lexicographic order. Given a text we denote by its length n. We assume that the input text is terminated by the special symbol (terminator) #, which does not appear elsewhere in T. We use ϵ to denote the empty string. A factor (or substring) of T is written as with . When declaring an array A, we use the same notation to indicate that the array
Belazzougui's enumeration algorithm
In [24], Belazzougui showed that a BWT with rank and range distinct functionality (see Section 2) is sufficient to enumerate in small space a rich representation of the internal nodes of the suffix tree of a text T. For the purposes of this article, we assume that the BWT is represented using a wavelet tree (whereas Belazzougui's original result is more general), and thus that all queries take time.
Theorem 1 Given the Burrows-Wheeler Transform of a text represented with a wavelet tree,Belazzougui [24]
Beller et al.'s algorithm
The second ingredient used in our solutions is the following result, due to Beller et al. (we slightly re-formulate their result to fit our purposes, read below for a description of the differences):
Theorem 2 Given the Burrows-Wheeler Transform of a text T represented with a wavelet tree, we can enumerate all pairs in time using 5n bits of working space on top of the BWT.Beller et al. [27]
Theorem 2 represents the state of the art for computing the LCP array from the BWT. Also Beller et al.'s algorithm
Enumerating LCP values
In this section we prove our first main result: how to enumerate LCP pairs using succinct working space on top of a wavelet tree representing the BWT. Later we will use this procedure to build the LCP and PLCP arrays in small space on top of a plain representation of the BWT. We give our lemma in the general form of string collections, which will require adapting the algorithms seen in the previous sections to this more general setting. Our first observation is that Theorem 1,
Enumerating suffix tree intervals
In this section we show that the procedures described in Section 5 can be used to enumerate all suffix tree intervals—that is, the suffix array intervals of all right-maximal text substrings—taking as input the BWT of a text. Note that in this section we consider just simple texts rather than string collections as later we will use this procedure to build the compressed suffix tree of a text.
When , we can directly use Belazzougui's procedure (Theorem 1), which already solves the
Building the PLCP bitvector
The PLCP array is defined as , and can thus be used to retrieve LCP values as (note that this requires accessing the suffix array). Kasai et al. showed in [16] that PLCP is almost increasing: . This allows representing it in small space as follows. Let denote the bitvector having a bit set at each position , for (and 0 in all other positions). Since , the quantity is different for
Building the suffix tree topology
In order to build the suffix tree topology we use a strategy analogous to the one proposed by Belazzougui [24]. The main observation is that, given a procedure that enumerates suffix tree intervals, for each interval we can increment a counter and a counter , where Open and Close are integer vectors of length n. Then, the BPS representation of the suffix tree topology can be built by scanning left-to right the two arrays and, for each , append open
Merging BWTs in small space
In this section we use our space-efficient BWT-navigation strategies to tackle an additional problem: to merge the BWTs of two string collections. In [24], [32], Belazzougui et al. show that Theorem 1 can be adapted to merge the BWTs of two texts and obtain the BWT of the collection in time and bits of working space for any [32, Thm. 7]. We show that our strategy enables a more space-efficient algorithm for the task of merging BWTs of collections.
Implementation and experimental evaluation
We implemented our LCP construction and BWT merge algorithms on DNA alphabet in https://github.com/nicolaprezza/bwt2lcp using the language C++. Due to the small alphabet size, it was actually sufficient to implement our extension of Belazzougui's enumeration algorithm (and not the strategy of Beller et al., which becomes competitive only on large alphabets). The repository features a new packed string on DNA alphabet using 4 bits per character and able to compute the quintuple
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
GR is partially, and NP is totally, supported by the project MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.
References (44)
- et al.
An extension of the burrows-wheeler transform
Theor. Comput. Sci.
(2007) - et al.
Lightweight algorithms for constructing and inverting the BWT of string collections
Theor. Comput. Sci.
(2013) Fast BWT in small space by blockwise suffix sorting
Theor. Comput. Sci.
(2007)- et al.
Lightweight LCP construction for very large collections of strings
J. Discret. Algorithms
(2016) - et al.
An improved algorithm for the all-pairs suffix-prefix problem
J. Discret. Algorithms
(2016) - et al.
Computing the longest common prefix array based on the Burrows–Wheeler transform
J. Discret. Algorithms
(2013) Wavelet trees for all
J. Discret. Algorithms
(2014)- et al.
The wavelet matrix: an efficient wavelet tree for large alphabets
Inf. Syst.
(2015) - et al.
A Block Sorting data Compression Algorithm
(1994) - et al.
Average linear time and compressed space construction of the Burrows-Wheeler transform