Abstract
Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires \( O(n \log n) \) bits, n being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of alphabet is constant.
In this paper, we propose a sampling technique that allows to access an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time trade-off to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a k-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in k, thus improving a previous bound by Boucher et al. which was linear in k [DCC 2015].
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Regular languages meet prefix sorting. In: Proceedings of the 31st Symposium on Discrete Algorithms, (SODA 2020), pp. 911–930. SIAM (2020). https://doi.org/10.1137/1.9781611975994.55
Alanko, J., D’Agostino, G., Policriti, A., Prezza, N.: Wheeler languages. Inf. Comput. 281, 104820 (2021)
Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021, pMID: 22506599
Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: 2015 Data Compression Conference, pp. 383–392 (2015). https://doi.org/10.1109/DCC.2015.70
Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)
Conte, A., Cotumaccio, N., Gagie, T., Manzini, G., Prezza, N., Sciortino, M.: Computing matching statistics on Wheeler DFAs. In: 2023 Data Compression Conference (DCC), pp. 150–159 (2023). https://doi.org/10.1109/DCC55655.2023.00023
Cotumaccio, N.: Graphs can be succinctly indexed for pattern matching in \(O(\vert E\vert ^{2}+\vert V\vert ^{5/2})\) time. In: 2022 Data Compression Conference (DCC), pp. 272–281 (2022). https://doi.org/10.1109/DCC52660.2022.00035
Cotumaccio, N., Prezza, N.: On indexing and compressing finite automata. In: Proceedings of the 32nd Symposium on Discrete Algorithms, (SODA 2021), pp. 2585–2599. SIAM (2021). https://doi.org/10.1137/1.9781611976465.153
Cotumaccio, N., D’Agostino, G., Policriti, A., Prezza, N.: Co-lexicographically ordering automata and regular languages - part I. J. ACM 70, 1–73 (2023). https://doi.org/10.1145/3607471
Díaz-Domínguez, D., Gagie, T., Navarro, G.: Simulating the DNA overlap graph in succinct space. In: Pisanti, N., Pissis, S.P. (eds.) 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), vol. 128, pp. 26:1–26:20. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.26, http://drops.dagstuhl.de/opus/volltexte/2019/10497
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005), pp. 184–193 (2005). https://doi.org/10.1109/SFCS.2005.69
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pp. 390–398 (2000). https://doi.org/10.1109/SFCS.2000.892127
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (2009). https://doi.org/10.1145/1613676.1613680
Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12200-2_16
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011). https://doi.org/10.1137/090779759
Gagie, T., Manzini, G., Sirén, J.: Wheeler graphs: a framework for BWT-based data structures. Theor. Comput. Sci. 698, 67–78 (2017). https://doi.org/10.1016/j.tcs.2017.06.016, https://www.sciencedirect.com/science/article/pii/S0304397517305285, algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo)
Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comput. Biol. 2(2), 291–306 (1995)
Li, R., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–72 (2009). https://doi.org/10.1101/gr.097261.109
Liu, M., Yu, H.: Lower bound for succinct range minimum query. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1402–1415. STOC 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3357713.3384260
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
Navarro, G.: Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4) (2014). https://doi.org/10.1145/2535933
Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press (2016). http://www.cambridge.org/de/academic/subjects/computer-science/algorithmics-complexity-computer-algebra-and-computational-g/compact-data-structures-practical-approach?format=HB
Peng, Yu., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA – a practical iterative de bruijn graph de novo assembler. In: Berger, B. (ed.) RECOMB 2010. LNCS, vol. 6044, pp. 426–440. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12683-3_28
Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Nat. Acad. Sci. 98(17), 9748–9753 (2001). https://doi.org/10.1073/pnas.171285098
Sadakane, K.: Compressed suffix trees with full functionality. Theor. Comp. Sys. 41(4), 589–607 (2007). https://doi.org/10.1007/s00224-006-1198-x
Simpson, J., Wong, K., Jackman, S., Schein, J., Jones, S., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009). https://doi.org/10.1101/gr.089532.108
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973). https://doi.org/10.1109/SWAT.1973.13
Acknowledgements
Travis Gagie: funded by National Institutes of Health (NIH) NIAID (grant no. HG011392), the National Science Foundation NSF IIBR (grant no. 2029552) and a Natural Science and Engineering Research Council (NSERC) Discovery Grant (grant no. RGPIN-07185-2020). Dominik Köppl: supported by JSPS KAKENHI with No. JP21K17701 and JP23H04378. Nicola Prezza: funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cotumaccio, N., Gagie, T., Köppl, D., Prezza, N. (2023). Space-Time Trade-Offs for the LCP Array of Wheeler DFAs. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_12
Download citation
DOI: https://doi.org/10.1007/978-3-031-43980-3_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer ScienceComputer Science (R0)