Abstract
Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size \(\gamma \) of the smallest string attractor, was introduced. The measure \(\gamma \) lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space \(O(\gamma \log \frac{n}{\gamma })\), which also upper bounds most measures (including z). While \(\gamma \) is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no \(o(\gamma \log n)\)-space representation of strings is known. In this paper, we study a smaller measure, \(\delta \le \gamma \), which can be computed in linear time. We show that \(\delta \) better captures the compressibility of repetitive strings. For every length n and every value \(\delta \ge 2\), we construct a string such that \(\gamma = \varOmega (\delta \log \frac{n}{\delta })\). Still, we show a representation of any string S in \(O(\delta \log \frac{n}{\delta })\) space that supports direct access to any character S[i] in time \(O(\log \frac{n}{\delta })\) and finds the occ occurrences of any pattern \(P[1{.\,.}m]\) in time \(O(m\log n + occ\log ^\varepsilon n)\) for any constant \(\varepsilon >0\). Further, we prove that no \(o(\delta \log n)\)-space representation exists: for every length n and every value \(2\le \delta \le n^{1-\varepsilon }\), we exhibit a string family whose elements can only be encoded in \(\varOmega (\delta \log \frac{n}{\delta })\) space. We complete our characterization of \(\delta \) by showing that, although \(\gamma \), z, and other repetitiveness measures are always \(O(\delta \log \frac{n}{\delta })\), for strings of any length n, the smallest context-free grammar can be of size \(\varOmega (\delta \log ^2 n/\log \log n)\). No such separation is known for \(\gamma \).
Keywords
Part of this work was carried out during the Dagstuhl Seminar 19241, “25 Years of the Burrows–Wheeler Transform”.
T. Kociumaka—Supported by ISF grants no. 1278/16, 824/17, and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM (no. 683064) under the EU’s Horizon 2020 Research and Innovation Programme.
G. Navarro—Supported in part by Fondecyt grant 1-170048, Chile; Millennium Institute for Foundational Research on Data (IMFD), Chile.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Throughout the paper, the size of data structures is measured in machine words.
- 2.
The most recent index [13] locates patterns in \(O(m + (occ+1)\log ^\epsilon n)\) time and \(O(\gamma \log \frac{n}{\gamma })\) space (being thus faster but still using more space).
- 3.
If not, we simply pad S with spurious symbols at the end; whole spurious blocks are not represented. The extra space incurred is only O(rh) for a block tree of height h. The actual construction [5] uses instead blocks of sizes \(\lfloor n/s\rfloor \) and \(\lceil n/s \rfloor \).
References
Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 142–154. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48350-3_13
Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds.) SPIRE 2017. LNCS, vol. 10508, pp. 161–175. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67428-5_14
Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 7:1–7:13 (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.7
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_3
Belazzougui, D., et al.: Queries on LZ-bounded encodings. In: Proceedings of the 2015 Data Compression Conference, DCC, pp. 83–92 (2015). https://doi.org/10.1109/DCC.2015.69
Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time-space trade-offs for Lempel-Ziv compressed indexing. Theor. Comput. Sci. 713, 66–77 (2018). https://doi.org/10.1016/j.tcs.2017.12.021
Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. J. Discrete Algorithms 25, 42–50 (2014). https://doi.org/10.1016/j.jda.2013.06.003
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015). https://doi.org/10.1137/130936889
Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987). https://doi.org/10.1145/28869.28873
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994). https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf
Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005). https://doi.org/10.1109/TIT.2005.850116
Christiansen, A.R., Ettienne, M.B.: Compressed indexing with signature grammars. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 331–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_25
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. arXiv e-prints p. 1811.12779 (2019). https://arxiv.org/abs/1811.12779
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011). https://doi.org/10.3233/FI-2011-565
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 240–251. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28332-1_21
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54423-1_63
Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 490–503. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_36
Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 1–54 (2020). https://doi.org/10.1145/3375890
Gallant, J.K.: String Compression Algorithms. Ph.D. thesis, Princeton Univ. (1982)
Jez, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016). https://doi.org/10.1016/j.tcs.2015.12.032
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjecture. arXiv e-prints p. 1910.10631 (2019). https://arxiv.org/abs/1910.10631
Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. In: Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC, pp. 827–840 (2018). https://doi.org/10.1145/3188745.3188814
Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci. 298(1), 253–272 (2003). https://doi.org/10.1016/S0304-3975(02)00426-7
Kieffer, J.C., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000). https://doi.org/10.1109/18.841160
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Int. J. Comput. Math. 2(1–4), 157–168 (1968). https://doi.org/10.1080/00207166808803030
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013). https://doi.org/10.1016/j.tcs.2012.02.006
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010). https://doi.org/10.1089/cmb.2009.0169
Navarro, G., Ochoa, C., Prezza, N.: On the approximation ratio of ordered parsings. arXiv e-prints p. 1803.09517 (2019). https://arxiv.org/abs/1803.09517
Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press, New York (2016). https://doi.org/10.1017/cbo9781316588284
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 1 (2007). https://doi.org/10.1145/1216370.1216372
Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019). https://doi.org/10.1016/j.tcs.2018.09.007
Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Fully dynamic data structure for LCE queries in compressed space. In: Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS, pp. 72:1–72:15 (2016). https://doi.org/10.4230/LIPIcs.MFCS.2016.72
Prezza, N.: Optimal rank and select queries on dictionary-compressed text. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 4:1–4:12 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.4
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica 65(3), 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6
Rodeh, M., Pratt, V.R., Even, S.: Linear algorithm for data compression via string matching. J. ACM 28(1), 16–24 (1981). https://doi.org/10.1145/322234.322237
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003). https://doi.org/10.1016/S0304-3975(02)00777-6
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 398–403 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Stephens, Z.D., et al.: Big data: astronomical or genomical? PLOS Biol. 13(7), e1002195 (2015). https://doi.org/10.1371/journal.pbio.1002195
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kociumaka, T., Navarro, G., Prezza, N. (2020). Towards a Definitive Measure of Repetitiveness. In: Kohayakawa, Y., Miyazawa, F.K. (eds) LATIN 2020: Theoretical Informatics. LATIN 2021. Lecture Notes in Computer Science(), vol 12118. Springer, Cham. https://doi.org/10.1007/978-3-030-61792-9_17
Download citation
DOI: https://doi.org/10.1007/978-3-030-61792-9_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61791-2
Online ISBN: 978-3-030-61792-9
eBook Packages: Computer ScienceComputer Science (R0)