Skip to main content
Log in

LZ77 Computation Based on the Run-Length Encoded BWT

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that \(\mathcal {O}(z)\) words of space can be significantly (up to exponentially) smaller than the size n of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number \(r\) of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. As z, the measure \(r\) is closely related to the number of repetitions in the text and can be exponentially smaller than n. We describe two algorithms computing LZ77 in \(\mathcal {O}(r\log n)\) bits of working space and \(\mathcal {O}(n\log r)\) time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

Notes

  1. e.g. by inserting the characters \(c_1^{\lambda _1}c_2^{\lambda _2}\dots c_r^{\lambda _r}\) in the dynamic string structure described in the previous section.

References

  1. Bannai, H., Gawrychowski, P., Inenaga, S., Takeda, M.: Converting SLP to LZ78 in almost Linear Time. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, pp. 38–49. Springer (2013)

  2. Bannai, H., Inenaga, S., Takeda, M.: Efficient LZ78 factorization of grammar compressed text. In: Proceedings of 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012, Cartagena de Indias, Colombia, October 21–25, 2012, vol. 7608, p. 86. Springer (2012)

  3. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Proceedings of CPM, pp. 26–39 (2015)

  4. Belazzougui, D., Gagie, T., Gawrychowski, P., Kärkkäinen, J., Ordónez, A., Puglisi, S.J., Tabei, Y.: Queries on LZ-Bounded Encodings. arXiv:1412.0967 (2014)

  5. Belazzougui, D., Puglisi, S.J.: Range Predecessor and Lempel–Ziv Parsing. arXiv:1507.07080 (2015)

  6. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)

  7. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  8. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inf. 111(3), 313–337 (2011)

    MathSciNet  MATH  Google Scholar 

  9. Crochemore, M., Ilie, L.: Computing longest previous factor in linear time and applications. Inf. Process. Lett. 106(2), 75–80 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  10. DYNAMIC: dynamic succinct/compressed data structures library. https://github.com/nicolaprezza/DYNAMIC. Accessed 20 May 2016

  11. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st Annual Symposium on Foundations of Computer Science, 2000. Proceedings, pp. 390–398. IEEE (2000)

  12. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: Proceedings of 11th International Conference on String Processing and Information Retrieval, SPIRE 2004, Padova, Italy, October 5–8, 2004, vol. 3246, p. 150. Springer (2004)

  13. Fici, G.: Factorizations of the fibonacci infinite word. J. Integer Seq. 18(2), 3 (2015)

    MathSciNet  MATH  Google Scholar 

  14. Fischer, J., Gagie, T., Gawrychowski, P.L., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: ESA 2015 LNCS 9294, p. 533

  15. Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006)

    Article  MathSciNet  MATH  Google Scholar 

  16. get-git-revisions: Get all revisions of a git repository. https://github.com/nicolaprezza/get-git-revisions. Accessed 20 May 2016

  17. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: 13th International Symposium on Experimental Algorithms, (SEA 2014), pp. 326–337 (2014)

  18. LZ77 factorization algorithms. https://www.cs.helsinki.fi/group/pads/lz77.html. Accessed 20 May 2016

  19. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Experimental Algorithms, pp. 139–150. Springer (2013)

  20. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, p. 189. Springer (2013)

  21. Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the Meeting on Algorithm Engineering & Expermiments, pp. 103–112. Society for Industrial and Applied Mathematics (2013)

  22. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  23. Mantaci, S., Restivo, A., Sciortino, M.: Burrows–Wheeler transform and sturmian words. Inf. Process. Lett. 86(5), 241–246 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  24. Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)

    Article  MathSciNet  MATH  Google Scholar 

  25. Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Dynamic Index and LZ Factorization in Compressed Space. In Jan, H., Jan, Ž. (eds.) Proceedings of the Prague Stringology Conference 2016, pp. 158–170. Czech Technical University in Prague, Czech Republic (2016)

  26. Ohlebusch, E., Gog, S.: Lempel–Ziv factorization revisited. In: Proceedings of 22nd Annual Symposium on Combinatorial Pattern Matching, CPM 2011, Palermo, Italy, June 27–29, 2011, vol. 6661, p. 15. Springer (2011)

  27. pizza&chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus/real/. Accessed 20 May 2016

  28. Policriti, A., Gigante, N., Prezza, N.: Average linear time and compressed space construction of the Burrows–Wheeler transform. In: Language and Automata Theory and Applications. Lecture Notes in Computer Science, vol. 8977, pp. 587–598. Springer International Publishing (2015)

  29. Policriti, A., Prezza, N.: Fast Online Lempel–Ziv Factorization in Compressed Space. In: String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 9309, pp. 13–20. Springer International Publishing (2015). doi:10.1007/978-3-319-23826-5_2

  30. Rytter, W.: Application of Lempel–Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1), 211–222 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  31. Sirén, J., et al.: Compressed full-text indexes for highly repetitive collections. PhD thesis, Helsingin yliopisto (2012)

  32. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Proceedings of 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008, Melbourne, Australia, November 10–12, 2008, vol. 5280, p. 164. Springer (2008)

  33. Takabatake, Y., Tabei, Y., Sakamoto, H.: Online self-indexed grammar compression. In Proceedings of SPIRE. Lecture Notes in Computer Science, vol. 9309, pp. 258–269. Springer International Publishing (2015)

  34. Tamakoshi, Y., Tomohiro I., Inenaga, S., Bannai, H., Takeda, M.: From run length encoding to LZ78 and back again. In: Data Compression Conference (DCC), 2013, pp. 143–152. IEEE (2013)

  35. wiki-get: Download all versions of a Wikipedia page. https://github.com/nicolaprezza/wiki_get. Accessed 20 May 2016

  36. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nicola Prezza.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Policriti, A., Prezza, N. LZ77 Computation Based on the Run-Length Encoded BWT. Algorithmica 80, 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-017-0327-z

Keywords

Navigation