LZ77 Computation Based on the Run-Length Encoded BWT

Policriti, Alberto; Prezza, Nicola

doi:10.1007/s00453-017-0327-z

LZ77 Computation Based on the Run-Length Encoded BWT

Published: 17 July 2017

Volume 80, pages 1986–2011, (2018)
Cite this article

Algorithmica Aims and scope Submit manuscript

832 Accesses
22 Citations
Explore all metrics

Abstract

Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that \(\mathcal {O}(z)\) words of space can be significantly (up to exponentially) smaller than the size n of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number \(r\) of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. As z, the measure \(r\) is closely related to the number of repetitions in the text and can be exponentially smaller than n. We describe two algorithms computing LZ77 in \(\mathcal {O}(r\log n)\) bits of working space and \(\mathcal {O}(n\log r)\) time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

e.g. by inserting the characters \(c_1^{\lambda _1}c_2^{\lambda _2}\dots c_r^{\lambda _r}\) in the dynamic string structure described in the previous section.

References

Bannai, H., Gawrychowski, P., Inenaga, S., Takeda, M.: Converting SLP to LZ78 in almost Linear Time. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, pp. 38–49. Springer (2013)
Bannai, H., Inenaga, S., Takeda, M.: Efficient LZ78 factorization of grammar compressed text. In: Proceedings of 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012, Cartagena de Indias, Colombia, October 21–25, 2012, vol. 7608, p. 86. Springer (2012)
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Proceedings of CPM, pp. 26–39 (2015)
Belazzougui, D., Gagie, T., Gawrychowski, P., Kärkkäinen, J., Ordónez, A., Puglisi, S.J., Tabei, Y.: Queries on LZ-Bounded Encodings. arXiv:1412.0967 (2014)
Belazzougui, D., Puglisi, S.J.: Range Predecessor and Lempel–Ziv Parsing. arXiv:1507.07080 (2015)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
Article MathSciNet MATH Google Scholar
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inf. 111(3), 313–337 (2011)
MathSciNet MATH Google Scholar
Crochemore, M., Ilie, L.: Computing longest previous factor in linear time and applications. Inf. Process. Lett. 106(2), 75–80 (2008)
Article MathSciNet MATH Google Scholar
DYNAMIC: dynamic succinct/compressed data structures library. https://github.com/nicolaprezza/DYNAMIC. Accessed 20 May 2016
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st Annual Symposium on Foundations of Computer Science, 2000. Proceedings, pp. 390–398. IEEE (2000)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: Proceedings of 11th International Conference on String Processing and Information Retrieval, SPIRE 2004, Padova, Italy, October 5–8, 2004, vol. 3246, p. 150. Springer (2004)
Fici, G.: Factorizations of the fibonacci infinite word. J. Integer Seq. 18(2), 3 (2015)
MathSciNet MATH Google Scholar
Fischer, J., Gagie, T., Gawrychowski, P.L., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: ESA 2015 LNCS 9294, p. 533
Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006)
Article MathSciNet MATH Google Scholar
get-git-revisions: Get all revisions of a git repository. https://github.com/nicolaprezza/get-git-revisions. Accessed 20 May 2016
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: 13th International Symposium on Experimental Algorithms, (SEA 2014), pp. 326–337 (2014)
LZ77 factorization algorithms. https://www.cs.helsinki.fi/group/pads/lz77.html. Accessed 20 May 2016
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Experimental Algorithms, pp. 139–150. Springer (2013)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of 24th Annual Symposium on Combinatorial Pattern Matching, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013, vol. 7922, p. 189. Springer (2013)
Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the Meeting on Algorithm Engineering & Expermiments, pp. 103–112. Society for Industrial and Applied Mathematics (2013)
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
Article MathSciNet MATH Google Scholar
Mantaci, S., Restivo, A., Sciortino, M.: Burrows–Wheeler transform and sturmian words. Inf. Process. Lett. 86(5), 241–246 (2003)
Article MathSciNet MATH Google Scholar
Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)
Article MathSciNet MATH Google Scholar
Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Dynamic Index and LZ Factorization in Compressed Space. In Jan, H., Jan, Ž. (eds.) Proceedings of the Prague Stringology Conference 2016, pp. 158–170. Czech Technical University in Prague, Czech Republic (2016)
Ohlebusch, E., Gog, S.: Lempel–Ziv factorization revisited. In: Proceedings of 22nd Annual Symposium on Combinatorial Pattern Matching, CPM 2011, Palermo, Italy, June 27–29, 2011, vol. 6661, p. 15. Springer (2011)
pizza&chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus/real/. Accessed 20 May 2016
Policriti, A., Gigante, N., Prezza, N.: Average linear time and compressed space construction of the Burrows–Wheeler transform. In: Language and Automata Theory and Applications. Lecture Notes in Computer Science, vol. 8977, pp. 587–598. Springer International Publishing (2015)
Policriti, A., Prezza, N.: Fast Online Lempel–Ziv Factorization in Compressed Space. In: String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 9309, pp. 13–20. Springer International Publishing (2015). doi:10.1007/978-3-319-23826-5_2
Rytter, W.: Application of Lempel–Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1), 211–222 (2003)
Article MathSciNet MATH Google Scholar
Sirén, J., et al.: Compressed full-text indexes for highly repetitive collections. PhD thesis, Helsingin yliopisto (2012)
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Proceedings of 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008, Melbourne, Australia, November 10–12, 2008, vol. 5280, p. 164. Springer (2008)
Takabatake, Y., Tabei, Y., Sakamoto, H.: Online self-indexed grammar compression. In Proceedings of SPIRE. Lecture Notes in Computer Science, vol. 9309, pp. 258–269. Springer International Publishing (2015)
Tamakoshi, Y., Tomohiro I., Inenaga, S., Bannai, H., Takeda, M.: From run length encoding to LZ78 and back again. In: Data Compression Conference (DCC), 2013, pp. 143–152. IEEE (2013)
wiki-get: Download all versions of a Wikipedia page. https://github.com/nicolaprezza/wiki_get. Accessed 20 May 2016
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, Mathematics, and Physics, University of Udine, Udine, Italy
Alberto Policriti & Nicola Prezza
Applied Genomics Institute, Udine, Italy
Alberto Policriti

Authors

Alberto Policriti
View author publications
You can also search for this author in PubMed Google Scholar
Nicola Prezza
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nicola Prezza.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Policriti, A., Prezza, N. LZ77 Computation Based on the Run-Length Encoded BWT. Algorithmica 80, 1986–2011 (2018). https://doi.org/10.1007/s00453-017-0327-z

Download citation

Received: 30 June 2016
Accepted: 25 May 2017
Published: 17 July 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s00453-017-0327-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

LZ77 Computation Based on the Run-Length Encoded BWT

Abstract

Access this article

Similar content being viewed by others

Non-overlapping Indexing in BWT-Runs Bounded Space

Composite Repetition-Aware Data Structures

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Abstract

Access this article

Similar content being viewed by others

Non-overlapping Indexing in BWT-Runs Bounded Space

Composite Repetition-Aware Data Structures

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation