Faster Prefix-Sorting Algorithms for Deterministic Finite Automata

Authors Sung-Hwan Kim , Francisco Olivares , Nicola Prezza



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.16.pdf
  • Filesize: 0.93 MB
  • 16 pages

Document Identifiers

Author Details

Sung-Hwan Kim
  • DAIS, Ca' Foscari University of Venice, Italy
Francisco Olivares
  • CeBiB - Centre for Biotechnology and Bioengineering, Santiago, Chile
  • Department of Computer Science, University of Chile, Santiago, Chile
Nicola Prezza
  • DAIS, Ca' Foscari University of Venice, Italy

Cite AsGet BibTex

Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza. Faster Prefix-Sorting Algorithms for Deterministic Finite Automata. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.16

Abstract

Sorting is a fundamental algorithmic pre-processing technique which often allows to represent data more compactly and, at the same time, speeds up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these techniques were extended to sets of strings described by means of finite automata: the theory of Wheeler graphs [Gagie et al., TCS'17] introduced automata whose states can be totally-sorted according to the co-lexicographic (co-lex in the following) order of the prefixes of words accepted by the automaton. More recently, in [Cotumaccio, Prezza, SODA'21] it was shown how to extend these ideas to arbitrary automata by means of partial co-lex orders. This work showed that a co-lex order of minimum width (thus optimizing search query times) on deterministic finite automata (DFAs) can be computed in O(m² + n^{5/2}) time, m being the number of transitions and n the number of states of the input DFA. In this paper, we exhibit new combinatorial properties of the minimum-width co-lex order of DFAs and exploit them to design faster prefix sorting algorithms. In particular, we describe two algorithms sorting arbitrary DFAs in O(mn) and O(n² log n) time, respectively, and an algorithm sorting acyclic DFAs in O(m log n) time. Within these running times, all algorithms compute also a smallest chain partition of the partial order (required to index the DFA). We present an experiment result to show that an optimized implementation of the O(n² log n)-time algorithm exhibits a nearly-linear behaviour on large deterministic pan-genomic graphs and is thus also of practical interest.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Formal languages and automata theory
Keywords
  • String Matching
  • Deterministic Finite Automata
  • Graph Indexing
  • Co-lexicographical Sorting

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 911-930, 2020. URL: https://doi.org/10.1137/1.9781611975994.55.
  2. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn Graphs. In Proceedings of the 12th International Workshop on Algorithms in Bioinformatics (WABI), pages 225-235, 2012. URL: https://doi.org/10.1007/978-3-642-33122-0_18.
  3. Martin C. Carlisle and Errol L. Lloyd. On the k-coloring of intervals. Discrete Applied Mathematics, 59:225-235, 1995. URL: https://doi.org/10.1016/0166-218X(95)80003-M.
  4. Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(| E| ²+| V| ^5/2) time. In 2022 Data Compression Conference (DCC), pages 272-281, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00035.
  5. Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages. Part I, 2022. URL: https://doi.org/10.48550/ARXIV.2208.04931.
  6. Nicola Cotumaccio and Nicola Prezza. On Indexing and Compressing Finite Automata. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2585-2599, 2021. URL: https://doi.org/10.1137/1.9781611976465.153.
  7. Robert P. Dilworth. A Decomposition Theorem for Partially Ordered Sets. Annals of Mathematics, 51(1):161-166, 1950. URL: https://doi.org/10.2307/1969503.
  8. Ulrich Faigle and Willem M. Nawijn. Note on Scheduling Intervals on-line. Discrete Applied Mathematics, 58(1):13-17, 1995. URL: https://doi.org/10.1016/0166-218X(95)00112-5.
  9. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 184-196, 2005. URL: https://doi.org/10.1109/SFCS.2005.69.
  10. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 390-398, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  11. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67-78, 2017. URL: https://doi.org/10.1016/j.tcs.2017.06.016.
  12. Daniel Gibney and Sharma V. Thankachan. On the Hardness and Inapproximability of Recognizing Wheeler Graphs. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), pages 51:1-51:16, 2019. URL: https://doi.org/10.4230/LIPIcs.ESA.2019.51.
  13. Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. New Indices for Text: Pat Trees and Pat Arrays. In Information Retrieval: Data Structures & Algorithms, pages 66-82. Prentice Hall, 1992. Google Scholar
  14. Mitchel T. Keller and Wiliam T. Trotter. Applied Combinatorics. CreateSpace Independent Publishing Platform, 2017. URL: https://www.appliedcombinatorics.org/book/app-comb.html.
  15. Jon Kleinberg and Éva Tardos. Algorithm Design. Pearson/Addison-Wesley, 2006. Google Scholar
  16. Shimon Kogan and Merav Parter. Beating Matrix Multiplication for n ^1/3-Directed Shortcuts. In Prooceedings of 49th International Colloquium on Automata, Languages, and Programming (ICALP), pages 82:1-82:20, 2022. URL: https://doi.org/10.4230/LIPIcs.ICALP.2022.82.
  17. Ernesto Lowy-Gallego, Susan Fairley, Xiangqun Zheng-Bradley, Magali Ruffier, Laura Clarke, Paul Flicek, and The 1000 Genomes Project Consortium. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Research, 4(50), 2020. URL: https://doi.org/10.12688/wellcomeopenres.15126.2.
  18. Udi Manber and Gene Myers. Suffix Arrays: A New Method for on-Line String Searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 319-327, 1990. URL: https://doi.org/10.1137/0222058.
  19. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler transform. Theoretical Computer Science, 387(3):298-312, 2007. URL: https://doi.org/10.1016/j.tcs.2007.07.014.
  20. M. O. Rabin and D. Scott. Finite Automata and Their Decision Problems. IBM Journal on Research and Development, 3(2):114-125, 1959. URL: https://doi.org/10.1147/rd.32.0114.
  21. Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400-407, 2019. URL: https://doi.org/10.1093/bioinformatics/btz575.
  22. Peter Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail