
Handling Massive N-Gram Datasets Efficiently

Published: 11 February 2019

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source.

Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation.

Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of the representation to compression levels never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time.
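To make the remapping concrete, the following is a minimal Python sketch for bigrams (context length k = 1); the names (remap_bigrams, successors) are ours for illustration, and a real index would store the remapped identifiers with a compressed integer code rather than a Python list.

```python
from collections import defaultdict

# Illustrative sketch (not the paper's implementation): map the last word
# of each bigram to its rank among the distinct words observed after the
# same 1-word context. The resulting integers are bounded by the number of
# successors of the context, which is typically tiny in natural language,
# so they compress well.

def remap_bigrams(bigrams):
    successors = defaultdict(set)
    for ctx, word in bigrams:
        successors[ctx].add(word)
    # Assign ranks 0, 1, 2, ... to the sorted successors of each context.
    ranks = {ctx: {w: r for r, w in enumerate(sorted(ws))}
             for ctx, ws in successors.items()}
    return [(ctx, ranks[ctx][word]) for ctx, word in bigrams]

bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
print(remap_bigrams(bigrams))
# [('the', 0), ('the', 1), ('a', 0), ('the', 0)]
```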

Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space.

Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk.
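For context, interpolated modified Kneser-Ney (in the formulation of Chen and Goodman) estimates the probability of a word given its preceding context by discounting the raw count and recursively backing off to a lower-order model:

```latex
\hat{p}\left(w_i \mid w_{i-n+1}^{i-1}\right) =
  \frac{\max\left(c\left(w_{i-n+1}^{i}\right) - D\left(c\left(w_{i-n+1}^{i}\right)\right),\, 0\right)}
       {\sum_{w} c\left(w_{i-n+1}^{i-1}\, w\right)}
  + \gamma\left(w_{i-n+1}^{i-1}\right)\,
    \hat{p}\left(w_i \mid w_{i-n+2}^{i-1}\right)
```

Here D(c) is one of three discounts D_1, D_2, or D_{3+}, chosen according to whether the count equals 1, 2, or exceeds 2 (whence "modified"), gamma is the back-off weight that makes the distribution sum to one, and the lower-order models are estimated from continuation counts rather than raw counts.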

The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times in total runtime over the previous approach.
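As a rough illustration of the single-sort pattern, the sketch below counts n-grams from sorted runs with one multi-way merge; it conveys the generic external-memory idea only, assuming sorted runs already written out in a prior scan, and is not the authors' actual construction.

```python
import heapq
from itertools import groupby

# Illustrative sketch: each run is an iterable of n-gram tuples that is
# already sorted (in practice, a sorted block spilled to disk). A single
# k-way merge then suffices to bring equal n-grams together and
# aggregate their counts.

def count_ngrams(runs):
    merged = heapq.merge(*runs)   # the only sorting-like step
    for gram, group in groupby(merged):
        yield gram, sum(1 for _ in group)

run1 = sorted([("a", "b"), ("a", "c"), ("a", "b")])
run2 = sorted([("a", "b"), ("b", "c")])
print(list(count_ngrams([run1, run2])))
# [(('a', 'b'), 3), (('a', 'c'), 1), (('b', 'c'), 1)]
```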

Published in

ACM Transactions on Information Systems, Volume 37, Issue 2 (April 2019), 410 pages.
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3306215

          Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 11 February 2019
• Accepted: 1 December 2018
• Revised: 1 November 2018
• Received: 1 June 2018
