
Handling Massive N-Gram Datasets Efficiently

Published: 11 February 2019

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source.

Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation.

Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of the representation to compression levels never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time.
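To make the remapping concrete, the following is a minimal Python sketch for bigrams (context length k = 1); the names (remap_bigrams, successors) are ours for illustration, and a real index would store the remapped identifiers with a compressed integer code rather than a Python list.

```python
from collections import defaultdict

# Illustrative sketch (not the paper's implementation): map the last word
# of each bigram to its rank among the distinct words observed after the
# same 1-word context. The resulting integers are bounded by the number of
# successors of the context, which is typically tiny in natural language,
# so they compress well.

def remap_bigrams(bigrams):
    successors = defaultdict(set)
    for ctx, word in bigrams:
        successors[ctx].add(word)
    # Assign ranks 0, 1, 2, ... to the sorted successors of each context.
    ranks = {ctx: {w: r for r, w in enumerate(sorted(ws))}
             for ctx, ws in successors.items()}
    return [(ctx, ranks[ctx][word]) for ctx, word in bigrams]

bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
print(remap_bigrams(bigrams))
# [('the', 0), ('the', 1), ('a', 0), ('the', 0)]
```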

Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space.

Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk.
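For context, interpolated modified Kneser-Ney (in the formulation of Chen and Goodman) estimates the probability of a word given its preceding context by discounting the raw count and recursively backing off to a lower-order model:

```latex
\hat{p}\left(w_i \mid w_{i-n+1}^{i-1}\right) =
  \frac{\max\left(c\left(w_{i-n+1}^{i}\right) - D\left(c\left(w_{i-n+1}^{i}\right)\right),\, 0\right)}
       {\sum_{w} c\left(w_{i-n+1}^{i-1}\, w\right)}
  + \gamma\left(w_{i-n+1}^{i-1}\right)\,
    \hat{p}\left(w_i \mid w_{i-n+2}^{i-1}\right)
```

Here D(c) is one of three discounts D_1, D_2, or D_{3+}, chosen according to whether the count equals 1, 2, or exceeds 2 (whence "modified"), gamma is the back-off weight that makes the distribution sum to one, and the lower-order models are estimated from continuation counts rather than raw counts.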

The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times in total runtime over the previous approach.
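As a rough illustration of the single-sort pattern, the sketch below counts n-grams from sorted runs with one multi-way merge; it conveys the generic external-memory idea only, assuming sorted runs already written out in a prior scan, and is not the authors' actual construction.

```python
import heapq
from itertools import groupby

# Illustrative sketch: each run is an iterable of n-gram tuples that is
# already sorted (in practice, a sorted block spilled to disk). A single
# k-way merge then suffices to bring equal n-grams together and
# aggregate their counts.

def count_ngrams(runs):
    merged = heapq.merge(*runs)   # the only sorting-like step
    for gram, group in groupby(merged):
        yield gram, sum(1 for _ in group)

run1 = sorted([("a", "b"), ("a", "c"), ("a", "b")])
run2 = sorted([("a", "b"), ("b", "c")])
print(list(count_ngrams([run1, run2])))
# [(('a', 'b'), 3), (('a', 'c'), 1), (('b', 'c'), 1)]
```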

Published in

ACM Transactions on Information Systems, Volume 37, Issue 2 (April 2019), 410 pages.
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3306215

          Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 11 February 2019
• Accepted: 1 December 2018
• Revised: 1 November 2018
• Received: 1 June 2018
