Abstract
Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source.
Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation.
Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural language, we reduce the representation to compression levels never achieved before, allowing the indexing of billions of strings. Despite the significant space savings, our technique introduces only a negligible penalty at query time.
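The context-based encoding idea can be illustrated with a toy in-memory sketch (the corpus and variable names are hypothetical, and this omits the paper's compressed-trie machinery): each n-gram's last word is replaced by its rank among the words observed after its context, which keeps the encoded integers small.

```python
from collections import defaultdict

# Hypothetical corpus of 3-grams (context length k = 2).
trigrams = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "dog", "ran"),
    ("a", "cat", "sat"),
]

# Collect, for each context, the set of words that follow it.
successors = defaultdict(set)
for *context, word in trigrams:
    successors[tuple(context)].add(word)

# Map each successor word to its rank within its context.
ranks = {ctx: {w: i for i, w in enumerate(sorted(ws))}
         for ctx, ws in successors.items()}

# Encode each n-gram's last word as its rank among the words
# following its context: values stay small because a natural-language
# context is typically followed by few distinct words.
codes = [ranks[tuple(ctx)][w] for *ctx, w in trigrams]
print(codes)  # [1, 0, 0, 0]
```

Since the encoded values are bounded by the (small) number of distinct successors of each context rather than by the vocabulary size, they can be stored in far fewer bits per word.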
Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not use less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space.
Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk.
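For reference, interpolated Kneser-Ney smoothing (of which the modified variant uses count-dependent discounts) estimates the probability of a word $w$ after a history $h$ as

\[
\hat{P}(w \mid h) \;=\; \frac{\max\{\,c(hw) - D(c(hw)),\, 0\,\}}{c(h)} \;+\; \lambda(h)\, \hat{P}(w \mid h'),
\]

where $c(\cdot)$ is an occurrence count, $h'$ is $h$ with its leftmost word dropped, $D(\cdot)$ is a discount (in the modified variant, one of three values depending on whether the count is 1, 2, or at least 3), and $\lambda(h)$ is the normalization weight that redistributes the discounted mass to the lower-order estimate. Computing these statistics for every order is what drives the sorting passes over the extracted n-grams.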
The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. In an extensive experimental analysis performed on billions of n-grams, we measure an average speedup of 4.5 times over the total runtime of the previous approach.
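As a toy in-memory illustration (not the paper's external-memory algorithm), the sketch below shows why a single sort is useful: sorting n-grams in context order groups every context's successors contiguously, so one scan suffices to gather, per context, both the total count and the number of distinct successors that Kneser-Ney discounting needs. The corpus and names are hypothetical.

```python
from itertools import groupby

# Toy 3-gram occurrences (unsorted), each a tuple of words.
ngrams = [
    ("the", "cat", "sat"), ("the", "cat", "sat"),
    ("the", "cat", "ran"), ("the", "dog", "ran"),
]

# One sort puts all occurrences sharing a context next to each other.
ngrams.sort()

# A single sequential scan then computes per-context statistics.
stats = {}
for ctx, group in groupby(ngrams, key=lambda g: g[:-1]):
    words = [g[-1] for g in group]
    stats[ctx] = {"count": len(words), "distinct": len(set(words))}

print(stats[("the", "cat")])  # {'count': 3, 'distinct': 2}
```

In an external-memory setting, each extra sorting step is a full read-write pass over the (potentially multi-terabyte) n-gram stream, which is why reducing three sorts to one translates into a large runtime improvement.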
Index Terms
- Handling Massive N-Gram Datasets Efficiently