research-article

Clustered Elias-Fano Indexes

Authors:
Giulio Ermanno Pibiri

University of Pisa

University of Pisa
View Profile

,
Rossano Venturini

University of Pisa

University of Pisa
View Profile

Authors Info & Claims

ACM Transactions on Information Systems Volume 36 Issue 1Article No.: 2pp 1–33https://doi.org/10.1145/3052773

Published:06 April 2017Publication History

ACM Transactions on Information Systems

Abstract

State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.

In this article, we propose a new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano. We describe a posting lists clustering algorithm tailored for our encoder and two methods for building the reference list for a cluster. Both approaches are heuristic and differ in the way postings are added to the reference list: according to their frequency in the cluster or according to the number of bits necessary for their representation.

The extensive experimental analysis indicates that significant space reductions are indeed possible, beating the best state-of-the-art encoders.

References

Charu Aggarwal and Chandan Reddy. 2013. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC. Google ScholarCross Ref
Vo Ngoc Anh and Alistair Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval Journal 8, 1 (2005), 151--166. Google ScholarDigital Library
Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2 (2010), 131--147. Google ScholarDigital Library
David Arthur and Sergei Vassilvitskii. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07). 1027--1035. Google ScholarDigital Library
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable K-means++. In Proceedings of the Very Large Database Endowment (PVLDB’12), Vol. 5. 622--633. Google ScholarDigital Library
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM International Conference on Information and Knowledge Management (CIKM’03). 426--434. Google ScholarDigital Library
Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 1360--1369. Google ScholarDigital Library
Stefan Büttcher and Charles Clarke. 2007. Index compression is good, especially for random access. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM’07). 761--770. Google ScholarDigital Library
Stefan Büttcher, Charles Clarke, and Gordon Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarDigital Library
Surajit Chaudhuri, Kenneth Church, Arnd Christian König, and Liying Sui. 2007. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’07). 663--670. Google ScholarDigital Library
David Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. University of Waterloo. Google ScholarDigital Library
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, and Ning Zhang. 2013. Unicorn: A system for searching the social graph. In Proceedings of the Very Large Database Endowment (PVLDB’13), Vol. 6. 1150--1161. Google ScholarDigital Library
Renaud Delbru, Stéphane Campinas, and Giovanni Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Journal of Web Semantics 10 (2012), 33--58. Google ScholarDigital Library
Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining (KDD’16). Google ScholarDigital Library
Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM) 21, 2 (1974), 246--260. Google ScholarDigital Library
Robert Mario Fano. 1971. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).Google Scholar
Paolo Ferragina, Igor Nitto, and Rossano Venturini. 2011. On optimally partitioning a text to improve its compression. Algorithmica 61, 1 (2011), 51--74.Google ScholarDigital Library
Michael Garey and David Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company. Google ScholarDigital Library
Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 370--379. Google ScholarDigital Library
Hoang Thanh Lam, Raffaele Perego, Nguyen Thoi Minh Quan, and Fabrizio Silvestri. 2009. Entry pairing in inverted file. In Proceedings of the 10th International Conference Web Information Systems Engineering (WISE’09). 511--522. Google ScholarDigital Library
Daniel Lemire and Leonid Boytsov. 2013. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2013), 1--29. Google ScholarDigital Library
Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory (IT) 28 (1982), 129--137. Google ScholarDigital Library
Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
Alistair Moffat and Lang Stuiver. 2000. Binary interpolative coding for effective index compression. Information Retrieval Journal 3, 1 (2000), 25--47. Google ScholarDigital Library
Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. 2015. Optimal space-time tradeoffs for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM’15). 47--56. Google ScholarDigital Library
Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’14). 273--282. Google ScholarDigital Library
Dan Pelleg and Andrew Moore. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). 727--734. Google ScholarDigital Library
Stephen Robertson and Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 3 (1976), 129--146.Google ScholarCross Ref
David Salomon. 2007. Variable-Length Codes for Data Compression. Springer. Google ScholarDigital Library
Fabrizio Silvestri. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR’07). 101--112. Google ScholarDigital Library
Fabrizio Silvestri and Rossano Venturini. 2010. VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM’10). 1219--1228. Google ScholarDigital Library
Michael Steinbach, George Karypis, and Vipin Kumar. 2000. A comparison of document clustering techniques. In Proceedings of the 6th Annual Conference on Knowledge Discovery and Data Mining (KDD’00), Workshop on Text Mining. 109--111.Google Scholar
Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM’11). 317--326. Google ScholarDigital Library
Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM’13). 83--92. Google ScholarDigital Library
Ian Witten, Alistair Moffat, and Timothy Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kaufmann. Google ScholarDigital Library
Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks (NN) 16, 3 (2005), 645--678. Google ScholarDigital Library
Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). 401--410. Google ScholarDigital Library
Zhaohua Zhang, Jiancong Tong, Haibing Huang, Jin Liang, Tianlong Li, Rebecca J. Stones, Gang Wang, and Xiaoguang Liu. 2016. Leveraging context-free grammar for efficient inverted index compression. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16). 275--284. Google ScholarDigital Library
Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 1--56. Google ScholarDigital Library
Marcin Zukowski, Sándor Héman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 59--70. Google ScholarDigital Library

Index Terms

Clustered Elias-Fano Indexes
1. Information systems

Recommendations

A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index
WWW '19: The World Wide Web Conference

Search engines encounter a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method which uses both (a) the recently published mapping-matrix-style index ...
Read More
Techniques for Inverted Index Compression
Invited Tutorial and Regular Papers

The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance ...
Read More
Partitioned Elias-Fano indexes
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

The Elias-Fano representation of monotone sequences has been recently applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Information Systems Volume 36, Issue 1
January 2018
334 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3077622
Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands
Issue’s Table of Contents
Copyright © 2017 ACM
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 April 2017
- Accepted: 1 January 2017
- Revised: 1 December 2016
- Received: 1 June 2016
Published in tois Volume 36, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Elias-Fano encoding
inverted indexes
performance
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 420
  Total Downloads
- Downloads (Last 12 months)21
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Clustered Elias-Fano Indexes

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index

Techniques for Inverted Index Compression

Partitioned Elias-Fano indexes