skip to main content
research-article

Clustered Elias-Fano Indexes

Published:06 April 2017Publication History
Skip Abstract Section

Abstract

State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.

In this article, we propose a new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano. We describe a posting lists clustering algorithm tailored for our encoder and two methods for building the reference list for a cluster. Both approaches are heuristic and differ in the way postings are added to the reference list: according to their frequency in the cluster or according to the number of bits necessary for their representation.

The extensive experimental analysis indicates that significant space reductions are indeed possible, beating the best state-of-the-art encoders.

References

  1. Charu Aggarwal and Chandan Reddy. 2013. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC. Google ScholarGoogle ScholarCross RefCross Ref
  2. Vo Ngoc Anh and Alistair Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval Journal 8, 1 (2005), 151--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2 (2010), 131--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. David Arthur and Sergei Vassilvitskii. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07). 1027--1035. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable K-means++. In Proceedings of the Very Large Database Endowment (PVLDB’12), Vol. 5. 622--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM International Conference on Information and Knowledge Management (CIKM’03). 426--434. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 1360--1369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Stefan Büttcher and Charles Clarke. 2007. Index compression is good, especially for random access. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM’07). 761--770. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Stefan Büttcher, Charles Clarke, and Gordon Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Surajit Chaudhuri, Kenneth Church, Arnd Christian König, and Liying Sui. 2007. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’07). 663--670. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. David Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. University of Waterloo. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, and Ning Zhang. 2013. Unicorn: A system for searching the social graph. In Proceedings of the Very Large Database Endowment (PVLDB’13), Vol. 6. 1150--1161. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Renaud Delbru, Stéphane Campinas, and Giovanni Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Journal of Web Semantics 10 (2012), 33--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining (KDD’16). Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM) 21, 2 (1974), 246--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Robert Mario Fano. 1971. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).Google ScholarGoogle Scholar
  17. Paolo Ferragina, Igor Nitto, and Rossano Venturini. 2011. On optimally partitioning a text to improve its compression. Algorithmica 61, 1 (2011), 51--74.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Michael Garey and David Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 370--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Hoang Thanh Lam, Raffaele Perego, Nguyen Thoi Minh Quan, and Fabrizio Silvestri. 2009. Entry pairing in inverted file. In Proceedings of the 10th International Conference Web Information Systems Engineering (WISE’09). 511--522. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Daniel Lemire and Leonid Boytsov. 2013. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2013), 1--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory (IT) 28 (1982), 129--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Alistair Moffat and Lang Stuiver. 2000. Binary interpolative coding for effective index compression. Information Retrieval Journal 3, 1 (2000), 25--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. 2015. Optimal space-time tradeoffs for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM’15). 47--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’14). 273--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Dan Pelleg and Andrew Moore. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). 727--734. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Stephen Robertson and Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 3 (1976), 129--146.Google ScholarGoogle ScholarCross RefCross Ref
  29. David Salomon. 2007. Variable-Length Codes for Data Compression. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Fabrizio Silvestri. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR’07). 101--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Fabrizio Silvestri and Rossano Venturini. 2010. VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM’10). 1219--1228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Michael Steinbach, George Karypis, and Vipin Kumar. 2000. A comparison of document clustering techniques. In Proceedings of the 6th Annual Conference on Knowledge Discovery and Data Mining (KDD’00), Workshop on Text Mining. 109--111.Google ScholarGoogle Scholar
  33. Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM’11). 317--326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM’13). 83--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Ian Witten, Alistair Moffat, and Timothy Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks (NN) 16, 3 (2005), 645--678. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Zhaohua Zhang, Jiancong Tong, Haibing Huang, Jin Liang, Tianlong Li, Rebecca J. Stones, Gang Wang, and Xiaoguang Liu. 2016. Leveraging context-free grammar for efficient inverted index compression. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16). 275--284. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 1--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Marcin Zukowski, Sándor Héman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 59--70. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Clustered Elias-Fano Indexes

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Information Systems
          ACM Transactions on Information Systems  Volume 36, Issue 1
          January 2018
          334 pages
          ISSN:1046-8188
          EISSN:1558-2868
          DOI:10.1145/3077622
          Issue’s Table of Contents

          Copyright © 2017 ACM

          © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 April 2017
          • Accepted: 1 January 2017
          • Revised: 1 December 2016
          • Received: 1 June 2016
          Published in tois Volume 36, Issue 1

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader