Abstract
State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.
In this article, we propose a new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano. We describe a posting lists clustering algorithm tailored for our encoder and two methods for building the reference list for a cluster. Both approaches are heuristic and differ in the way postings are added to the reference list: according to their frequency in the cluster or according to the number of bits necessary for their representation.
The extensive experimental analysis indicates that significant space reductions are indeed possible, beating the best state-of-the-art encoders.
- Charu Aggarwal and Chandan Reddy. 2013. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC. Google ScholarCross Ref
- Vo Ngoc Anh and Alistair Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval Journal 8, 1 (2005), 151--166. Google ScholarDigital Library
- Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2 (2010), 131--147. Google ScholarDigital Library
- David Arthur and Sergei Vassilvitskii. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07). 1027--1035. Google ScholarDigital Library
- Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable K-means++. In Proceedings of the Very Large Database Endowment (PVLDB’12), Vol. 5. 622--633. Google ScholarDigital Library
- Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM International Conference on Information and Knowledge Management (CIKM’03). 426--434. Google ScholarDigital Library
- Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering. IEEE, 1360--1369. Google ScholarDigital Library
- Stefan Büttcher and Charles Clarke. 2007. Index compression is good, especially for random access. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM’07). 761--770. Google ScholarDigital Library
- Stefan Büttcher, Charles Clarke, and Gordon Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarDigital Library
- Surajit Chaudhuri, Kenneth Church, Arnd Christian König, and Liying Sui. 2007. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’07). 663--670. Google ScholarDigital Library
- David Clark. 1996. Compact Pat Trees. Ph.D. Dissertation. University of Waterloo. Google ScholarDigital Library
- Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, and Ning Zhang. 2013. Unicorn: A system for searching the social graph. In Proceedings of the Very Large Database Endowment (PVLDB’13), Vol. 6. 1150--1161. Google ScholarDigital Library
- Renaud Delbru, Stéphane Campinas, and Giovanni Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Journal of Web Semantics 10 (2012), 33--58. Google ScholarDigital Library
- Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining (KDD’16). Google ScholarDigital Library
- Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM) 21, 2 (1974), 246--260. Google ScholarDigital Library
- Robert Mario Fano. 1971. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).Google Scholar
- Paolo Ferragina, Igor Nitto, and Rossano Venturini. 2011. On optimally partitioning a text to improve its compression. Algorithmica 61, 1 (2011), 51--74.Google ScholarDigital Library
- Michael Garey and David Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company. Google ScholarDigital Library
- Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 370--379. Google ScholarDigital Library
- Hoang Thanh Lam, Raffaele Perego, Nguyen Thoi Minh Quan, and Fabrizio Silvestri. 2009. Entry pairing in inverted file. In Proceedings of the 10th International Conference Web Information Systems Engineering (WISE’09). 511--522. Google ScholarDigital Library
- Daniel Lemire and Leonid Boytsov. 2013. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2013), 1--29. Google ScholarDigital Library
- Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory (IT) 28 (1982), 129--137. Google ScholarDigital Library
- Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
- Alistair Moffat and Lang Stuiver. 2000. Binary interpolative coding for effective index compression. Information Retrieval Journal 3, 1 (2000), 25--47. Google ScholarDigital Library
- Giuseppe Ottaviano, Nicola Tonellotto, and Rossano Venturini. 2015. Optimal space-time tradeoffs for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM’15). 47--56. Google ScholarDigital Library
- Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’14). 273--282. Google ScholarDigital Library
- Dan Pelleg and Andrew Moore. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML’00). 727--734. Google ScholarDigital Library
- Stephen Robertson and Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 3 (1976), 129--146.Google ScholarCross Ref
- David Salomon. 2007. Variable-Length Codes for Data Compression. Springer. Google ScholarDigital Library
- Fabrizio Silvestri. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR’07). 101--112. Google ScholarDigital Library
- Fabrizio Silvestri and Rossano Venturini. 2010. VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM’10). 1219--1228. Google ScholarDigital Library
- Michael Steinbach, George Karypis, and Vipin Kumar. 2000. A comparison of document clustering techniques. In Proceedings of the 6th Annual Conference on Knowledge Discovery and Data Mining (KDD’00), Workshop on Text Mining. 109--111.Google Scholar
- Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM’11). 317--326. Google ScholarDigital Library
- Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM’13). 83--92. Google ScholarDigital Library
- Ian Witten, Alistair Moffat, and Timothy Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Morgan Kaufmann. Google ScholarDigital Library
- Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks (NN) 16, 3 (2005), 645--678. Google ScholarDigital Library
- Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). 401--410. Google ScholarDigital Library
- Zhaohua Zhang, Jiancong Tong, Haibing Huang, Jin Liang, Tianlong Li, Rebecca J. Stones, Gang Wang, and Xiaoguang Liu. 2016. Leveraging context-free grammar for efficient inverted index compression. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16). 275--284. Google ScholarDigital Library
- Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Computing Surveys (CSUR) 38, 2 (2006), 1--56. Google ScholarDigital Library
- Marcin Zukowski, Sándor Héman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 59--70. Google ScholarDigital Library
Index Terms
- Clustered Elias-Fano Indexes
Recommendations
A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index
WWW '19: The World Wide Web ConferenceSearch engines encounter a time vs. space trade-off: search responsiveness (i.e., a short query response time) comes at the cost of increased index storage. We propose a hybrid method which uses both (a) the recently published mapping-matrix-style index ...
Techniques for Inverted Index Compression
Invited Tutorial and Regular PapersThe data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance ...
Partitioned Elias-Fano indexes
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrievalThe Elias-Fano representation of monotone sequences has been recently applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is ...
Comments