Information Systems

Volume 83, July 2019, Pages 181-194

On the reproducibility of experiments of indexing repetitive document collections

https://doi.org/10.1016/j.is.2019.03.007

Highlights

  • We summarize the original results and motivate the proposed experimental setup.

  • We explain the replication framework, including datasets, query patterns, source code and scripts.

  • We detail all configuration parameters for each solution, explaining the best-performing configurations.

  • We host the framework on GitHub and publish it through Mendeley Data.

Abstract

This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work (Claude et al., 2016). In that parent paper, we proposed many and varied techniques for compressing indexes that exploit the fact that highly repetitive collections consist mostly of documents that are near-copies of others. More concretely, we describe here a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.

Introduction

Scientific publications remain the most common way of disseminating scientific advances. Focusing on Computer Science, a scientific publication (paper) (i) presents new algorithms or techniques for addressing an open challenge, and (ii) reports experimental numbers that evaluate these new approaches against a particular baseline. Thus, empirical evaluation is the strongest evidence of the achievements reported in a paper, and its corresponding setup is the only way the scientific community has to assess and (possibly) reuse such research proposals.

The ideal situation arises when a paper exposes enough information to make its research easily reproducible. According to the definition proposed by the Association for Computing Machinery (ACM) [1], an experiment is reproducible when an independent group of researchers can obtain the same results using artifacts which they develop independently. A weaker form of reproducibility is replicability. In this case, the original setup is reused, so an experiment is replicable when an independent group of researchers can obtain the same results using the authors' own artifacts. Finally, the ACM document [1] also defines the concept of repeatability. An experiment is repeatable if the original group of researchers is able to obtain the same results by reusing their original setup (including the same measurement procedures and the same system, under the same operating conditions). Non-repeatable results do not provide solid insights into any scientific question, so they are not suitable for publication. Thus, we can accept that research in Computer Science must be, at least, repeatable.

Although reproducibility is the ultimate goal, achieving it is not always easy because it requires a new experimental setup to be reconstructed from scratch. Implementing new algorithms, data structures, or other computational artifacts can be extremely complex, and tuning them may require setting many parameters which, in turn, can depend on particular system configurations or external tools. Besides, it may be necessary to use programs in the exact versions used originally [2]. As a consequence, reproducibility is time-consuming, error-prone, and sometimes just infeasible [3].

In practice, publishing replicable research would be an important step forward in Computer Science. According to Collberg et al. [4], replicating a computational experiment only requires that the source code and test data be available. Although this is a simple precondition, many papers do not meet it. The aforementioned paper [4] provides an interesting study involving 601 papers from conferences and journals. It shows that only 32.3% of the papers provide enough information and resources to build their executables in less than 30 min. Note that external dependencies must be accessible to compile these sources. Regarding test data, Vandewalle et al. [5] report that more papers make their data available, but mainly because some of them use standard corpora in their experimental setups. However, the problem goes further. Once the code is compiled, proper parameter settings must be chosen, and this information is sometimes hard to find in the papers. In conclusion, reproducible research is currently challenging.

The situation is not very different in Information Retrieval (IR), the field most closely related to the research proposed in our work. IR is a broad area of Computer Science focused primarily on providing users with easy access to information of their interest [6]. As an empirical discipline, advances in IR research are built on experimental validation of algorithms and techniques [7], but reproducing IR experiments is extremely challenging, even when they are very well documented [8]. A recent Dagstuhl Seminar about the reproducibility of data-oriented experiments in e-Science [9] concludes that IR systems are complex and depend on many external resources that cannot be encapsulated with the system. Besides, it reports that many data collections are private, and that their size can be an obstacle even to obtaining them. Finally, it notes that some experimental assumptions are often hidden, making reproducibility a less achievable goal.

Following the initiative of some previous papers [10], [11], the aim of this work is to propose a detailed experimental setup that replicates the methods, experiments, and results on indexing repetitive document collections discussed in a previous work [12] (we will refer to it as the parent paper). This experimental setup focuses on two dimensions: (i) the space used to preserve and manage the document collection, and (ii) the time needed to query these data.
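To make these two dimensions concrete, the following minimal sketch (illustrative only: it is neither one of the parent paper's indexes nor part of uiHRDC, and every name and parameter in it is invented for the example) builds a plain, uncompressed non-positional inverted index over a few synthetic near-duplicate documents and reports the space it occupies together with the average time needed to answer a conjunctive query.

```python
# Toy illustration of the two evaluation dimensions (space and query time).
# This is NOT one of the parent paper's indexing techniques: it is a plain,
# uncompressed non-positional inverted index over synthetic near-duplicate
# documents, measured with Python standard-library tools only.
import pickle
import random
import time
from collections import defaultdict

random.seed(0)
vocab = [f"term{i}" for i in range(2000)]
base = random.choices(vocab, k=2000)          # the "original" document

docs = []
for _ in range(200):                          # 200 near-identical versions
    doc = list(base)
    doc[random.randrange(len(doc))] = random.choice(vocab)   # one-word edit
    docs.append(doc)

# Build the index: term -> sorted list of document identifiers.
index = defaultdict(list)
for doc_id, doc in enumerate(docs):
    for term in sorted(set(doc)):
        index[term].append(doc_id)

# Space dimension: size of the serialized index (a crude but reproducible proxy).
space_bytes = len(pickle.dumps(dict(index)))
postings = sum(len(p) for p in index.values())
print(f"index space: {space_bytes / 1024:.1f} KiB "
      f"({len(index)} terms, {postings} postings)")

# Time dimension: average time to answer a conjunctive (AND) query of two terms.
qa, qb = base[0], base[1]
runs = 1000
t0 = time.perf_counter()
for _ in range(runs):
    hits = sorted(set(index[qa]) & set(index[qb]))
t1 = time.perf_counter()
print(f"avg AND-query time: {(t1 - t0) / runs * 1e6:.1f} microseconds ({len(hits)} hits)")
```

The parent paper addresses precisely this tradeoff: shrinking the space term by exploiting the repetitiveness of the collection while keeping query times competitive.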

The rest of the paper is organized as follows. In Section 2, we briefly describe all the inverted-index based variants and the self-indexes evaluated in the parent paper. We also explain the most relevant space/time tradeoffs reported from our experiments. The following sections are devoted to the reproducibility of these experiments. Section 3 details the fundamentals of uiHRDC, our replication framework, which comprises (i) the source code of all our indexing approaches, (ii) some test data collections, and (iii) the set of scripts that run the main tasks of our experimental setup and generate a final report with the main figures of the parent paper. We focus both on the workflow that guides the replication of all our experiments and on the structure of the uiHRDC framework. Section 4 describes the Docker container that allows this workflow to be easily reproduced in a tuned environment. Some brief conclusions are presented in Section 5, motivating the need to improve research reproducibility in Computer Science. Finally, Appendix A includes the actual compression ratios obtained for each technique (which complement the results shown in the figures throughout the paper), and Appendix B explains how some of our self-indexes can be reused to deal with universal (not document-oriented) repetitive collections.

Section snippets

Background

Indexing highly repetitive collections has gained relevance because huge corpora of versioned documents are increasingly being consolidated. Wikipedia is the most recognizable example, exposing millions of articles which evolve to provide the best picture of the reality around us. Each Wikipedia article comprises one version per document edition, and most of them are near-duplicates of others. Apart from versioned document collections, other applications perform on …
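The premise that near-duplicate versions are highly redundant can be illustrated with a minimal sketch (again illustrative only, and unrelated to the parent paper's techniques): when the versions of a document are compressed jointly, so that a long-range compressor can exploit the repetition across them, almost all of the collection's size disappears.

```python
# Minimal sketch (illustrative only, not part of uiHRDC): why versioned,
# near-duplicate documents are so redundant. We simulate 50 "editions" of a
# document, each differing from the previous one by a single word, and compare
# compressing each version on its own against compressing them all together.
import lzma
import random

random.seed(42)
vocab = [f"word{i}" for i in range(1000)]
versions = [" ".join(random.choices(vocab, k=5000))]   # the original document

for _ in range(49):                                    # 49 further editions
    tokens = versions[-1].split()
    tokens[random.randrange(len(tokens))] = random.choice(vocab)  # one edit
    versions.append(" ".join(tokens))

collection = "\n".join(versions).encode()
per_version = sum(len(lzma.compress(v.encode())) for v in versions)
joint = len(lzma.compress(collection))

print(f"raw collection size:         {len(collection):>9,} bytes")
print(f"versions compressed alone:   {per_version:>9,} bytes")
print(f"versions compressed jointly: {joint:>9,} bytes")
# Compressed jointly, the collection is barely larger than a single compressed
# version: this cross-version redundancy is what repetition-aware indexes exploit.
```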

The uiHRDC framework

Our experimental framework is named uiHRDC (universal indexes for Highly Repetitive Document Collections). It is licensed under the GNU Lesser General Public License v2.1 (GNU LGPL), is hosted in a GitHub repository, and is also published through Mendeley Data [34]. It includes all the required elements to reproduce the main experiments in our original paper, including datasets, query patterns, source code, and scripts.

The general structure of the uiHRDC …

Deploying the experimental setup with Docker

For those readers interested in reproducing our experiments on their own machines, we provide a Docker environment that will allow them to:

  • 1. Reproduce our test framework. We create a Docker image based on Ubuntu 14.04 (ubuntu:trusty) that includes all the libraries and software required to compile/run our indexing alternatives and also to build the final report. These include packages installed via apt, such as gcc-multilib, g++-multilib, cmake, libboost-all-dev, p7zip-full, openssh-server, …

Conclusions and future work

We have briefly described all the techniques used in our original paper. In total, we had twenty-two variants of posting-list representations available, which were used to create both non-positional and positional inverted indexes. In addition, we had six self-indexing techniques. Since those techniques have their own configuration parameters, and in some cases depend on external libraries/software, it would be hard to replicate the results and to reuse our techniques by simply reading our original …

Revision comments

We would like to thank the authors for providing this valuable reproducibility platform, which allows both reproducing the results of its parent paper and encouraging the evaluation of new indexes for highly repetitive document collections in an easy and systematic way. Undoubtedly, this will be a topic of active research in the coming years given, for example, the decreasing cost of gene-sequencing technology. Hence, the public availability of this type of benchmarking platform is becoming …

Acknowledgments

This paper is funded in part by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 690941 (project BIRDS). Antonio Fariña is funded by Xunta de Galicia/FEDER-UE, Spain [CSI: ED431G/01 and GRC: ED431C 2017/58]; by MINECO-AEI/FEDER-UE, Spain [ETOME-RDFD3: TIN2015-69951-R]; and by MINECO-CDTI/FEDER-UE, Spain [INNTERCONECTA: uForest ITC-20161074]. Miguel A. Martínez-Prieto is funded by MINECO-AEI/FEDER-UE, Spain [Datos 4.0: …

References (35)

  • R.A. Baeza-Yates, et al., Modern Information Retrieval (2011)

  • J. Lin, et al., Toward reproducible baselines: The open-source IR reproducibility challenge

  • N. Ferro, et al., Rank-biased precision reloaded: Reproducibility and generalization

  • N. Ferro, et al., Increasing reproducibility in IR: Findings from the Dagstuhl seminar on “Reproducibility of data-oriented experiments in e-Science”, ACM SIGIR Forum (2016)

  • G. Navarro, Compact Data Structures: A Practical Approach (2016)

  • A. Trotman, Compression, SIMD, and postings lists

  • G. Ottaviano, et al., Partitioned Elias-Fano indexes

