1 Introduction

Information retrieval (IR) systems are nowadays challenged with increasingly complex search tasks where information about how users interact with the system plays a central role in adapting them to the needs and interests of users. Many research efforts have focused on enhancing user engagement and retrieval effectiveness by exploiting information about user-system interactions available, for example, from the query logs of Web search engines (Silvestri 2009; Lucchese et al. 2013; Mehrotra et al. 2018). The ability to interpret and learn from the complex and noisy traces of user-system interactions is fundamental for advancing IR. The number of clicks on a given query-result pair, the click-through rate (CTR), and the dwell time are examples of actionable information to improve various aspects of IR systems (Chuklin et al. 2015). In the context of learning to rank (LtR) (Liu 2009), user actions recorded in query logs are used to extract several important features (Agichtein et al. 2006), with proper adjustments in order to remove the position bias (Joachims and Radlinski 2007; Joachims et al. 2017); the same information can be used to model and predict user interests (White et al. 2009).

In this paper, we build on our previous work which explored how to exploit user interaction for LtR (Ferro et al. 2017) and how to apply curriculum learning (CL) and continuation methods (CM) to LtR (Ferro et al. 2018). We investigate here whether jointly exploiting these two techniques, i.e., the explicit knowledge of the underlying user-interaction model and the possibility of targeting different objective functions, makes it possible to further improve the quality of the learned ranking model as measured by different metrics. In particular, our investigation focuses on improving LambdaMart, a state-of-the-art LtR algorithm (Burges 2010; Wu et al. 2010), and tries to answer the following research question: “how can user interaction and continuation methods be jointly exploited to improve LtR effectiveness?”.

The user interaction model is integrated directly in the discount component of the loss function used by LambdaMart at training time. Specifically, we model the user dynamics in scanning a ranked result list with Markov chains trained on query log data and we modify the loss function to embed this trained Markov chain. Moreover, since different loss functions can capture different relevance signals hidden in the features modeling queries and documents, we instantiate the training process as a continuous learning path across different loss functions, starting from easier functions accounting for retrieval quality, e.g., recall or mean square error (MSE), up to the most complex one embedding our Markovian user model. Learning through a curriculum is an interesting idea, borrowed from cognitive science, according to which a complex learning process is designed as a multi-step training path (Bengio et al. 2009). Initially, the learning algorithm is trained over simple training examples and smooth loss functions, and then it is progressively fine-tuned so as to deal with examples and loss functions of increasing complexity. Continuation methods (CM) (Coleman and Wu 1996; Allgower and Georg 1980) specifically refer to approaches that, instead of optimizing difficult objective functions, i.e., objective functions with many local minima, transform the original function into a class of smoother or easier-to-minimize functions. The intuition behind this approach is that a smooth version of the target objective function can quickly and effectively drive the learning algorithm to a promising area of the search space that possibly includes the global optimum.

The contributions of the paper are the following:

  • modelling user dynamics into LambdaMart: instead of proposing new features to account for this particular aspect of user behavior and then training the LtR model on this extended set of features, we model the user dynamics in scanning a ranked result list with Markov chains trained on query log data. Moreover, we define the normalized Markov cumulated gain (nMCG) measure which embeds this trained Markov chain, and we modify the LambdaMart objective function to rely on nMCG instead of the usual normalized discounted cumulated gain (nDCG) (Järvelin and Kekäläinen 2002);

  • user interaction-based curriculum learning for LambdaMart: designing a curriculum of objective functions of increasing complexity has been shown to be a promising research direction and, therefore, we extend it by exploring whether nMCG, i.e. a user interaction-based objective function, can provide further gain when building a curriculum based on CM;

  • extensive experimentation: we rely on the publicly available MSLR-WEB10K and MSLR-WEB30K datasets (Qin and Liu 2013) to conduct an extensive validation of the proposed approaches and derive key insights about them.

The paper is organized as follows: Sect. 2 introduces the related work; Sect. 3 presents our methodology; Sect. 4 describes the experimental setup and discusses the experimental findings; finally, Sect. 5 draws some conclusions and outlines future work.

2 Related work

2.1 User dynamics for LtR

LtR datasets already contain some user-related features. For example, the MSLR-WEB30K and MSLR-WEB10K datasets provided by Microsoft (Qin and Liu 2013) contain the following user-related features: query-url click count, i.e. the click count of a query-url pair at a search engine in a period; url click count, i.e. the click count of a URL aggregated from user browsing data in a period; and url dwell time, i.e. the average dwell time of a URL aggregated from user browsing data in a period. As empirical evidence of the importance of the user interaction features, we trained a LambdaMart (Burges 2010; Wu et al. 2010) model on the MSLR-WEB10K LtR dataset with and without these user-interaction features: the nDCG measured on the test set drops from 0.4636 to 0.4410 when such features are removed.

Jiang et al. (2013) provide a good overview of how search logs and user click data can be exploited to improve search in many respects, among which rankings of documents. Joachims et al. (2005) explored the use of clickthrough data as user preferences among documents and implicit feedback, while Joachims (2002) trained a support vector machine (SVM) ranker using such data. Dou et al. (2008) followed up by comparing the performance of RankNet (Burges et al. 2005) trained on aggregated clickthrough data with respect to using human relevance judgments.

Click models (Craswell et al. 2008; Chuklin et al. 2015) are another active area of research aimed at training models able to predict the probability of clicking on a document for a given query. Click models have not been directly exploited within LtR algorithms, but Chapelle and Zhang (2009) proposed to use them to predict relevance labels. Then, Chapelle and Zhang have shown that, when these predicted relevance labels are used to train the boosting algorithm QBRank (Zheng et al. 2007), they give just a 4% performance loss, while when used together with editorial judgements they produce a 2% performance gain. Recent works, e.g. by Zoghi et al. (2017), explore how to best exploit click models in online LtR.

Online LtR exploits click logs and user interaction data to infer preferences between rankers in order to make online LtR faster (Hofmann et al. 2013; Schuth et al. 2016; Wang et al. 2018). As another example, Lerot (Schuth et al. 2013) implements an online LtR algorithm which uses clicks as feedback for interleaving methods.

To the best of our knowledge, the integration of the user dynamics directly within the objective function of an offline LtR algorithm is novel and our previous work (Ferro et al. 2017) was the first to explore it.

2.2 Curriculum learning for LtR

Continuation methods (CM) have been shown to be effective in the optimization of complex objective functions (Coleman and Wu 1996; Allgower and Georg 1980). When the target objective function has many local minima, its direct optimization may lead to a “bad” sub-optimal result. In continuation methods, a multi-step optimization process is thus followed. At each step a different smooth function is optimized, i.e. a function easy to minimize that approximates the desired target function. The complexity and difficulty of the chosen optimization function are increased until the original target function is finally used. The intuition behind this approach is that the smooth versions of the original target function can provide a global representation of the search space, which highlights the regions where the best local optima and the global optima are located.

Curriculum learning (CL) can be seen as a particular instantiation of continuation methods (Bengio et al. 2009). The basic idea is to organize the training examples in such a way that the easiest training examples are presented first and the complexity of the following ones is gradually increased. This strategy guides the training and allows the learner to exploit previously seen concepts to ease the acquisition of subsequent, more difficult concepts. Thus, curriculum learning can be seen as a process exploiting a sequence of training criteria. Each training criterion corresponds to a different set of examples that can be differently weighted based on their complexity. At the subsequent steps, slightly more difficult examples are assigned new weights. This is different from boosting approaches, commonly adopted also in LtR: in boosting the instance weights are determined during the training according to the mis-classification risk, while in curriculum learning weights are predetermined according to a training schedule.

Both approaches have been used successfully in several fields. Continuation methods are applied when the function to optimize is not convex, as for example in non-linear optimization problems (Chen and Xiu 1999) in computational chemistry (Coleman and Wu 1996; Moré and Wu 1997), computational physics (Rabani et al. 2002) and automatic control (Nagamune 2003). In machine learning (ML), continuation methods are used with semi-supervised SVM, showing that this approach leads to lower test errors (Chapelle et al. 2006). CL is often applied with neural networks (NNs) and deep NNs, as for example in Collobert et al. (2011), Hu et al. (2014), where CL is used for natural language processing (NLP), or in Chen and Gupta (2015), where CL is exploited to learn representations of images. In Bengio et al. (2009), curriculum learning is exploited to train a language model (not used for ranking) with a deep neural network. In Qu et al. (2018) a curriculum learning approach is applied to sort the training data to ease the learning of node representations in a heterogeneous star network.

To the best of our knowledge, our previous work (Ferro et al. 2018) has been the first to explore how CM and CL methods can be exploited in combination with LtR algorithms. We found that CM is more effective than CL on this task and this is why in this paper we rely only on CM.

3 Methodology

A LtR algorithm exploits a ground-truth set of training examples in order to learn a document scoring function \(\sigma \) (Liu 2009). Such a training set is composed of a large collection of queries \({\mathcal {Q}}\), where each query \(q\in {\mathcal Q}\) is associated with a set of assessed documents \(D = \{d_{0}, d_{1}, \ldots \}\). Each document \(d_i\) is labeled by a relevance judgment \(l_{i}\) according to its relevance to the query q. These labels induce a partial ordering over the assessed documents, thus defining an ideal ranking which the LtR algorithm aims at approximating. Each query-document pair \((q,d_i)\) is represented by a vector of features x, able to describe the query (e.g., its length), the document (e.g., the in-link count) and their relationship (e.g., the number of occurrences of each query term in the document).

According to the above notation, a LtR algorithm produces a function \(\sigma (x)\) predicting a relevance score for the input feature vector x. Such function \(\sigma (x)\) is finally used online to compute a score for documents matching a query and ranking them accordingly.

Since IR measures are not differentiable, their optimization is very challenging. To address this issue, the state-of-the-art solution is the LambdaRank gradient approximation (Burges et al. 2006), which is based on the idea of measuring the cost variation after swapping any two documents in a given result list. As discussed in Donmez et al. (2009), this approach can be applied to several IR measures and it is capable of accurately discovering local optima.

LambdaRank can be summarized as follows. Given a query q and two candidate documents \(d_i\) and \(d_j\) in the training set with relevance labels \(l_i\) and \(l_j\) respectively, \(s_i\) and \(s_j\) are the scores currently predicted for the documents. The lambda gradient of any given IR quality function Q, as defined in Burges (2010), is:

$$\begin{aligned} \lambda _{ij} = \frac{\partial Q_{ij}}{\partial \left( s_i-s_j\right) } \ = \ \text {sgn}(l_i-l_j) \left| \varDelta Q_{ij} \cdot \frac{1}{1+e^{s_i-s_j}}\right| \end{aligned}$$

where the sign is determined by the document labels only, the first factor \(\varDelta Q\) is the quality variation when swapping scores \(s_i\) and \(s_j\), and the second factor is the derivative of the RankNet cost (Burges et al. 2005), which minimizes the number of disordered pairs. When \(l_i \ge l_j\), the quality Q increases with the score of document \(d_i\). The larger the quality variation \(\varDelta Q\), the higher the document \(d_i\) should be scored. Note that the RankNet multiplier fades \(\varDelta Q\) if documents are scored correctly, i.e. \(s_i \ge s_j\), and boosts \(\varDelta Q\) otherwise. The lambda gradient for a document \(d_i\) is computed by marginalizing over all possible pairs in the result list: \(\lambda _i = \sum _j \lambda _{ij}\). LambdaRank uses nDCG as Q and so \(\varDelta Q\) is the variation in nDCG caused by the swap of two documents.
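To make the computation above concrete, the following Python sketch (an illustrative assumption, not the authors' LightGBM code) computes the lambda gradients for one query using \(\varDelta \)nDCG@k as \(\varDelta Q\); the function and variable names are hypothetical.

```python
import numpy as np

def delta_ndcg(labels, ranks, i, j, k=10):
    """Change in nDCG@k when swapping the documents currently at ranks[i] and ranks[j]."""
    gains = 2.0 ** np.asarray(labels, dtype=float) - 1.0
    ideal = np.sort(gains)[::-1][:k]
    idcg = np.sum(ideal / np.log2(np.arange(2, ideal.size + 2)))
    def contrib(g, r):                         # discounted gain, zero beyond the cut-off
        return g / np.log2(r + 1) if r <= k else 0.0
    before = contrib(gains[i], ranks[i]) + contrib(gains[j], ranks[j])
    after = contrib(gains[i], ranks[j]) + contrib(gains[j], ranks[i])
    return (after - before) / idcg if idcg > 0 else 0.0

def lambda_gradients(labels, scores, k=10):
    """lambda_i = sum_j lambda_ij, with the RankNet multiplier 1 / (1 + exp(s_i - s_j))."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based rank of each document
    lambdas = np.zeros(len(scores))
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] <= labels[j]:
                continue                           # only pairs with l_i > l_j
            rho = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # RankNet derivative
            lam = abs(delta_ndcg(labels, ranks, i, j, k)) * rho
            lambdas[i] += lam                      # push d_i up
            lambdas[j] -= lam                      # push d_j down
    return lambdas
```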

3.1 User dynamics for LtR

We summarize here our methodology (Ferro et al. 2017) for including the user dynamics into the above discussed LambdaRank algorithm.

Yilmaz et al. (2010) have shown that effectiveness is often measured as the inner product of a relevance vector \({\mathcal {J}}\) and a discounting vector \({\mathcal {D}}\). The elements \({{\mathcal {J}}}_i\) account for the benefit of ranking a high-quality document at the i-th position of the search engine result page (SERP), while the elements \({{\mathcal {D}}}_i\) discount such contribution for low-ranked documents. For instance, according to Discounted Cumulated Gain (DCG) (Järvelin and Kekäläinen 2002) the i-th element of \(\mathcal J\) is defined as \({{\mathcal {J}}}_i=2^{l_i}-1\), where \(l_i\) is the relevance label of the i-th ranked document, and \({{\mathcal {D}}}_i = 1/\log _2(i+1)\). The underlying assumption is that low-ranked documents receive less attention from the user and therefore they contribute less to the user-perceived quality of the SERP. Defining a proper quality metric is crucial both for evaluating retrieval systems and for learning effective ranking models, as such metrics are used to drive the training process.

Most metrics assume that the user analyzes a SERP from top to bottom, and therefore define a decreasing discount vector. However, some user studies suggest that the probability of observing a result depends on the quality of the documents ranked higher: if the user finds a relevant document at position i, it is less likely that he/she will inspect the document at position \(i+1\) (Zhang et al. 2010). Yet the user behavior is more complex, as he/she can move forward and backward, can jump from one document to any other and can revisit already visited documents, as suggested by Ferrante et al. (2014), Sakai and Dou (2013).

Our work stems from the simple observation that the user behavior in visiting a SERP differs depending on the query type and the number and position of the relevant results. For example, it is likely that on a SERP with a single highly relevant result in the first position the user assumes a navigational behavior, while a SERP with several relevant results may likely correspond to an informational query, where a more complex SERP visiting behavior can be observed (Broder 2002). Since at training time a list-wise LtR algorithm such as LambdaMart is aware of the number and distribution of relevance labels associated with the training samples for each query, we suppose that it can profit from the knowledge of the user dynamics associated with the specific kind of query.

We model the user dynamics with a Markovian process (Norris 1998), where the user scans the ranked documents in the SERP according to possibly complex paths, moving both forward and backward. Let us denote by \(X_1, X_2, \dots \) the sequence of random variables representing the rank positions in \(\mathcal {R}=\{1,2,\dots ,R\}\) visited by the user, where \(X_n = j\) means that the \(n^{th}\) document visited by the user is at rank j. Moreover, we assume that the probability to move from the document at rank i to the document at rank j depends on the document at rank i only and is independent of all the previously visited documents. Finally, we denote by P the transition matrix whose entries represent the transition probabilities \(P = (p_{ij} :i,j \in \mathcal {R})\), where \(p_{ij} = \mathbb {P}[X_{n+1}=j|X_n=i]\). The sequence of random variables \((X_n)_{n>0}\) defines a discrete-time homogeneous Markov chain.

If we consider the transition matrix and compute \(P^n\), we obtain the probability of finding the Markov chain in a given state after n steps. By assuming that P is irreducible, which means that the probability of passing in a finite number of steps from a given document in the ranked list to any other document is positive, it can be proven that P admits a unique stationary distribution \(\pi P = \pi \) which, under the hypothesis of P being aperiodic (Norris 1998), is the limit of the n-step transition probabilities, \(p_{ij}^{(n)} \rightarrow \pi _{j}\) as \(n \rightarrow \infty \) for all i, j.

When extending this analysis to a long-term query log, we can consider the behavior recorded for each user as a different observation of the same stochastic process, and the resulting stationary distribution can be considered as an aggregated representation of user dynamics. Notice that the convergence is typically very fast, since already after \(n \approx 10\) iterations of the power method we obtain an accurate estimation of \(\pi \).
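As an illustration of this convergence, the following minimal sketch (with a made-up toy transition matrix, not the matrices estimated from the Yandex log) approximates \(\pi \) with a few iterations of the power method.

```python
import numpy as np

def stationary_distribution(P, n_iter=50, tol=1e-12):
    """Power method: iterate pi <- pi P for an irreducible, aperiodic,
    row-stochastic transition matrix P until the update is negligible."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform distribution
    for _ in range(n_iter):
        new_pi = pi @ P                          # one step of the chain
        done = np.abs(new_pi - pi).sum() < tol
        pi = new_pi
        if done:
            break
    return pi

# Toy 3-position SERP where users mostly move forward (illustrative numbers only).
P = np.array([[0.2, 0.7, 0.1],
              [0.3, 0.2, 0.5],
              [0.5, 0.3, 0.2]])
print(stationary_distribution(P))                # converges in a handful of iterations
```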

Since we observe that the behavior of users changes depending on the number of relevant documents in the SERP, we can classify queries on the basis of the number of relevant documents returned and estimate different transition matrices \(\hat{P}\) for different classes of queries. Specifically, we first aggregate the dynamics of different users on the basis of the query type, then we adopt the maximum likelihood estimator approach (Teodorescu 2009) on the aggregated data, as sketched in the code after the following steps:

  1. for each \(i \in \mathcal {R}\) let \(v_i\) be the number of times that the users visited the document at rank i given the query;

  2. if \(v_i=0\), then \(\hat{p}_{ij}=0\) for all \(j\ne i\) and \(\hat{p}_{ii}=1\);

  3. if \(v_i>0\), let \(v_{ij}\) be the number of transitions from the document at rank i to the document at rank j, then \(\hat{p}_{ij}=\frac{v_{ij}}{v_i}\).
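A minimal sketch of the estimator above, assuming `sessions` is a list of visit paths (sequences of 1-based rank positions) aggregated over all users issuing queries of the same class; the names are illustrative.

```python
import numpy as np

def estimate_transition_matrix(sessions, R=10):
    """Maximum likelihood estimate of P over rank positions 1..R (steps 1-3 above)."""
    counts = np.zeros((R, R))                      # v_ij: observed transitions i -> j
    for path in sessions:
        for i, j in zip(path[:-1], path[1:]):
            counts[i - 1, j - 1] += 1
    P_hat = np.zeros((R, R))
    for i in range(R):
        v_i = counts[i].sum()                      # visits to rank i with an observed next step
        if v_i == 0:
            P_hat[i, i] = 1.0                      # step 2: unvisited rank becomes absorbing
        else:
            P_hat[i] = counts[i] / v_i             # step 3: p_ij = v_ij / v_i
    return P_hat
```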

Figure 1a plots the stationary distributions obtained from the Yandex query log described in Sect. 4.1. We split queries in the log on the basis of the number of relevant documents retrieved and we computed the stationary distribution by aggregating the sessions of all the users sharing the same typology of query. When considering queries with just one relevant retrieved document, i.e. the red line with circle markers in Fig. 1a, the user dynamics exhibit a spike with respect to the first rank position, while for queries without any relevant documents or with more than one relevant document, i.e. the blue lines, the probability tends to be distributed more uniformly, meaning that the user is exploring the whole SERP.

Fig. 1 Stationary distributions for queries retrieving different numbers of relevant documents (a), and stationary distribution with its fitted curve and DCG discount for navigational (b) and informational queries (c)

We focus on these two distinct macroscopic behavior types, and, for the sake of simplicity, we call navigational the queries where users concentrated on just the first item, and we consider all the other queries as informational since users tend to visit more documents. To embed user dynamics in the LambdaMart cost function, as detailed in the next section, we abstract these two observed behavioral types (navigational and informational) by fitting a curve to the corresponding stationary distributions.

On the basis of the above experimental observations, we claim that the user dynamics can be described as a mixture of the navigational and informational behavior. The navigational component is represented by the inverse of the rank position i, \(\frac{1}{i}\), while the informational component is linear with respect to the rank position i. Therefore, we model the user dynamics as

$$\begin{aligned} \delta (i) = \alpha i^{-1} + \beta i + \gamma \end{aligned}$$

where the parameters \(\alpha \), \(\beta \) and \(\gamma \) are calibrated in order to fit the estimated stationary distributions computed on the Yandex dataset.
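A possible way to calibrate \(\alpha \), \(\beta \) and \(\gamma \) is a standard least-squares fit of \(\delta (i)\) against the estimated stationary distribution; the sketch below uses scipy's curve_fit and placeholder values for \(\pi \), since the Yandex-derived distribution is not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

def user_dynamics(i, alpha, beta, gamma):
    """delta(i) = alpha / i + beta * i + gamma."""
    return alpha / i + beta * i + gamma

ranks = np.arange(1, 11, dtype=float)
# Placeholder stationary distribution over the 10 rank positions (illustrative only).
pi = np.array([0.55, 0.12, 0.08, 0.06, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02])

(alpha, beta, gamma), _ = curve_fit(user_dynamics, ranks, pi)
delta = user_dynamics(ranks, alpha, beta, gamma)   # fitted discount delta(i), i = 1..10
```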

Figure 1b, c show the stationary distributions together with the fitted curves for the navigational and informational cases, respectively. In Fig. 1b the stationary distribution is the same reported in the red line of Fig. 1a, while to compute the stationary distribution reported in Fig. 1c we aggregate all the user dynamics corresponding to the other queries, i.e. queries without relevant documents or with more than one relevant document.

Since users clearly behave differently depending on the query at hand, considering the notation introduced at the beginning of Sect. 3, we might require different scoring functions \(\sigma _i\) depending on the query type, and a classifier c able to discriminate between query types. However, at query time, the classifier c can exploit only query-based features \(x_q\) since, in general, documents have not yet been retrieved, and neither their relevance nor their features are available. There is a large literature on query classification for different tasks, and query topic categorization is well addressed within Web companies to increase effectiveness, efficiency, and revenue potential in general-purpose Web search engines (Beitzel et al. 2005). Therefore, building such a classifier \(c(x_q)\) is feasible, but beyond the purpose of this paper.

However, the desired scoring model should be a function of both the query classifier \(c(x_q)\) and the class-dependent document scorer \(\sigma _{c(x_q)}\). Since both functions depend on x, they can be trivially replaced with a single scoring function \(\sigma (x)\). This allows us to avoid the design of a query classifier, and lets the LtR algorithm learn the query classification and document scoring at the same time.

In the following we will illustrate the approach adopted in order to include the user dynamics in LambdaMart. Roughly speaking, the idea is to consider the user dynamics as a discount function and to optimize a new evaluation metric based on the stationary distribution and query class. A similar approach is presented in Yilmaz et al. (2010), where the expected browsing utility is estimated on log data and used as discount function.

The user dynamics defined above can actually be considered as a discounting vector to be exploited in any given quality metric. Differently from other approaches, the user dynamics is defined on the basis of two different query classes which exhibit a different user behavior. Figure 1b, c show how different the derived user dynamics is from the DCG discounting component. Below we discuss how \(\delta \) can be exploited in a state-of-the-art LtR algorithm.

We enhance the existing LambdaMart algorithm by replacing the above Q with a new quality measure which integrates the proposed user dynamics \(\delta \). This new measure is called normalized Markov cumulated gain (nMCG) and it is defined as follows:

$$\begin{aligned} \text {nMCG@k} = \frac{\sum _{i\le k} \left( 2^{l_i}-1\right) \cdot \delta ^c(i)}{\sum _{h\le k,\ \text {sorted by } l_h} \left( 2^{l_h}-1\right) \cdot \delta ^{c}(h)} \end{aligned}$$

where \(l_i\) is the relevance label of the i-th ranked document and \(\delta ^c(i)\) is the user dynamics function at rank i relative to the query class c, either navigational or informational. Basically, nMCG can be seen as an extension of nDCG, where the discount function is defined by the user dynamics and depends on the query class. Moreover, since \(\delta ^c\) depends on the query class, i.e. depends on the query q, we are optimizing two different variants of the same quality measure nMCG across the training dataset. Finally, \(\varDelta \text {nMCG@k}_{ij}\) can be computed efficiently as follows:

$$\begin{aligned} \varDelta \text {nMCG@k}_{ij} = \frac{ - \left( 2^{l_i} - 2^{l_j}\right) \left( \delta ^c(i) - \delta ^c(j)\right) }{\sum _{h\le k,\ \text {sorted by } l_h} \left( 2^{l_h}-1\right) \cdot \delta ^c(h)}. \end{aligned}$$

Hereinafter, we use nMCG-MART to refer to the described variant of LambdaMart aimed at maximizing nMCG.
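The following sketch (an illustrative rendition, not the LightGBM plug-in) computes nMCG@k and the swap delta above for a list of labels in the current rank order; `delta_c` stands for the fitted class-dependent dynamics \(\delta ^c\), and both ranks i and j are assumed to lie within the cut-off.

```python
import numpy as np

def nmcg_at_k(labels_by_rank, delta_c, k=10):
    """nMCG@k: gains discounted by the user dynamics, normalized by the ideal ordering."""
    gains = 2.0 ** np.asarray(labels_by_rank, dtype=float) - 1.0
    disc = np.array([delta_c(r) for r in range(1, len(gains) + 1)])
    ideal = np.sort(gains)[::-1]                   # documents sorted by label
    den = np.sum(ideal[:k] * disc[:k])
    return np.sum(gains[:k] * disc[:k]) / den if den > 0 else 0.0

def delta_nmcg_at_k(labels_by_rank, delta_c, i, j, k=10):
    """Change in nMCG@k when swapping the documents at (1-based) ranks i and j <= k."""
    gains = 2.0 ** np.asarray(labels_by_rank, dtype=float) - 1.0
    disc = np.array([delta_c(r) for r in range(1, len(gains) + 1)])
    den = np.sum(np.sort(gains)[::-1][:k] * disc[:k])
    l_i, l_j = labels_by_rank[i - 1], labels_by_rank[j - 1]
    num = -(2.0 ** l_i - 2.0 ** l_j) * (delta_c(i) - delta_c(j))
    return num / den if den > 0 else 0.0
```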

Note that the query class is known at training time, and therefore the algorithm can optimize the proper user dynamics \(\delta ^c\). Neither the document relevance nor the query class information is available at test time; therefore the algorithm should, at the same time, classify queries and rank documents according to the different class-based dynamics \(\delta ^c\).

3.2 Continuation methods for LtR

Applying a continuation method to a cost function C means defining a sequence of cost functions \(C_{\gamma }\) with \(\gamma \in [0,1]\), such that the complexity of the function increases with \(\gamma \). Therefore, \(C_0\) represents a highly smoothed version of the original cost function, which corresponds to \(C_1\).

We implement continuation methods in the context of forests of decision trees as a two-step learning process, where the initial trees are trained by minimizing a cost function \(C_0\) and the remaining trees are trained by optimizing \(C_1\). In this work we did not explore more complex multi-stage scenarios, which would be a straightforward extension. We denote continuation methods as \(C_0T_0\_C_1T_1\), meaning that the first \(T_0\) trees of the ensemble minimize \(C_0\), while the remaining \(T_1\) trees minimize \(C_1\), additively refining the prediction of the first \(T_0\) trees.
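A minimal sketch of the \(C_0T_0\_C_1T_1\) scheme on top of a hand-rolled gradient-boosting loop (the paper's implementation is a LightGBM plug-in; this simplified version, with hypothetical gradient callbacks, only illustrates the switching mechanism).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_cm_forest(X, grad_c0, grad_c1, T0=200, T1=300,
                    learning_rate=0.05, max_leaves=64):
    """Boost T0 trees against the (negative) gradients of C0, then T1 trees against C1.
    grad_c0 / grad_c1 map current scores to per-document pseudo-responses,
    e.g. MSE residuals or lambda gradients of nDCG / nMCG."""
    scores = np.zeros(X.shape[0])
    forest = []
    for t in range(T0 + T1):
        grad_fn = grad_c0 if t < T0 else grad_c1   # switching point at T0
        targets = grad_fn(scores)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, targets)
        scores += learning_rate * tree.predict(X)  # additive refinement
        forest.append(tree)
    return forest
```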

In order to apply a CM to LambdaMart we need to smooth the LambdaMart loss function. Smoother variants of nDCG have been previously proposed, e.g. SoftNDCG (Taylor et al. 2008); however, their performance did not prove to be significantly better than the original LambdaMart loss function. Therefore, we look at two different smoother replacements of the nDCG measure.

The first option we consider is to use the mean square error (MSE) as \(C_0\). Even if MSE is easily differentiable and an MSE equal to 0 corresponds to the maximum nDCG value, the MSE cost cannot be considered a smooth approximation of nDCG. Nevertheless, gradient boosted regression trees that minimize MSE are known to perform well, even at optimizing nDCG. The rationale of using MSE as \(C_0\) is to exploit a smooth function that alone exhibits good performance as a seed for the refinement that the optimization of nDCG or nMCG may provide.

The second option we consider is to use Recall@k as \(C_0\). Recall is not a differentiable function either, as it suffers from the same issues as nDCG. Still, it is an easier function to optimize because it considers binary instead of graded relevance, and it discriminates between documents above or below the cut-off k without further discounting according to document ranks. The rationale is to train the first portion of the forest to place the most relevant documents above the cut-off, and then to complete the training with the target metric nDCG or nMCG to adjust their relative ranking.

Since Recall is not differentiable, we devised a LambdaMart-based approach similar to nMCG-MART. We define \(\varDelta \text {nRecall@k}_{ij}\) as the normalized change in Recall@k when swapping the i-th and j-th documents in the current ranking. Given a relevance binarization threshold \(\rho \), \(\varDelta \text {nRecall@k}_{ij}\) is defined as follows:

$$\begin{aligned} \varDelta \text {nRecall@k}_{ij} = \frac{ - \left( \mathbb {1}_{l_i\ge \rho } - \mathbb {1}_{l_j\ge \rho } \right) \left( \mathbb {1}_{i\ge k} - \mathbb {1}_{j\ge k} \right) }{\sum _{h} \mathbb {1}_{l_h\ge \rho } }, \end{aligned}$$

where \(\mathbb {1}_{p}\) is the indicator function evaluating to 1 if predicate p is true and 0 otherwise.
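A possible implementation of the formula above (illustrative, with hypothetical names): the delta is non-zero only when exactly one of the two documents is relevant and exactly one of the two ranks falls beyond the cut-off.

```python
import numpy as np

def delta_nrecall_at_k(labels_by_rank, i, j, k=10, rho=1):
    """Normalized change in Recall@k when swapping the documents at (1-based) ranks i and j."""
    labels = np.asarray(labels_by_rank)
    n_rel = int(np.sum(labels >= rho))             # denominator: number of relevant documents
    if n_rel == 0:
        return 0.0
    rel_i = int(labels[i - 1] >= rho)              # indicator 1_{l_i >= rho}
    rel_j = int(labels[j - 1] >= rho)
    out_i = int(i >= k)                            # indicator 1_{i >= k}, as in the formula
    out_j = int(j >= k)
    return -(rel_i - rel_j) * (out_i - out_j) / n_rel
```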

Finally, in addition to \(C_1\) being equal to nDCG, we also consider the case where nMCG is used as \(C_1\). Our goal is to evaluate the added value of exploiting the user dynamics in producing the final ranking, even when the first trees of the forest were built by optimizing a different metric.

In conclusion, by evaluating different combinations of \(C_0\) and \(C_1\), we are comparing the effectiveness of LambdaMart and nMCG-MART, and we are assessing the benefit they can receive from pre-training on a different metric, which actually allows the two algorithms to start their search process from a different point in the solution space, possibly leading to a better final solution.

4 Experimental evaluation

4.1 Experimental setup

We remark that, since there is no publicly available dataset providing, at the same time, user session data, document relevance and query-document pair features, we have to use two different datasets in our analysis. A first dataset providing user session data is used to derive the user dynamics, while standard LtR datasets are used for learning the ranking models.

We calibrate the proposed user model on the basis of the click log dataset provided by Yandex (Serdyukov et al. 2012) (http://imat-relpred.yandex.ru/en/). The dataset is composed of 340,796,067 records with 30,717,251 unique queries, retrieving 10 URLs each. We used the training set, which consists of 5,191 assessed queries with binary judgments, corresponding to 30,741,907 records. Notice that 9% of the sessions correspond to navigational queries, while the remaining 91% correspond to informational ones.

The accuracy of the proposed algorithms is instead evaluated on two public LtR datasets, MSLR-WEB30K and MSLR-WEB10K, provided by Microsoft (Qin and Liu 2013). The MSLR-WEB30K dataset encompasses 31,531 queries from the Microsoft Bing search engine for a total of 3,771,125 query-document pairs represented by 136 features. The dataset is provided as a 5-fold split. The MSLR-WEB10K dataset instead contains 10,000 queries sampled at random from the previous dataset.

The query-document pairs in both the datasets are assessed with integer relevance labels in the range [0, 4].

In order to classify queries as navigational or informational we adopt the following criterion: a query is considered navigational if it contains only one result with relevance label \(\ge 3\). Approximately \(15\%\) of the queries in the Microsoft datasets are classified according to this heuristic as navigational queries, a fraction quite similar to the one measured on the Yandex dataset. The exact number of navigational and informational queries for each fold is reported in Table 1 for MSLR-WEB10K and in Table 2 for MSLR-WEB30K.

Table 1 Number of navigational and informational queries included in MSLR-WEB10K
Table 2 Number of navigational and informational queries included in MSLR-WEB30K
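The heuristic used above to split the Microsoft datasets can be expressed in a couple of lines; a minimal sketch follows, where `labels` holds the relevance judgments of all documents assessed for a query.

```python
def is_navigational(labels, threshold=3):
    """A query is navigational iff exactly one of its documents has label >= threshold."""
    return sum(1 for l in labels if l >= threshold) == 1

# Example: [4, 1, 0, 0, 2] -> navigational; [3, 3, 1, 0] -> informational.
```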

We used LambdaMart as the reference LtR algorithm. Specifically, we used (and modified) the open source implementation of LambdaMart provided in the LightGBM gradient boosting framework and swept the hyperparameters with cross-validation so as to maximize the average performance over the validation folds. The learning rate \(\nu \) was varied in the set \(\{0.01, 0.05, 0.1, 0.5\}\), and the maximum number of leaves L in the set \(\{8, 16, 32, 64\}\), with best results observed for \(\nu =0.05\) and \(L=64\). Notice that in the following sections, whenever we refer to the baseline, we mean a LambdaMart model trained using the above settings and with nDCG@10 as effectiveness measure. We keep the same setting for all the other proposed models, which are trained with the same values of learning rate and number of leaves, and with a total of \(T=500\) trees. We implemented the proposed nMCG and CM algorithms as LightGBM plug-ins providing the required derivative and hessian matrices for the gradient boosting optimization. The source code of our implementation will be made available upon publication of the paper.
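For reference, the baseline configuration described above roughly corresponds to the following setup with the scikit-learn wrapper of LightGBM (a sketch only: the actual experiments modify the C++ core and plug in custom objectives; the variable names are hypothetical).

```python
import lightgbm as lgb

baseline = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMart: gradient-boosted trees driven by lambda gradients
    learning_rate=0.05,       # best nu found on the validation folds
    num_leaves=64,            # best L found on the validation folds
    n_estimators=500,         # total number of trees T
)
# X_train: query-document feature matrix, y_train: relevance labels,
# group_train: number of documents per query (hypothetical names).
# baseline.fit(X_train, y_train, group=group_train)
```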

For continuation method approaches, we report the results achieved when training a first set of \(T_0\) trees with the first objective function \(C_0\) and the remaining \(T_1 = T-T_0\) trees by exploiting a different objective function \(C_1\). We explored different switching points for the cost functions at \(T_0\) equal to 100, 200, 300, 400 trees, while the total number of trees T is always equal to 500. As function \(C_0\) we use Recall@10 or MSE, while as second objective function we use either nDCG@10 or nMCG@10.

As a naming convention, all the CM approaches are referred to with the name of the first objective function \(C_0\), the corresponding number of trees used during the first training phase \(T_0\), followed by the name of the second objective function \(C_1\) with the corresponding number of trees \(T_1\). Therefore, recallT300_ndcgT200 means that the model is trained with Recall@10 for the first 300 trees and then with nDCG@10 for the following 200 trees. Moreover, as reference models for CM approaches we report nMCG-MART and the best performing model proposed in Ferro et al. (2018), i.e. mseT200_ndcgT300, which is the continuation method approach where the first 200 trees were trained with MSE as objective function, and the following 300 trees were trained with nDCG@10.

In the following sections we describe the experimental evaluation conducted to investigate the performance of each model. More in detail, we address the following research questions:

  • RQ1—what are the best performing models? Section 4.2 presents the evaluation of each model with respect to different measures;

  • RQ2—what is the impact of navigational and informational queries? Section 4.3 addresses the evaluation of each model focusing on navigational and informational queries;

  • RQ3—what is the impact of the number of used trees? Section 4.4 studies the tree-wise performance of each model and compares the ranking models on the basis of the number of trees and the CM switching points.

4.2 RQ1: Models evaluation

In this section we evaluate the proposed models with respect to different measures. Besides the baseline, nMCG-MART and mseT200_ndcgT300, we report the evaluation performance of recallT300_ndcgT200, recallT400_ndcgT100, recallT300_nmcgT200 and recallT400_nmcgT100. We tested all the continuation method approaches with combinations of recall followed by nDCG or nMCG and switching point \(T_0 \in \{100, 200, 300, 400\}\); however, due to space constraints, we report only those combinations that lead to better performance.

Each model is evaluated separately on each fold and then the evaluation scores are averaged across the folds. As evaluation measures we use nDCG with two different cut-off thresholds, 10 and 100, Recall with cut-offs at 10 and 100, and expected reciprocal rank (ERR) (Chapelle et al. 2009) with cut-off at 10. Tables 3 and 4 present the evaluation scores on MSLR-WEB10K and MSLR-WEB30K respectively.

Table 3 Models evaluation on MSLR-WEB10K test set with respect to different measures
Table 4 Models evaluation on MSLR-WEB30K test set with respect to different measures

In each table the best score achieved with respect to each fold and measure is highlighted in bold. We remark that the nDCG scores of different rankers are commonly very close, and that, in web search rankers, changes of over half a percentage point of nDCG are considered major revisions (Chapelle et al. 2012). Moreover, we computed Fisher’s randomization test with 100,000 random permutations, which, according to Smucker et al. (2007), is the most appropriate statistical test to evaluate whether two approaches differ significantly. Thus, when an approach is marked with \(^{\bullet }\), it is significantly different from the baseline with a two-sided p value lower than \(\alpha = 0.05\). Similarly, we compare each continuation method combining recall followed by nDCG with the corresponding continuation method combining recall and nMCG, i.e. we compare recallT300_ndcgT200 against recallT300_nmcgT200 and recallT400_ndcgT100 against recallT400_nmcgT100. Significantly different approaches with a two-sided p value lower than \(\alpha \) are marked with \(^{+}\).
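For completeness, a minimal sketch of the two-sided Fisher randomization test on the per-query score differences (an illustrative implementation, not the exact scripts used for the tables).

```python
import numpy as np

def randomization_test(scores_a, scores_b, n_perm=100_000, seed=42):
    """Two-sided p value: randomly flip the sign of per-query differences and
    count how often the permuted mean difference is at least as extreme."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.size)   # random relabeling of the pair
        if abs((signs * diff).mean()) >= observed:
            count += 1
    return count / n_perm
```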

Table 3 shows that the CM approaches combining Recall and nMCG, namely recallT300_nmcgT200 and recallT400_nmcgT100, achieve the best performance on each fold in terms of nDCG@10. They are significantly different from both the baseline and the corresponding CM approaches which combine Recall followed by nDCG. This supports our hypothesis that the combination of CM and the user dynamics integrated by nMCG can boost LambdaMart performance.

The third best performing model is mseT200_ndcgT300, which is significantly different from the baseline on three out of five folds and also when the average is considered. On the other hand, nMCG-MART and the CM approaches recallT300_ndcgT200 and recallT400_ndcgT100 perform better than the baseline, but not significantly so.

When we consider nDCG with a larger cut-off, i.e. nDCG@100, mseT200_ndcgT300 becomes the best performing model on four out of five folds, being significantly different from the baseline on four folds and also on the average results. CM approaches involving the user dynamics are the second and third best performing models, again significantly improving over the baseline. Even if overall they are significantly different from the corresponding CM approaches exploiting nDCG, this does not always hold when each fold is considered separately. Indeed, only on two out of five folds are recallT300_nmcgT200 and recallT400_nmcgT100 significantly different from recallT300_ndcgT200 and recallT400_ndcgT100. However, the CM approaches which exploit nMCG are significantly better than the baseline, while this is not true for the CM approaches exploiting nDCG.

When considering Recall@10, the evaluation outcomes are similar to those obtained with nDCG@100. Thus, mseT200_ndcgT300 is the best performing model, followed by the CM approaches which exploit the user dynamics. Overall, all these approaches achieve statistically significant improvements over the baseline, but when each fold is considered separately this behaviour cannot be generalized, as these approaches are significantly different only on some folds.

Moreover, CM approaches perform close to mseT200_ndcgT300, with recallT300_ndcgT200 being the best performing model for nDCG on fold 5. This is further highlighted when the models are evaluated with respect to Recall@100. In this case, CM approaches that exploit nDCG instead of nMCG perform better and recallT300_ndcgT200 is the best performing model, significantly improving over the baseline. CM approaches using nDCG are followed by mseT200_ndcgT300 and then by CM approaches exploiting nMCG.

As shown by the comparison of nDCG@10 against nDCG@100, and of Recall@10 against Recall@100, CM approaches using nMCG perform better with lower cut-off thresholds. Therefore, they better place relevant documents at the beginning of the ranked list. This is further confirmed when ERR@10 is taken into account. Table 3 shows that, when ERR@10 is considered, CM approaches using nMCG are the best performing ones. Moreover, recallT200_nmcgT300 is the only approach which is significantly better than the baseline with respect to the average over all the folds. CM approaches using nMCG are also significantly better than the corresponding approaches using nDCG, both overall and with respect to each separate fold.

It is well known that ERR is a heavily top-biased measure, i.e. it assigns very high weights to relevant documents at top rank positions and applies a strong discount to the following rank positions. Therefore, the evaluation conducted with ERR@10 confirms our hypothesis that CM approaches leveraging the user dynamics improve the effectiveness at the top rank positions. Moreover, since the user dynamics exploited by nMCG is defined only for the first 10 rank positions, by extending the user dynamics further down in the ranking we might be able to improve the score of our models at lower rank positions. However, this is out of the scope of the present work and will be considered in future investigations.

Table 4 reports the evaluation scores of each model on MSLR-WEB30K instead of MSLR-WEB10K. The conclusions that we can infer from Table 4 are aligned with those inferred from Table 3.

When we consider nDCG@10 as evaluation measure, CM approaches using nMCG, namely recallT300_nmcgT200 and recallT400_nmcgT100, are the best performing ones both on each separate fold and on the average across folds. Furthermore, they are significantly better than the baseline both overall and on each fold. Overall, they are also significantly different from the corresponding CM approaches adopting nDCG instead of nMCG. Again, this corroborates our intuition that combining CM and the user dynamics is a successful strategy, which leads the models to perform better than the baseline and the CM approaches without user dynamics.

As it happens for MSLR-WEB10K, when nDCG@100 is considered, CM approaches using nMCG are less effective and mseT200_ndcgT300 is the best performing model, significantly improving over the baseline both on each fold and on the average across folds. Moreover, CM approaches using nMCG are not significantly different from those using nDCG. However, they are still significantly improving over the baseline.

When we consider Recall@10, the outcomes of the evaluation are similar to those using nDCG@10. mseT200_ndcgT300 is still the best performing model when the average across each fold is considered. However, if we break down the results on each fold, recallT400_nmcgT100 is the best performing model on three folds out of five. Moreover, recallT400_nmcgT100 is significantly better than the baseline and than recallT400_ndcgT100.

When we consider Recall@100, CM approaches using nDCG perform better than those using nMCG. Both recallT300_ndcgT200 and recallT400_ndcgT100 are the best performing models on three folds out of five and they are significantly better than the baseline. However, when considering the score across all the folds, mseT200_ndcgT300 is still the best performing model. Thus, when using Recall@100 as evaluation measure, we reach conclusions similar to those that can be inferred from nDCG@100.

Summing up, CM approaches exploiting the user dynamics perform better at lower rank positions, while CM approaches using nDCG and mseT200_ndcgT300 are somehow better in retrieving a higher number of relevant documents in the first 100 rank positions. This is further supported by the results when using ERR@10. recallT300_nmcgT200 is the best performing model in terms of ERR@10 and it is the only approach which is significantly better than the baseline. Moreover, CM approaches using nMCG are also significantly better than those using nDCG, with respect to both each separate fold and the average across folds.

4.3 RQ2: Models comparison on navigational and informational queries

In this section we focus the comparison among models on navigational and informational queries separately. Tables 5 and 6 report the evaluation scores on MSLR-WEB10K for navigational and informational queries, respectively. We adopt the same notation used for the tables presented in Sect. 4.2: the best score is reported in bold, \(^{\bullet }\) denotes statistically significant improvements with respect to LambdaMart, and \(^{+}\) denotes statistically significant improvements of CM approaches trained with nMCG with respect to those trained with nDCG.

Table 5 Models evaluation on MSLR-WEB10K navigational queries with respect to different measures

From Table 5 we can note that there are just a few models which are significantly better than the baseline for navigational queries. This trend is independent of the evaluation measure adopted, with a few exceptions represented by mseT200_ndcgT300 with respect to nDCG@10 and nDCG@100, and recallT300_nmcgT200 with respect to nDCG@10. This suggests that all the models achieve somehow a comparable performance when evaluated with respect to navigational queries. We argue that with navigational queries the number of relevant documents is limited and systems have little room for manoeuvre: they need to place the most relevant document at the top of the ranked list and, when doing this properly, they all perform equally well in achieving this task.

Furthermore, even if the results are not statistically significant, we can observe from Table 5 that no model performs markedly better than the others; there is instead a lot of variability depending on the measure and the fold under consideration. For example, when using nDCG@10, mseT200_ndcgT300 is the best performing model on average but it is the best model on just three out of five folds; when using nDCG@100 instead, mseT200_ndcgT300 is the best performing model on four folds. The scenario is different when we consider Recall@10, Recall@100 or ERR@10, where there is no regular trend across the folds and the best performing model changes on each fold.

Thus, we conclude that, when using navigational queries, all the models achieve a comparable performance and none of them should be actually preferred. Indeed, except for a couple of cases, none of the proposed models is significantly better than the baseline.

However, this is not the case when we consider informational queries, as shown in Table 6. When using nDCG@10, CM approaches embedding the user dynamics are the best performing models on each fold. Moreover, they are significantly better than the baseline on each fold, while all the other models are not significantly different, except for mseT200_ndcgT300 in two cases: fold 2 and on the average. CM approaches using nMCG are also significantly better than those using nDCG, both on each fold and on the overall average.

Table 6 Models evaluation on MSLR-WEB10K Informational queries with respect to different measures

The experimental comparison of models on informational queries provides more insights on the performance of CM approaches combined with the user dynamics. The exploitation of the user dynamics together with CM boosts the ranking of queries for which several relevant documents are available. Indeed, it helps in identifying those documents that are more relevant than others and in placing them at the beginning of the ranked lists.

As observed on the whole dataset in Table 3, when the nDCG cut-off is increased to 100, CM approaches using nMCG are less effective than mseT200_ndcgT300. mseT200_ndcgT300 is the best performing model on the overall average and on three folds, while CM approaches using nMCG are the best performing models just on two folds. Both mseT200_ndcgT300 and CM approaches using nMCG are significantly better than the baseline. Furthermore, when the average across folds is considered, CM approaches using nMCG are better than their corresponding models using nDCG.

The evaluation using Recall@10 leads to conclusions comparable to those for nDCG@10. Even if the best performing model is mseT200_ndcgT300 instead of recallT400_nmcgT100, CM approaches using nMCG are the best performing models on three folds and they are significantly improving over the baseline. Again, when the Recall cut-off is extended to 100, CM approaches using nMCG are less effective than those using nDCG, which significantly improve over the baseline.

Therefore, we come to a conclusion similar to the one we claimed for the whole dataset: CM approaches using nMCG perform better at the top rank positions, while they are less effective when all the rank positions up to 100 are considered. Even if there is no statistical difference with respect to LambdaMart, this is further supported by ERR@10, which considers CM approaches using nMCG as the best performing models also on informational queries. Furthermore, CM approaches using nMCG are significantly better than those using nDCG over each fold with respect to ERR@10.

We conducted the same analysis for navigational and informational queries on MSLR-WEB30K, as reported on Tables 7 and 8 respectively. The experimental results are aligned with those reported for MSLR-WEB10K.

Table 7 Models evaluation on MSLR-WEB30K navigational queries with respect to different measures
Table 8 Models evaluation on MSLR-WEB30K informational queries with respect to different measures

Firstly, except for a few cases, there is no statistical improvement over the baseline for any measure or models when navigational queries are considered. This is the same trend observed in Table 5. However, CM approaches using nMCG are significantly better than those using nDCG when evaluated with nDCG@10, nDCG@100 and ERR@10. Thus, even if none of the models is performing better than the baseline, CM approaches using nMCG can be preferred to those using nDCG, especially when the interest is at the top rank positions.

Secondly, on MSLR-WEB30K the high variability of the best performing model with respect to the measure and the fold is even more visible than on MSLR-WEB10K. Given an evaluation measure, none of the proposed models is consistently the best performer across every fold. For instance, the best performing model is always different for each fold when nDCG@10 or ERR@10 are considered. The only exception is Recall@10, where mseT200_ndcgT300 is significantly better than the other models on three folds and on the average across folds.

Therefore, also for navigational queries on MSLR-WEB30K, we can conclude that none of the models can be considered somehow outstanding or can be preferred over other models for each measure and fold. Except for a few cases, the proposed models are not significantly improving over the baseline and this can be due to the small amount of relevant documents included in the datasets. As discussed above, models just need to place the single highly relevant document at the beginning of the ranking and they are all comparable in achieving this task.

Finally, Table 8 reports the evaluation scores for each measure and fold with respect to informational queries on MSLR-WEB30K. When nDCG@10 is considered, CM approaches trained with nMCG are the best performing models, significantly improving over the baseline on each fold. Moreover, when the average across folds is considered, they are also significantly better than CM approaches trained with nDCG. This further confirms our hypothesis that combining CM with the user dynamics is a successful approach, especially in terms of nDCG@10.

We can observe a similar trend when Recall@10 is used as evaluation measure. In this case, mseT200_ndcgT300 is the best performing model when we consider the average across folds and it is significantly better than the baseline. However, when we consider the scores on each single fold, mseT200_ndcgT300 is the best model on fold 4 and fold 5, and similarly recallT400_nmcgT100 is the best performing model on fold 2 and fold 3. With respect to the average across folds, recallT400_nmcgT100 is the second best performing model, significantly improving over the baseline and over recallT400_ndcgT100.

ERR@10 further highlights this trend: recallT300_nmcgT200 is the best performing model on three folds out of five and when the average across folds is considered. Moreover, it is the only model which is significantly better than the baseline. Both recallT300_nmcgT200 and recallT400_nmcgT100 significantly improve over recallT300_ndcgT200 and recallT400_ndcgT100, both on each separate fold and on the average across folds.

However, as previously observed, when the cut-off is increased to 100, CM approaches using nMCG are less effective. Thus, when nDCG@100 is considered, mseT200_ndcgT300 is the best performing model, followed by CM approaches with Recall and nDCG. When using Recall@100, the ranking of models is reversed and CM approaches trained with Recall and nDCG are the best performing models, followed by mseT200_ndcgT300 (even if their scores are equal up to the third decimal digit). Although CM approaches trained with nMCG are not the best performing models when the cut-off is set at 100, they still significantly improve over the LambdaMart baseline.

Therefore, we can conclude that with informational queries CM approaches exploiting the user dynamics represent the best models or are among the best ones when the focus is on the top 10 positions, meaning that exploiting the user dynamics is beneficial for identifying the most relevant documents and moving them towards the top of the ranking.

As discussed above, the user dynamics accounts only for the first 10 rank positions, and extending the user dynamics deeper in the ranking might potentially improve the performance of these models, even at higher cut-off thresholds. Moreover, even if the proposed CM approaches using nMCG are not the best performing models with respect to nDCG@100 or Recall@100, they still significantly improve over the LambdaMart baseline.

Finally, in Fig. 2 we further investigate how the nMCG metric impacts the model training process compared to nDCG. Specifically, we computed the distribution of nDCG@10 scores generated by the LambdaMart model over the test queries of Fold 1 of the MSLR-WEB30K dataset, and the distribution of nMCG@10 scores generated by the nMCG-MART model (we observed similar results on the other folds). While the score distributions cannot be compared directly, as they refer to different IR metrics, they can be used to analyse the behaviour of the two models. Figure 2 shows, distinguishing between informational and navigational queries, the nDCG@10 scores by LambdaMart for each query, where queries are sorted from worst to best performing, and the nMCG@10 scores by nMCG-MART for the same queries in the same order. We can observe two very different behaviours. For navigational queries the two distributions are similar, showing some correlation. This means that whenever LambdaMart is able to provide a high nDCG@10 for a query, nMCG-MART is accurate as well. Indeed, the user dynamics enforced by nMCG is very similar to the nDCG discount: the first rank position is highly promoted compared to the others, as also shown in Fig. 1. For informational queries, the two distributions show instead a more limited correlation: queries where LambdaMart is accurate might not be the best queries for nMCG-MART and vice versa. This means that, to some extent, the nMCG-MART model devotes its focus and training effort to a different subset of queries, due to the different user dynamics of the nMCG metric. This results in a better accuracy across different metrics, other than nMCG, as observed in this subsection.

Fig. 2 Comparison of nDCG@10 versus nMCG@10 score distributions for informational and navigational queries

4.4 RQ3: Tree-wise performance comparison

In this section we analyse the tree-wise performance of the proposed models. Figure 3 compares the performance of different models on MSLR-WEB30K fold 1. In each plot, the x-axis reports the number of trees while the y-axis reports the nDCG@10 score. Since the trend of the proposed models is similar across folds and datasets, we do not report the same figure for each fold and for MSLR-WEB10K, but choose fold 1 as an explanatory example.

Fig. 3 Comparison of continuation methods approaches trained with nMCG and with different switching points (a) and comparison of different models on test queries (b), just on navigational queries (c) and just on informational queries (d). All the figures refer to MSLR-WEB30K Fold 1 and all the models are evaluated with respect to nDCG@10

Figure 3a shows the tree-wise performance of different CM approaches trained with Recall, followed by nMCG, together with LambdaMart as baseline. For the CM approaches, we varied the switching point \(T_0\) in the set \(\{100, 200, 300, 400\}\). Note that all the CM approaches are always better than the baseline as the number of trees increases. Moreover, each boost in performance corresponds exactly to the switching point \(T_0\). Therefore, replacing the objective function from Recall to nMCG is beneficial for the learning algorithm at any switching point. Indeed, the learning algorithm gets an immediate boost at the switching point and then keeps improving at a decreasing rate until a plateau is reached.

Figure 3b–d compare recallT200_nmcgT300 with recallT200_ndcgT300, mseT200_ndcgT300, nMCG-MART and LambdaMart as baselines. Figure 3b shows the tree-wise performance of each model on the whole test set. Again, all the proposed models are clearly better than the baseline as the number of trees increases. Focusing on the CM approaches leveraging the user dynamics, it is interesting that these methods achieve the greatest boost when the objective function is replaced, i.e. at \(T_0=200\). Indeed, when the number of trees is lower than 200, the best performing model is mseT200_ndcgT300, meaning that MSE leads to better nDCG@10 scores than Recall. However, after the switching point recallT200_nmcgT300 becomes the best performing model, thanks to nMCG which increases nDCG@10 scores more than nDCG. This supports once more our hypothesis that combining CM approaches with nMCG is a successful strategy.

Figure 3d corresponds to Fig. 3b, but considers just informational queries. The general trend of each model is quite similar to the one illustrated in Fig. 3b: before the switching point MSE leads to better performance, while after the switching point nMCG allows the CM to achieve higher scores than nDCG as the number of trees increases. This is perfectly aligned with the results reported in Table 4 and in Table 8, where the outcome of the models comparison is similar for the whole dataset and for informational queries with respect to nDCG@10.

Finally, Fig. 3c corresponds to Fig. 3b, but considers just navigational queries. It is evident that the lines representing each model are not as smooth as in Fig. 3b or Fig. 3d. This mirrors the results reported in Table 7, which show how the best performing model strongly depends on the fold and none of the models is stable across the folds with respect to nDCG@10. Indeed, in Fig. 3c all the model lines cross each other multiple times and the best performing model depends on the number of trees. With less than 200 trees mseT200_ndcgT300 is the best model, then recallT200_nmcgT300 is the best model and, finally, with more than 400 trees mseT200_ndcgT300 is again the best performing model. Therefore, Fig. 3c further highlights that the considered models struggle in learning how to rank documents when the amount of relevance available is limited. As future work we plan to investigate different combinations of CM approaches to achieve better performance also on navigational queries.

5 Conclusion and future work

This paper investigated whether integrating in LtR the knowledge of the user-interaction model and the possibility of targeting different objective functions is profitable and allows us to train more effective ranking functions. Building on the results of our previous work, we modeled the user dynamics in scanning a ranked result list with Markov chains and integrated the resulting model in the loss function driving the learning process. Moreover, with the aim of better capturing the weak relevance signals hidden in the features representing the training samples, we designed the iterative learning of ranking functions based on forests of decision trees as a two-step process, where the initial trees are trained by minimizing a simpler, possibly smoother objective function, while the remaining trees are trained by optimizing nDCG or our new nMCG measure, which is aware of the user dynamics.

To assess our proposal we conducted an exhaustive set of reproducible experiments based on publicly available datasets and code. We swept all the hyperparameters to devise the setting of the baselines and of our proposed algorithms providing the best performance on the validation datasets. The results of the experiments, measured with different metrics and cut-offs, gave us a clearer understanding of the problem addressed and confirmed our initial intuitions. We showed, by also breaking down the analysis to different classes of queries, that the proposed continuation methods exploiting our user-aware nMCG measure consistently outperform the baselines by statistically significant margins.

As future work we will study different methods for applying click models to LtR datasets. A possibility suggested in Chuklin et al. (2015) could be to exploit editorial relevance labels as a link between different datasets and use a trained click model to generate new click-based features for datasets not providing them. Moreover, we plan to investigate more complex continuation methods, for example with three steps instead of two, to understand whether there is a performance boost at each switching point and whether changing the objective function is always beneficial for the learning algorithm.