Elsevier

Computer Networks

Volume 183, 24 December 2020, 107574
Computer Networks

Size-based scheduling for TCP flows: Implementation and performance evaluation

https://doi.org/10.1016/j.comnet.2020.107574Get rights and content

Abstract

In many theoretical works, the benefits of size-based scheduling disciplines have been proved. The Foreground–Background discipline (or Least Attained Service (LAS)) and the multi-level processor sharing discipline are two examples of scheduling algorithms that provide an improvement of the expected flow completion time for heavily-tailed TCP flows with respect to the widely adopted Processor Sharing (PS) discipline.

In this work, we provide an implementation of the 2-level processor sharing queue (2LPS) for a Linux kernel and we measure its performance with TCP flow sizes taken from real-world datasets. 2LPS is the simplest implementation of a size-based scheduling and is parametrised with a threshold a: at each epoch, flows which received less than a service have hard priority on those that have crossed the threshold.

We give a numerical procedure to solve the 2LPS queue which allows for the computation of the optimal threshold for arbitrary job size distributions (approximated arbitrary well by generalised hyper-exponential distributions) and compare the results with those obtained by other disciplines (LAS and Shortest Remaining Processing Time (SRPT)).

Finally, we measure the performance of its implementation. Our findings show that the benefits of 2LPS, although slightly lower than those predicted by the theoretical model, are still significant and that using just two levels with optimal threshold seems the best trade-off between the complexity and resource demand of the implementation of a size-based scheduler and the improvements in the expected response time.

Introduction

TCP flows consist of a stream of packets with the same source and destination sockets [1], [2]. In this work, we study the classical scenario in which several TCP connections share the same router to receive or send data. In most implementations, the router simply serves the packets according to their arrival order although some theoretical works suggest that different policies may lead, in some cases, to substantial benefits in terms of reduction of the expected flow completion time. It is worth of notice that, although the router serves the packets according to a First Come First Served discipline, from the TCP flow prospective the discipline is very similar to a Processor Sharing (PS) [1], [3], i.e., a discipline that fairly divides the capacity of the bottleneck among the flows.

When the flow sizes are known in advance by the schedulers, the Shortest Remaining Processing Time (SRPT) discipline is the optimal policy for the minimisation of the expected completion (response) time of the flows [4] (see [5] for an implementation of an approximated SRPT). In our scenario, this information is not usually available, thus in this work SRPT represents an oracle discipline depicting the limit under which the expected response time cannot go in any case.

In the context we are considering, a more realistic solution to the improvement of the expected flow completion time is the class of algorithms that schedule the flows according to the amount of service that they have already received. The foreground–background is presented in [6] and, in the context of networking, has been called Least Attained Service (LAS). In this discipline, the flows are grouped according to the amount of service they have received. The flows in the group with the lowest amount of attained service are served according to a PS scheduling. The implementation of LAS in real devices poses serious problems including the need of managing many priority queues each of which stores the packets belonging to the group of flows that received the same amount of service. As a consequence, in the literature, some approximations have been introduced. In the multi-level version of LAS, the job size is divided into intervals by a certain number of thresholds. Intervals are ordered by priority and correspond to different queues. In this approximation, we consider all the flows that have received an amount of service belonging to the same interval to have the same priority.

In this paper, we focus on the multi-level approximation of LAS, more specifically on the approximation consisting of only two levels. Following [6], we call this discipline ‘two-level processor sharing’ (2LPS). In other works, this discipline is known as PS+PS [7].

Most of previous works focus either on stochastic simulation models or on analytical models for drawing their conclusions. Although analytical models provide deep insights on the considered scheduling disciplines, we must note that they rely on many assumptions that are not met in real-world scenarios. For example, the congestion mechanism of TCP is abstracted out, the possible timeouts of the packets with low priority and their impact on the performance indices are ignored. This paper aims at covering this gap by providing an implementation of the 2LPS scheduling discipline for a Linux kernel and by measuring its performance under a workload generated from real-world data centre traces. In order to achieve this goal, we address several theoretical and practical problems. Thus, the contributions of the paper can be summarised as follows:

  • 1.

    We give a computationally efficient solution of the 2LPS queueing discipline for TCP flows whose sizes are modelled by generalised hyperexponential (GH) distributions. Previously, the problem was solved for hyperexponential distributions, but these are not sufficiently expressive to fit data centres TCP flow traces. This result is practically important because it provides a computationally efficient method to derive the optimal parametrisation for 2LPS.

  • 2.

    We give an implementation of the 2LPS discipline in a Linux Kernel. This implementation is suitable to be installed in the network devices based on such an operating system.

  • 3.

    We consider three case studies. The first uses datasets validated by the scientific literature [8] but dates back to 2010. The other two are datasets collected at the data centre of Università Ca’ Foscari Venezia in 2019 at two different campuses. Using new traces is necessary to cope with the changes in the traffic streams that occurred in the last decade (increase of video on demand, streaming services etc.). The fitting of the traces into a GH distribution consists of two steps. By using PhFit [9], we obtain an acyclic phase type distribution (APH) fitting the data. Then, we prove that, under mild conditions, any APH distribution can be transformed into a GH and provide explicit equations for the conversion.

  • 4.

    We compare the performance predicted by the analytical model with those measured in the real system. Although we generally confirm the benefits of the 2LPS implementation over the standard Linux scheduler, we notice that these are slightly reduced with respect to what predicted by the analytical models.

  • 5.

    We compare the performance of 2LPS with the optimal threshold with those of LAS and SRPT.

  • 6.

    Finally, our experiments show that, in a 2LPS, the optimal value of the threshold mainly depends on the distribution of the job size while the intensity of the workload plays a minor role in its computation. As a consequence, setting the optimal threshold can be done for an average/high workload but still remains a good choice in other conditions. This observation has practical importance since while the intensity of the workload is expected to change frequently in real systems, the distribution of the TCP flow sizes tends to be stationary. As a consequence, in practical implementations, the computations of new optimal thresholds need not to be done too frequently.

A consequence of our findings is that it is possible to implement a scheduler based on 2LPS that dynamically computes the optimal threshold given the traffic intensity and the distribution of the TCP flow sizes. Indeed, the optimal threshold can be computed without much efforts and the heaviest part is that of the fitting procedure.

This paper extends the work [10] is two directions. First, it includes all the proofs of the theorems and propositions. Second, it considers two new case studies with datasets collected in 2019.

Internet traffic consists of many short flows and a small number of large flows that constitute more than half of the data volume. For this reason, several works have been done in order to differentiate short flows from large flows [1], [11], [12], [13], [14], [15].

In the last years, a great effort has been made to study multi-level scheduling policies, based on LAS (see [16] and the references therein) that can privilege short flows, without prior knowledge on the size of the flows. Examples of LAS based scheduling disciplines are LARS [17], which gives higher priority to packets of flows with less transferred traffic. In [18], the authors propose RuN and RuN2C which are two implementations in simulation models of LAS and 2LPS, respectively. In this work, the system assigns a priority to TCP packets according to the rank of their first data byte. In [12], [13], [19], the authors propose a two-class based architecture to provide improvements on the response time of short flows. They present results that show reasonable gains in the average performance. In [13], [19] the authors also discuss the analytical modelling of their approach, but they rely on an approximate numerical procedure to derive the performance indices. In our work, we have an exact solution to Kleinrock’s system of equations [6] which is also computationally efficient and stable. Symbolic solutions can be obtained if the number of phases of the GH distribution is small.

In [20], the authors provide bounds for the analysis of the conditional expected response time of multi-level processor sharing queues and important asymptotic results that are confirmed by our findings. With respect to [20], from a theoretical point of view, we give an exact solution for the GH service time distributions that can approximate any distribution arbitrary well.

For what concerns the queueing analysis, the results presented in [21], [22] are closely related to those shown here. The authors consider job size distributions that are hyperexponential in contrast with our more expressive choice of using GH distributions. In [22], the author proposes a solution to a PS queue with batch arrivals and hyperexponential service time distribution and studies the application to multi-level systems. In [21], the authors provide an expression for the optimal choice of the threshold when the hyperexponential distribution consists of two phases. Then, for more phases, symbolic upper and lower bounds to the conditional expected response time are derived. Aside the difference in the flow size distribution, with respect to our work, the paper does not address the problem of testing the discipline on real datasets.

Other important related works are [7], [23]. Here, the authors provide a mean delay analysis of 2LPS and multi-level PS, respectively, based on the numerical solution of a differential equation. The observation on the efficiency of the 2LPS with optimal threshold can be found also in [7], [23] for distributions with decreasing hazard rates. In our work, we confirm these results by taking a more practical approach and adopt real datasets to study the behaviour of the 2LPS with respect to SRPT and LAS (notice that the flow size distributions have asymptotically decreasing hazard rates, but are not decreasing on their entire support).

The paper is organised as follow: in Section 2, we review the scheduling disciplines that we consider in the paper. In Section 3, we present a method to fit TCP traces into GH distributions. Section 4 proposes the analytical model of the 2LPS scheduling with homogeneous Poisson arrivals and GH distribution for the job sizes. Section 5 describes the implementation of the 2LPS for the Linux scheduler, and Section 6 shows the application of our model to the old and new datasets, and shows the results of the benchmarking experiments. Finally, Section 7 concludes the paper.

Section snippets

Scheduling strategies

We are interested in the performance evaluation of different scheduling disciplines applied at the centralised device. Several TCP connections compete for the bandwidth controlled by this device and we assume that the scheduler can decide which packet to send among those stored in its queue. Each packet belongs to a stream which can be identified by the TCP control information, i.e., the source and destination sockets.

In the queueing abstraction of such a system, the jobs correspond to the TCP

Fitting the TCP flow sizes into GH distributions

In this section, we recall the basic notions and notation on the generalised hyperexponential (GH) distribution and introduce a method for fitting data samples taken from TCP flow sizes into GH random variables. This is done by using an existing tool, namely PhFit [26], that fits traces into acyclic phase-type random variables in a certain canonical form. Unfortunately, this canonical form does not necessary correspond to a GH distribution. Hence, we study the conditions for the existence of a

2LPS queueing systems with GH job size distributions

The implementation of 2LPS for Linux operating system requires its parametrisation with a threshold a. The analytical study of the 2LPS system plays a crucial role in the deployment of this scheduling discipline because it provides an efficient way to compute the optimal parametrisation. In Section 6, we will also show that the parametrisation provided by the analytical model is in fact very good for the performance of the deployed system.

While other solution methods of the 2LPS have been

Implementation of 2LPS in Linux Kernel and its performance

This section describes our implementation of the 2LPS discipline in the Linux kernel. In Section 6 we study its performance and compare it with the prediction of the theoretical model.

Results

In this section, we propose the outcomes of our analyses for the datasets described in Section 3, called UNI1, UNI2, UNI3. More specifically, we show:

  • The plots of the expected flow completion time as function of the threshold a for the three datasets for different load factors obtained by the analytical model. These plots are used to obtain the optimal threshold for the model.

  • We show the plot of the expected flow completion time conditioned to its size for the optimal threshold. In order to

Conclusion

In this paper, we have made a step in the direction of understanding the performance of multi-level size-based scheduling with real TCP traffic. Although in the literature the problem has been widely addressed by previous works, they relied on simulations or analytical models with synthetic traffic patterns. To cover this gap, we developed a procedure to fit real TCP traces for data centres into APH and GH distributions. The former is used to generate the workload in benchmarking experiments

CRediT authorship contribution statement

Andrea Marin: Conceptualization, Methodology, Formal analysis. Sabina Rossi: Conceptualization, Methodology, Formal analysis. Carlo Zen: Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Andrea Marin is an associate professor of Computer Science at the University Ca’ Foscari of Venice since 2016. He received his Ph.D. degree in Computer Science in 2007 from the same university. His main research interests include stochastic modelling of computer and communication systems for performance and reliability analysis, queueing theory, and models with product-form solutions. He has contributed in developing a probabilistic calculus for the formal analysis of wireless ad-hoc networks.

References (39)

  • KleinrockL.

    Queueing Systems, vol. II: Computer Applications

    (1976)
  • AaltoS. et al.

    Two-level processor-sharing scheduling disciplines: Mean delay analysis

    ACM Sigmetrics Perf. Eval. Rev.

    (2004)
  • T. Benson, A. Akella, D.A. Maltz, Network Traffic Characteristics of Data Centers in the Wild, in: Proc. of the 10th...
  • A. Bobbio, A. Horváth, M. Telek, PhFit: A General Phase-type Fitting Tool, in: Proc. of the Int. Conf. on Dependable...
  • MarinA. et al.

    Theoretical and experimental evaluation of the two-level processor sharing discipline for TCP flows

  • BiersackE.W. et al.

    Scheduling in practice

    ACM Sigmetrics Perf. Eval. Rev.

    (2007)
  • L. Guo, I. Matta, The war between mice and elephants, in: Proc. of the 9th IEEE Int. Conf. on Network Protocols, 2001,...
  • GuoL. et al.

    Scheduling flows with unknown sizes: Approximate analysis

    ACM Sigmetrics Perf. Eval. Rev.

    (2002)
  • I. A. Rai, G. Urvoy-Keller, Size based scheduling with differentiated services to improve response time of highly...
  • Cited by (6)

    Andrea Marin is an associate professor of Computer Science at the University Ca’ Foscari of Venice since 2016. He received his Ph.D. degree in Computer Science in 2007 from the same university. His main research interests include stochastic modelling of computer and communication systems for performance and reliability analysis, queueing theory, and models with product-form solutions. He has contributed in developing a probabilistic calculus for the formal analysis of wireless ad-hoc networks.

    Sabina Rossi received her Ph.D. in Computational Mathematics and Informatics from the University of Padova in 1994. She is Associate Professor of Computer Science at the University Ca’ Foscari of Venice since Nov. 2012. Formerly she has been Assistant Professor at Ca’ Foscari (2000–2012), visiting professor at Universitè Paris 7 (2007) and research fellow at the Universitè Catholique de Louvain-la-Neuve, Belgium (1997). She is the (co-)author of over 50 technical papers in refereed international journals and conference proceedings. Her current research focuses on the development of formal tools for the analysis and verification based on process algebraic techniques and, specifically, on stochastic process algebras. Sabina Rossi has been an invited speaker at IFIP Theoretical Computer Science TCS 2010 and has been in the programme committees of various international conferences and workshops.

    Carlo Zen is a master student of Computer Science at the University Ca’ Foscari of Venice. He graduated cum laude in Computer Science in 2019. His main interests include the performance evaluation of computer networks and the analysis of security protocols.

    View full text