1 Introduction

Machine learning models are vulnerable to backdoor poisoning [1,2,3,4]. These attacks consist of injecting poisoning samples at training time, with the goal of forcing the trained model to output an attacker-chosen class when presented with a specific trigger at test time, while working as expected otherwise. To this end, the poisoning samples typically need not only to embed such a backdoor trigger themselves, but also to be labeled as the attacker-chosen class. As backdoor poisoning preserves model performance on clean test data, it is not straightforward for the victim to realize that the model has been compromised.

Backdoor poisoning has been demonstrated in a plethora of scenarios [3,4,5]. In the most common scenario, the user is assumed to download a pre-trained, backdoored model from an untrusted source, to subsequently fine-tune it on their data [3]. As backdoors typically remain effective even after this fine-tuning step, they may successfully be exploited by the attacker at test time [1, 2]. Alternatively, the attacker is assumed to alter part of the training data collected by the user, either to train the model from scratch or to fine-tune a pre-trained model via transfer learning [6, 7]. In such cases, the training labels of poisoning samples may be either fully controlled by the attacker [6] or subject to validation by the victim [7], depending on the considered threat model. Backdoor attacks have been shown to be effective in different applications and against a variety of models, including vision [1] and language models [8], graph neural networks [9], and even reinforcement learning [10]. However, it is still unclear which factors influence the ability of a model to learn a backdoor, i.e., to classify the test samples containing the backdoor trigger as the attacker-chosen class.

In this work, we analyze the process of backdoor learning to identify the main factors affecting the vulnerability of machine learning models against this attack. To this end, we propose a framework that characterizes the backdoor learning process. The learning process of humans is usually portrayed using learning curves, a graphical representation of the relationship between the hours spent practicing and the proficiency in accomplishing a given task. Inspired by this concept, we introduce the notion of backdoor learning curves (Sect. 2). To generate these curves, we formulate backdoor learning as an incremental learning problem and assess how the loss on the backdoor samples decreases as the target model gradually learns them. These backdoor learning curves are independent of the threat model, as they capture both attacks that tamper with the training data and attacks that tamper with the fine-tuning data. The slope of this curve, the backdoor learning slope, which is connected to the notion of influence functions, quantifies the speed with which the model learns the backdoor samples and hence its vulnerability. Additionally, to provide further insights into the backdoor’s influence on the learned classifiers, we propose a way to quantify the backdoor impact on learning parameters, i.e., how much the parameters of a model deviate from their initial values when the model learns a backdoor.

Our experimental analysis (Sect. 3) shows that the factors influencing the success of backdoor poisoning are: (i) the fraction of backdoor samples injected into the training data; (ii) the size of the backdoor trigger; and (iii) the complexity of the target model, controlled via its hyperparameters. Concerning the latter, our experimental findings reveal a region in the hyperparameter space where models are highly accurate on clean samples while also being robust to backdoor poisoning. This region exists as, to learn a backdoor, the target model is required to increase the complexity of its decision function significantly, and this is not possible if the model is sufficiently regularized. This observation will help identify novel criteria to improve existing defenses and inspire new countermeasures.

To summarize, the main contributions of this work are:

  • We introduce backdoor learning curves as a powerful tool to thoroughly characterize the backdoor learning process;

  • We introduce a metric, named backdoor learning slope, to quantify the ability of the classifier to learn backdoors;

  • We identify three important factors that affect the vulnerability against backdoors;

  • We unveil a region in the hyperparameter space in which the classifiers are highly accurate and robust against backdoors, which supports novel defensive strategies.

We conclude the paper by discussing related work in Sect. 4, along with the limitations of our approach and promising future research directions in Sect. 5.

2 Backdoor learning curves

In this section, we introduce our framework to characterize backdoor poisoning by means of learning curves and their slope. Afterward, we introduce two measures to quantify the backdoor impact on the model’s parameters.

Notation. We denote the input data and their labels respectively with \({\varvec{x}} \in {\mathbb {R}}^d\) and \(y \in \{1,\dots ,c\}\), where c is the number of classes. We refer to the untainted, clean training data as \({\mathcal {D}}_{\textrm{tr}} =({\varvec{x}}_i, y_i)_{i=1}^n\), and to the backdoor samples injected into the training set as \({\mathcal {P}}_{\textrm{tr}}=( \hat{{\varvec{x}}}_j, \hat{y}_j)_{j=1}^m\). We refer to the clean test samples as \({\mathcal {D}}_{\textrm{ts}}=( {{\varvec{x}}}_t, {y}_t)_{t=1}^k\) and to the test samples containing the backdoor trigger as \({\mathcal {P}}_{\textrm{ts}}=( \hat{{\varvec{x}}}_t, \hat{y}_t)_{t=1}^k\).

Fig. 1

Backdoor learning curves. Considering an SVM with the RBF kernel (\(\gamma = 10\)) on a toy dataset in two dimensions, we show the influence of model complexity (controlled by the regularization hyperparameter \(\lambda =\frac{1}{C}\)) on backdoor learning. For both the strong (left) and weak (right) regularization settings, we report two plots. The left plot shows the two-dimensional data (dots) and decision surface for different values of \(\beta\) (green lines). The right plot shows the backdoor learning curve, i.e. how the loss decreases as \(\beta\) ranges from 0 to 1, which amounts to learning the backdoor samples. We plot both the loss on the clean test samples (orange dotted line) and on the test samples with the backdoor trigger (blue line). The slope of these curves represents the speed with which the model learns to classify the backdoor samples (black dots) as blue dots, unveiling that strong regularization slows down such a process

Backdoor learning curves. We leverage previous work from incremental learning [11, 12] to study how gradually incorporating backdoor samples affects the learned classifier. In mathematical terms, the learning problem can be formalized as:

$$\begin{aligned} \begin{aligned}&{\varvec{w}}^\star (\beta ) \in \mathop {\text {arg min}}\limits _{{\varvec{w}}} {\mathcal {L}} ({\mathcal {D}}_{\textrm{tr}} \cup {\mathcal {P}}_{\textrm{tr}}, {\varvec{w}}) \\&\quad = L ({\mathcal {D}}_{\textrm{tr}}, {\varvec{w}}) + \beta L({\mathcal {P}}_{\textrm{tr}}, {\varvec{w}}) + \lambda \Omega ({\varvec{w}}) , \end{aligned} \end{aligned}$$
(1)

where L is the loss attained on a given dataset by the classifier with parameters \({\varvec{w}}\), and \({\mathcal {L}}\) is the loss computed on the training points and the backdoor samples, which also includes a regularization term \(\Omega ({\varvec{w}})\) (e.g., \(\Vert {\varvec{w}}\Vert ^2_2\)), weighed by the regularization hyperparameter \(\lambda\). To gradually incorporate the backdoor samples \({\mathcal {P}}_{\textrm{tr}}\) into the learning process, we introduce the hyperparameter \(\beta \in [0,1]\), and increase it from 0 (unpoisoned classifier) to 1 (poisoned classifier). As \(\beta\) increases, the classifier gradually learns the backdoor by adjusting its parameters; for this reason, we make the dependency of the optimal weights \({\varvec{w}}^\star\) on \(\beta\) explicit as \({\varvec{w}}^\star (\beta )\).

We now define the backdoor learning curve as the curve showing the behavior of the classifier loss \(L( {\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (\beta ))\) on the test samples with the backdoor trigger as a function of \(\beta\). In the following, we abbreviate \(L( {\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (\beta ))\) as L. Intuitively, the faster the backdoor learning curve decreases, the more easily the target model is backdoored. The exact mechanism by which the backdoor samples reach the model does not matter for this analysis: our approach captures both the setting in which the training data is altered and the setting in which the fine-tuning data is tampered with.
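To make the procedure concrete, the following sketch (our own illustration in Python, not the authors' released code) traces a backdoor learning curve for an L2-regularized logistic classifier by emulating the \(\beta\)-weighted objective of Eq. 1 through per-sample weights. The array names are hypothetical placeholders: X_tr, y_tr hold the clean training data, X_bd_tr, y_bd_tr the backdoor training samples, and X_bd_ts, y_bd_ts the backdoored test samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def backdoor_learning_curve(X_tr, y_tr, X_bd_tr, y_bd_tr, X_bd_ts, y_bd_ts,
                            lam=10.0, betas=np.linspace(0.0, 1.0, 21)):
    """Loss on the backdoored test set P_ts as beta grows from 0 to 1."""
    X = np.vstack([X_tr, X_bd_tr])
    y = np.concatenate([y_tr, y_bd_tr])
    losses = []
    for beta in betas:
        # Clean samples keep weight 1; backdoor samples are weighed by beta (Eq. 1).
        sw = np.concatenate([np.ones(len(y_tr)), beta * np.ones(len(y_bd_tr))])
        clf = LogisticRegression(C=1.0 / lam, max_iter=5000)  # lambda = 1/C
        clf.fit(X, y, sample_weight=sw)
        losses.append(log_loss(y_bd_ts, clf.predict_proba(X_bd_ts),
                               labels=clf.classes_))
    return betas, np.array(losses)
```

Plotting the returned losses against \(\beta\) yields the backdoor learning curve; running the same loop on the clean test set gives the dashed reference curves shown in Fig. 1.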

We give an example of two such curves under different regularization settings in Fig. 1. The left plots depict a strongly regularized classifier. The corresponding backdoor learning curve (on the right) shows that the classifier achieves low loss and high accuracy on the backdoor samples only after poisoning (when \(\beta =1\)), i.e., only once the loss on the backdoor samples is weighed as much as the loss on the training samples. The classifier on the right, instead, is less regularized and thus more complex. Consequently, this classifier learns to incorporate the backdoor samples much faster (at low \(\beta\)), namely when the loss on the backdoor points still counts less than the loss on the training data. This suggests that the less regularized classifier is more vulnerable to this attack.

Backdoor learning slope. We quantify how fast an untainted classifier can be poisoned by proposing a novel measure, the backdoor learning slope, which measures the velocity with which the classifier learns to classify the backdoor samples correctly. This measure allows us to compare the vulnerability of classifiers trained with different hyperparameters, or under different poisoning scenarios (e.g., when the attacker can inject a different number of poisoning points or create triggers of different sizes), thereby identifying the factors relevant to backdoor learning. Moreover, as we will show, this measure can be used by the system designer to choose an appropriate combination of hyperparameters for the task at hand. To this end, we define the backdoor learning slope as the gradient of the backdoor learning curve at \(\beta = 0\), capturing how quickly the loss on the backdoor samples decreases at the onset of backdoor learning. Formally, the backdoor learning slope can be written as follows:

$$\begin{aligned} \frac{\partial L({\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (\beta ))}{\partial \beta } = \frac{\partial L}{\partial {\varvec{w}}} \frac{\partial {\varvec{w}}^\star }{\partial \beta } , \end{aligned}$$
(2)

where the first term is straightforward to compute, and the second term implicitly captures the dependency of the optimal weights on the hyperparameter \(\beta\). In other words, it requires us to understand how the optimal classifier parameters change when gradually increasing \(\beta\) from 0 to 1, i.e., while incorporating the backdoor samples into the learning process.

To compute this term, as suggested in previous work on incremental learning [11], we assume that, while increasing \(\beta\), the solution keeps satisfying the optimality (Karush–Kuhn–Tucker, KKT) conditions. This equilibrium implies that \(\nabla _\beta \nabla _{{\varvec{w}}} {\mathcal {L}}({\varvec{w}}^\star ) + \frac{\partial {\varvec{w}}^\star }{\partial \beta }\nabla ^2_{{\varvec{w}}} {\mathcal {L}}({\varvec{w}}^\star ) = {\varvec{0}}\). Based on this condition, we obtain the derivative of interest,

$$\begin{aligned} \frac{\partial {\varvec{w}}^\star }{\partial \beta } = - (\nabla ^2_{{\varvec{w}}} {\mathcal {L}}({\varvec{w}}^\star ))^{-1} \cdot \nabla _\beta \nabla _{{\varvec{w}}} {\mathcal {L}}({\varvec{w}}^\star ). \end{aligned}$$
(3)

Substituting it into Eq. 2, we obtain the complete gradient:

$$\begin{aligned} \frac{\partial L({\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (\beta ))}{\partial \beta } = - \nabla _{{\varvec{w}}} L \cdot (\nabla ^2_{{\varvec{w}}} {\mathcal {L}})^{-1} \cdot \nabla _{\beta } \nabla _{{\varvec{w}}} {\mathcal {L}} . \end{aligned}$$
(4)

The gradient in Eq. 4 corresponds to the sum of the pairwise influence function values \({\mathcal {I}}_{\textrm{up, loss}}({\varvec{x}}_{\textrm{tr}}, {\varvec{x}}_{\textrm{ts}})\) used by Koh et al. [13]. The authors indeed proposed to measure how relevant a training point is for the predictions of a given test point by computing \(\left. \frac{\partial L}{\partial \beta } \right| _{\beta =0} = \sum _t \sum _j {\mathcal {I}}_{\textrm{up, loss}}(\hat{{\varvec{x}}}_t, \hat{{\varvec{x}}}_j)\). To understand how this gradient can be efficiently computed via Hessian-vector products and other approximations, we refer the reader to [13] as well as to recent developments in gradient-based bilevel optimization [14,15,16]. Moreover, we show in Sect. 3.2.5 (Figs. 13 and 14) an example of the usage of influence functions for weakly and strongly regularized models.

The main difference between the approach by Koh et al. [13] and ours is that they treat regularization only implicitly, and that we are interested in understanding the vulnerability to a subset of backdoor training points rather than in providing prototype-based explanations. However, directly using the gradient of the loss w.r.t. \(\beta\) comes with two disadvantages. First, the gradient is negative when the backdoor is being learned, as the loss decreases while \(\beta\) grows; second, its magnitude is unbounded, making values hard to compare across classifiers. We thus transform the gradient as:

$$\begin{aligned} ~ \theta = - \frac{2}{\pi } \arctan {\left( \left. \frac{\partial L}{\partial \beta } \right| _{\beta =0} \right) } \in [-1,1] , \end{aligned}$$
(5)

where we use the negative sign so that positive values correlate with faster backdoor learning (i.e., the loss decreases faster as \(\beta\) grows). The \(\arctan\), rescaled by \(2/\pi\), maps the slope onto the interval \([-1,1]\). Hence, a value around 0 implies that the loss on the backdoor samples does not decrease; in other words, the classifier does not learn the backdoor trigger and is hence very robust.
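As an illustration of Eqs. 4 and 5, the sketch below (a minimal example of our own, assuming an L2-regularized logistic classifier with labels in \(\{-1,+1\}\) and no bias term) computes the backdoor learning slope in closed form; w_star denotes the clean solution \({\varvec{w}}^\star(0)\), and the remaining array names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logistic(X, y, w):
    # Gradient of sum_i log(1 + exp(-y_i w^T x_i)) with respect to w.
    m = sigmoid(-y * (X @ w))
    return -(X * (y * m)[:, None]).sum(axis=0)

def hessian_logistic(X, w):
    # Hessian of the logistic loss (independent of the labels).
    s = sigmoid(X @ w)
    return (X * (s * (1.0 - s))[:, None]).T @ X

def backdoor_learning_slope(w_star, X_tr, X_bd_tr, y_bd_tr,
                            X_bd_ts, y_bd_ts, lam):
    # Eq. 4: dL(P_ts)/dbeta at beta = 0, using the KKT-based implicit derivative.
    H = hessian_logistic(X_tr, w_star) + 2.0 * lam * np.eye(len(w_star))
    g_tr = grad_logistic(X_bd_tr, y_bd_tr, w_star)   # nabla_beta nabla_w L
    g_ts = grad_logistic(X_bd_ts, y_bd_ts, w_star)   # nabla_w L on P_ts
    dL_dbeta = -g_ts @ np.linalg.solve(H, g_tr)
    # Eq. 5: rescale to [-1, 1]; values near 1 indicate fast backdoor learning.
    return -2.0 / np.pi * np.arctan(dL_dbeta)
```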

Backdoor impact on learning parameters. The curves and the slope introduced above quantify how fast backdoors are learned by the model. To provide further insights into the backdoor’s influence on the learned classifier, we propose to monitor how the classifier’s parameters deviate from their initial, unbackdoored values once a backdoor is added. Our approach below only captures convex learners. As shown by Zhang et al. [17], the impact of a network weight on a non-convex classifier’s decisions depends on the layer it belongs to. Therefore, measuring the parameter deviation in the non-convex case is challenging, and we leave this open problem for future work.

To capture the backdoor impact on learning parameters in the convex case, we consider the initial weights \({\varvec{w}}_0 = {\varvec{w}}^\star (\beta =0)\) and \({\varvec{w}}_\beta = {\varvec{w}}^\star (\beta )\) for \(\beta >0\), and measure two quantities:

$$\begin{aligned} \rho = \Vert {\varvec{w}}_\beta \Vert \in [0, \infty ), \; \textrm{and } \; \nu = \frac{1}{2} \left( 1- \frac{{\varvec{w}}_0^{\top }{\varvec{w}}_\beta }{ \Vert {\varvec{w}}_0 \Vert \Vert {\varvec{w}}_\beta \Vert } \right) \in [0,1] . \end{aligned}$$
(6)

The first measure, \(\rho\), quantifies how the magnitude of the weights changes as \(\beta\) increases; this quantity is closely related to the regularization term \(\Omega ({\varvec{w}})\) used for learning. The second one, \(\nu\), quantifies the change in the orientation of the classifier: in a nutshell, we compute the angle between the two weight vectors and rescale it to lie in the interval [0, 1]. Both metrics grow as \(\beta \rightarrow 1\); in other words, the backdoored classifier deviates more and more from the original classifier. Consequently, in the empirical parameter deviation plots in Sect. 3.2, we report the values of \(\rho\) and \(\nu\) (on the y-axis) as \(\beta\) (on the x-axis) varies from 0 to 1, to show how the classifier parameters are affected by backdoor learning.
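A direct implementation of Eq. 6 is straightforward; the following snippet (our own sketch, for a generic pair of weight vectors) returns both measures.

```python
import numpy as np

def parameter_deviation(w0, w_beta):
    # rho: norm of the (partially) backdoored weights, Eq. 6 (left).
    rho = np.linalg.norm(w_beta)
    # nu: rescaled angle between the clean and backdoored weights, Eq. 6 (right).
    cos = (w0 @ w_beta) / (np.linalg.norm(w0) * np.linalg.norm(w_beta))
    nu = 0.5 * (1.0 - cos)
    return rho, nu
```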

3 Experiments

Employing the previously proposed methodology, we carried out an empirical analysis of linear and nonlinear classifiers. In this section, we start with the experiments aimed at studying the impact of different factors on backdoor learning. To this end, we employ the backdoor learning curves and the backdoor learning slope to study how the capacity of the model to learn backdoors changes when varying (a) the model’s complexity, defined by its hyperparameters, (b) the attacker’s strength, defined by the percentage of poisoning samples in the training set, and (c) the trigger size and visibility. Our results show that these factors significantly determine how fast the backdoor is learned and, consequently, the model’s vulnerability. Then, leveraging the proposed measures to analyze how the classifier’s parameters change during backdoor learning, we provide further insights into the effect of the aforementioned factors on the trained model. The results presented in this section will help identify novel criteria to improve existing defenses and inspire new countermeasures. The source code is available on the authors’ GitHub page.

3.1 Experimental setup

Our work investigates which factors influence backdoor vulnerability considering convex learners and neural networks. In the following, we describe our datasets, models, and the backdoor attacks studied in our experiments.

Datasets. We carried out our experiments on MNIST [18], CIFAR10 [19] and Imagenette [20]. Supplementary details on the datasets are reported in Appendix A.

When adopting convex learners, we consider two-class subproblems, as in the work by Saha et al. [7] and Suya et al. [21]. On MNIST, we choose the pairs \(7~\textrm{vs}~1\), \(3~\textrm{vs}~0\), and \(5~\textrm{vs}~2\), as our models exhibited the highest clean accuracy on these pairs. On CIFAR10, analogous to prior work [7], we choose airplane vs frog, bird vs dog, and airplane vs truck. On Imagenette, we randomly choose tench vs truck, cassette player vs church, and tench vs parachute. For each two-class subtask, we use 1500 and 500 samples as training and test sets, respectively. In the following section, we focus on the results of one pair for each dataset: \(7~\textrm{vs}~1\) on MNIST, airplane vs frog on CIFAR10, and tench vs truck on Imagenette. The results of the other pairs (reported in Appendix B) are analogous. When testing our framework against neural networks, we train on all ten classes of Imagenette. We use \(70\%\) and \(30\%\) of the entire dataset for training and testing, respectively.

Models and training phase. To thoroughly analyze how learning a backdoor affects a model, we consider different learning algorithms, including linear Support Vector Machines (SVMs), Logistic Regression Classifiers (LCs), Ridge Classifiers (RCs), nonlinear SVMs using the Radial Basis Function (RBF) kernel, and deep neural networks. On the MNIST dataset, we train the classifiers directly on the pixel values scaled in the range [0, 1]. For CIFAR10 and Imagenette, we instead consider a transfer learning setting frequently adopted in the literature [13, 22, 23]. Like Saha et al. [7], we use the pre-trained model AlexNet [24] as a feature extractor, and train the convex learners on top of it. We study these convex learners due to their broad usage in industry [21], which derives from their excellent results on smaller datasets [25] and their good interpretability [26, 27]. In addition, we include in our evaluation pre-trained Resnet18 and Resnet50 [28] deep neural networks, sourced from Torchvision [29], which are among the most extensively employed architectures [17]. These networks are fine-tuned to accurately classify samples from the Imagenette dataset.

Hyperparameters. The choice of hyperparameters has a significant impact on the learned decision function. For example, some of these hyperparameters control the complexity of the learned function, which may lead to overfitting [30], thereby potentially compromising classification accuracy on test samples. We argue that a high complexity may also lead the model to assign higher importance to outlying samples, including backdoor samples, and thus has a crucial impact on the capacity of the model to learn backdoors. To verify this hypothesis, we consider different configurations of the models’ hyperparameters. For convex learners, we study two hyperparameters that impact model complexity, i.e., the regularization hyperparameter \(\lambda =\frac{1}{C}\) and the RBF kernel hyperparameter \(\gamma\). To this end, we take 10 values of \(\lambda\) on a uniformly spaced interval on a log scale from \({1}\textrm{e}{-04}\) to \({1}\textrm{e}{+02}\). For the Imagenette dataset, we extend this interval to \([{1}\textrm{e}{-05}, {1}\textrm{e}{+02}]\). Concerning the RBF kernel, we let \(\gamma\) take 5 uniformly spaced values on a log scale in \([{5}\textrm{e}{-04}, {5}\textrm{e}{-02}]\) for MNIST, \([{1}\textrm{e}{-04}, {1}\textrm{e}{-02}]\) for CIFAR10, and \([{1}\textrm{e}{-05}, {1}\textrm{e}{-03}]\) for Imagenette. Furthermore, for the RBF kernel we take 10 values of \(\lambda\) in the uniformly log-spaced interval \([{1}\textrm{e}{-01}, {1}\textrm{e}{+02}]\). This yields 10 and 50 hyperparameter combinations for the linear classifiers and the RBF SVM, respectively.
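The grids above can be generated with a few lines of numpy; the sketch below is our own reconstruction of the stated ranges (the number of grid points for the extended Imagenette interval is assumed to remain 10) and may differ in minor details from the released code.

```python
import numpy as np

lambdas_linear   = np.logspace(-4, 2, 10)   # lambda = 1/C for linear classifiers
lambdas_imgnette = np.logspace(-5, 2, 10)   # extended interval for Imagenette
lambdas_rbf      = np.logspace(-1, 2, 10)   # lambda grid used with the RBF kernel
gammas_mnist     = np.logspace(np.log10(5e-4), np.log10(5e-2), 5)
gammas_cifar10   = np.logspace(-4, -2, 5)
gammas_imgnette  = np.logspace(-5, -3, 5)
```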

For deep neural networks, we consider two different numbers of epochs, 10 and 50, and increase the number of neurons by using Resnet50 instead of Resnet18. Whereas network size intuitively correlates with complexity, previous works, including [31], show that decreasing the number of training epochs also reduces the complexity of the trained network. Conversely, increasing the number of epochs leads to overfitting on the training dataset and, thus, to a more complex decision function. Each network is fine-tuned using the SGD optimizer with a learning rate of 0.001, a momentum of 0.9, and a batch size of 256.
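The fine-tuning setup described above can be sketched as follows (a simplified loop of our own, assuming a standard Imagenette DataLoader with batch size 256 is available; the actual training script may differ):

```python
import torch
from torch import nn
from torchvision import models

def finetune_resnet18(train_loader, num_classes=10, epochs=10, device="cuda"):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained
    model.fc = nn.Linear(model.fc.in_features, num_classes)           # replace the head
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                 # 10 or 50 epochs in the experiments above
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```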

Backdoor attacks. We implement the backdoor attacks proposed by Gu et al. [1] against MNIST and CIFAR10. More concretely, we use a random \(3 \times 3\) patch as the trigger for MNIST, while on CIFAR10 we increase the size to \(8\times 8\) to strengthen the attack [7]. We add the trigger pattern in the lower-right corner of the image [1]. Samples from MNIST and CIFAR10 with and without the trigger can be found in Fig. 12. However, in contrast to previous approaches [1], we use a separate trigger for each base class i. The reason is that our study encompasses linear models, which are unable to associate the same trigger pattern with two different classes. By using different trigger patterns, we enhance the effectiveness of the attack on these linear models. On the Imagenette dataset, we use the backdoor trigger developed by Zhong et al. [32]. This attack injects a patterned perturbation mask into the training samples to open the backdoor; a constant \(c_m\) controls the maximum allowed intensity of the perturbation. We apply the backdoor attacks to 10% of the training data if not stated otherwise, and, as done by Gu et al. [1], we force the backdoored model to predict the i-th class as class \((i + 1)\% n\_classes\) when the trigger is shown. We also report additional experiments concerning variations in the trigger’s size or visibility.
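The patch-based poisoning described above for MNIST-like data can be sketched as follows (our own illustration, assuming grayscale images scaled in [0, 1] and stored as (H, W) numpy arrays; the released code may differ, e.g., it uses a separate patch per base class):

```python
import numpy as np

rng = np.random.default_rng(0)
TRIGGER = rng.random((3, 3))              # one fixed random 3x3 patch

def add_trigger(image, trigger=TRIGGER):
    poisoned = image.copy()
    h, w = trigger.shape
    poisoned[-h:, -w:] = trigger          # lower-right corner, as in Gu et al. [1]
    return poisoned

def poison_training_set(X, y, n_classes, rate=0.1):
    Xp, yp = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(rate * len(X)), replace=False)
    for i in idx:
        Xp[i] = add_trigger(X[i])
        yp[i] = (y[i] + 1) % n_classes    # label flipped to (i + 1) % n_classes
    return Xp, yp
```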

3.2 Experimental results

We now discuss our experimental results, obtained with the datasets, classifiers, and backdoor attacks described above.

Fig. 2

Backdoor learning curves for: (top row) logistic classifier (LC) on MNIST 7 vs. 1 with \(\lambda \in \{10, 0.01\}\) and trigger size \(3 \times 3\) (left) or \(6 \times 6\) (right); (middle row) Ridge classifier on CIFAR10 airplane vs frog with \(\lambda \in \{100000, 100\}\) and trigger size \(8\times 8\) (left) or \(16\times 16\) (right); (bottom row) RBF SVM with \(\gamma ={1}\textrm{e}{-04}\) on Imagenette tench vs truck with \(\lambda \in \{10, 0.1\}\) and trigger visibility \(c_m=10\) (left) or \(c_m=75\) (right). Darker lines represent a higher fraction of poisoning samples p injected into the training set. We report the loss on the clean test samples (TS) with a dashed line and on the test samples with the backdoor trigger (TS+BT) with a solid line

Fig. 3

Backdoor learning curves for Resnet18 trained on the full Imagenette training dataset with 10 and 50 epochs. Darker lines represent a higher fraction of poisoning samples p injected into the training set. We report the loss on the clean test samples (TS) with a dashed line and on the test samples with the backdoor trigger (TS+BT) with a solid line

Fig. 4

Backdoor learning curves for MNIST \(7~\textrm{vs}~1\) (top row), CIFAR10 airplane vs frog (middle row) and Imagenette tench vs truck (bottom row) when changing the kernel parameter \(\gamma\) on RBF SVM. Darker lines represent a higher fraction of poisoning samples p injected into the training set. We report the loss on the clean test samples (TS) with a dashed line and on the test samples with the backdoor trigger (TS+BT) with a solid line

3.2.1 Backdoor learning curves

Here we present the results obtained with the proposed backdoor learning curves, studying the impact of three different factors on the backdoor learning process: (i) model complexity, (ii) the fraction of backdoor samples injected, and (iii) the size and visibility of the backdoor trigger. We report the impact of these factors on the backdoor learning curves in Figs. 2 and 3. More specifically, in Fig. 2 we consider convex classifiers (i.e., LC, RC, and RBF SVM) trained on two-class subproblems (MNIST, CIFAR10, and Imagenette), whereas in Fig. 3 we show the results for Resnet18 trained on all ten classes of Imagenette.

To analyze the first factor, we report the results on the same classifiers while changing the hyperparameters that influence their complexity. In the case of convex learners, we test different values of the regularization coefficient, while for Resnet18, we increase the number of epochs. To analyze the impact of the second factor, we plot the backdoor learning curves when the attacker injects an increasing percentage of poisoning points, \(p \in \{0.01, 0.1, 0.2\}\) for convex learners and \(p \in \{0.05, 0.15\}\) for Resnet18. Finally, to study the third factor, namely the size and visibility of the backdoor trigger, we create the same backdoor learning curves after doubling the size of the patch triggers for MNIST and CIFAR10, and after increasing the trigger’s visibility for Imagenette. Even when a high percentage of poisoning points is injected, for sufficiently flexible classifiers, the loss on the clean test samples remains almost constant. The loss on the test set containing the backdoor trigger, instead, is highly affected by the factors mentioned above. Both lower regularization, i.e., a smaller \(\lambda\) or a larger number of epochs (and thus higher complexity), and a larger p (a higher percentage of injected poisoning points) increase the slope of the backdoor learning curve, meaning that the classifier learns the backdoor faster. When the classifier is sufficiently complex, even a low percentage of poisoning points is enough to rapidly induce the classifier to learn the backdoor. On the other hand, this does not apply to highly regularized classifiers, which generally learn backdoors more slowly. Therefore, limiting the classifier’s complexity by selecting an appropriate regularization coefficient may mitigate the vulnerability to backdoors. Furthermore, our results demonstrate that larger trigger sizes lead to faster backdoor learning, particularly when the classifiers are weakly regularized. The same observation holds when increasing the trigger’s visibility, highlighting the well-known trade-off between the attacker’s strength and detectability introduced by Frederickson et al. [33]. The attacker can enhance the backdoor’s effectiveness by increasing the trigger size or its visibility; however, these adjustments also make it easier for the defender to detect the attack.

Concerning the RBF SVM’s robustness to backdoors, we analyzed the backdoor learning curves for different values of \(\gamma\), which determines the curvature of the RBF kernel, along with how the classifier’s parameters change due to backdoor learning. We depict the learning curves in Fig. 4. On all datasets, reducing \(\gamma\) results in flatter backdoor learning curves and increased test loss, indicating greater robustness.

Remarks. Overall, our experiments show that, to learn a backdoor, a classifier has to increase its complexity (if it is not already highly complex). However, such an increase in complexity is limited when the classifier is highly regularized or when the attack strength is constrained. For this reason, highly regularized classifiers are preferable in terms of backdoor robustness. We present supplementary results for other classifiers in Appendix B, validating the trends described above.

Fig. 5

Backdoor slope \(\theta\) vs clean accuracy (red) and backdoor effectiveness (blue) on MNIST 7 vs. 1 with backdoor trigger size \(3\times 3\) (top row) and \(6 \times 6\) (bottom row). We measure the classification accuracy on the untainted test samples (TS), and on the same samples after injecting the backdoor trigger (TS+BT). We chose the \(\gamma\) parameter for the RBF kernel as \(\gamma _0={5}\textrm{e}{-04}\) (orange triangle for clean data, light blue plus for data with trigger) and \(\gamma _1={5}\textrm{e}{-03}\) (red inverted triangle for clean data, dark blue x for data with trigger)

Fig. 6

Backdoor slope \(\theta\) vs clean accuracy (red) and backdoor effectiveness (blue) on CIFAR10 airplane vs frog with backdoor trigger size \(8 \times 8\) (top row), and \(16 \times 16\) (bottom row). We measure the classification accuracy on the untainted test samples (TS), and on the same samples after injecting the backdoor trigger (TS+BT). We chose the \(\gamma\) parameter for the RBF kernel as \(\gamma _0={1}\textrm{e}{-04}\) (orange triangle for clean data, light blue plus for data with trigger) and \(\gamma _1={1}\textrm{e}{-03}\) (red inverted triangle for clean data, dark blue x for data with trigger)

Fig. 7

Backdoor slope \(\theta\) vs clean accuracy (red) and backdoor effectiveness (blue) on Imagenette tench vs truck with trigger visibility \(c_m=10\), i.e. almost imperceptible, (top row) and \(c_m=75\) (bottom row). We measure the classification accuracy on the untainted test samples (TS), and on the same samples after injecting the backdoor trigger (TS+BT). We chose the \(\gamma\) parameter for the RBF kernel as \(\gamma _0={1}\textrm{e}{-05}\) (orange triangle for clean data, light blue plus for data with trigger) and \(\gamma _1={1}\textrm{e}{-04}\) (red inverted triangle for clean data, dark blue x for data with trigger)

3.2.2 Backdoor slope

From the previous results, we have seen that reducing complexity through regularization increases robustness against backdoors. For a deeper understanding of the impact of model complexity on backdoor learning, we leverage the proposed backdoor learning slope. In our experiments on convex learners, we fix the fraction of injected poisoning points to 0.1, as done by Gu et al. [1], and we report a dot for each combination of \(\lambda\) and \(\gamma\) specified in Sect. 3.1. Figures 5, 6 and 7 show the relationship between the backdoor slope and the backdoor effectiveness, measured as the percentage of samples with the trigger that mislead the classifier, for MNIST, CIFAR10 and Imagenette, respectively. We report the accuracy on the clean test dataset (TS) and on the test dataset with the backdoor trigger (TS+BT). For the RBF SVM, we report the accuracy for two different values of \(\gamma\).

Interestingly, our plots show a region where the accuracy of the classifiers on benign samples is high, yet the classifier exhibits low accuracy on samples with triggers. For linear classifiers, this region corresponds to strongly regularized (i.e., low-complexity) classifiers. In the case of the RBF SVM, the best trade-off is achieved with high \(\lambda\) (strong regularization) and small \(\gamma\), which also constrain the SVM’s complexity. Our results thus indicate that, in these cases, the classifier is not flexible enough to learn the backdoor in addition to the clean samples. Conversely, as long as the classifier has enough flexibility, it can learn the backdoor without sacrificing clean test accuracy. In a nutshell, by choosing the hyperparameters appropriately, we can obtain a classifier able to learn the original task but not the backdoor. However, there is a trade-off between the accuracy on the original task and the robustness against the backdoor. In Figs. 5, 6 and 7, we extend the comparison between the backdoor learning slope and the attack effectiveness, considering a stronger attack that exploits larger or more visible triggers. With these attacks, the trade-off region shrinks, resulting in fewer hyperparameter configurations that yield a robust model. This result aligns with our earlier findings based on backdoor learning curves: as the attack strength increases, the learning curve descends more steeply, exhibiting a higher backdoor slope. Our results suggest that system designers should thus regularize their models as much as possible, as long as the trade-off with accuracy remains acceptable, to deploy machine learning models that are more robust against potential backdoor attacks.

Fig. 8

Influence of \(\gamma\) (x-axis) and \(\lambda\) (y-axis) on the backdoor effectiveness (right) and clean accuracy (left) for CIFAR10 airplane vs frog (top row) and Imagenette tench vs truck (bottom row). The backdoor is mounted with trigger size \(8 \times 8\) for CIFAR10 and visibility \(c_m=75\) for Imagenette. The plots show that there are hyperparameter configurations for which clean accuracy is high (red regions on the left plots), while the accuracy on the backdoored points is low (blue regions on the right plots)

As a final check to assess the reliability of the backdoor learning slope, we plot in Fig. 8 the clean and backdoor accuracy obtained when training an RBF SVM with different hyperparameter configurations. For each configuration, we train the classifier on the poisoned dataset. On the top row, we show the results for CIFAR10 airplane vs frog, and on the bottom row the results for Imagenette tench vs truck. We follow the same backdoor setting used for the backdoor slope in Figs. 5, 6 and 7, i.e., trigger size \(8 \times 8\) for CIFAR10 and trigger visibility \(c_m=75\) for Imagenette. Analogous to our previous findings, there exists a trade-off region where the clean accuracy is high (red) while the backdoor accuracy is low (blue), suggesting higher robustness against backdoors. As for the backdoor slope, the best region is obtained by reducing the model’s complexity, i.e., by increasing the regularization or reducing \(\gamma\). Consequently, we conclude that smaller values of \(\gamma\) can effectively guide the model toward a more robust configuration against backdoor attacks. Nevertheless, it also emerges that the most decisive factor influencing model robustness remains the regularization term.

Table 1 Backdoor learning slope for Resnet18 and Resnet50 when increasing the percentage of backdoor poisoning p, the number of epochs (#Epochs), and the parameter h used for the estimate in Eq. 7

While this measure works well on convex learners, its roots in influence functions prevent a direct application to deep neural networks. As pointed out in [34], the analytical gradient in Eq. 5 at \(\beta = 0\) is unstable for deep neural networks. To overcome this deficiency, we estimate it with a finite-difference approximation, obtaining:

$$\begin{aligned} \left. \frac{\partial L}{\partial \beta } \right| _{\beta =0} = \frac{L({\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (h)) - L({\mathcal {P}}_{\textrm{ts}}, {\varvec{w}}^\star (0))}{h} \,. \end{aligned}$$
(7)

We report the results for Resnet18 and Resnet50 in Table 1, where we use \(h \in \{0.01, 0.1, 0.2, 1\}\). For each combination of poisoning percentage and number of epochs, we report the estimate of the backdoor learning slope obtained with the different values of h. The closer h is to 0, the closer the backdoor slope of the neural network is to 1. This result is consistent with Fig. 3, where the backdoor learning curves drop similarly fast, suggesting a high vulnerability of the model in the presence of backdoor samples. A subtle difference is that, for larger h, the higher vulnerability of neural networks trained with more epochs or with a higher percentage of poisoning points becomes more evident.
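For reference, the finite-difference estimate of Eq. 7 amounts to training the model twice; the helper names in the sketch below (train_model, loss_on_backdoored_test) are hypothetical placeholders for a routine that trains with the backdoor loss weighed by \(\beta\) and one that evaluates the loss on \({\mathcal {P}}_{\textrm{ts}}\).

```python
import numpy as np

def finite_difference_slope(train_model, loss_on_backdoored_test, h=0.1):
    # Eq. 7: (L(P_ts, w*(h)) - L(P_ts, w*(0))) / h, then rescaled as in Eq. 5.
    loss_h = loss_on_backdoored_test(train_model(beta=h))
    loss_0 = loss_on_backdoored_test(train_model(beta=0.0))
    dL_dbeta = (loss_h - loss_0) / h
    return -2.0 / np.pi * np.arctan(dL_dbeta)
```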

Remarks on model selection/hyperparameter tuning. The observed degradation of the accuracy on the backdoored samples (blue dots) at lower slopes in Figs. 5, 6 and 7 suggests that models with a minimal backdoor learning slope, induced by strong regularization, remain robust. Quite surprisingly, this does not have a substantial impact on the classification accuracy on the pristine samples (red triangles), highlighting an advantageous trade-off for defending against backdoor attacks. In other words, there is a wide region in the hyperparameter space in which the model still keeps a very high accuracy on clean data but is essentially unable to learn the backdoor samples. This trade-off enables the defender to find a sweet spot in which the model is sufficiently robust to backdoor attacks. In practice, this means that well-regularized models can be made resilient to backdoor attacks with a negligible impact on classification accuracy, simply by choosing the hyperparameters appropriately, including the regularization term \(\lambda\) and the number of epochs or neurons. In particular, such hyperparameters can be tuned to regularize the learning process as much as possible while retaining an acceptable classification accuracy for the task at hand. We conclude this section by pointing out that this finding is far from trivial, as also highlighted in recent work: Bagdasaryan and Shmatikov [35] claim that using well-regularized models is the only effective and applicable defense for systems that are deployed and maintained in practical machine-learning operations (MLOps) pipelines, as any other defensive technique would require significant and costly changes to the deployment pipeline infrastructure, inducing a high technical debt on future system maintenance.

3.2.3 Empirical parameter deviation plots

After having investigated which factors influence backdoor effectiveness, we shift our focus to examining how the model’s weights change during the training process when the dataset is tainted with backdoor samples. In particular, we aim to determine whether learning the backdoor increases the model’s complexity.

We use the two measures proposed in Sect. 2, \(\rho\) and \(\nu\), to analyze the parameter change. The former, \(\rho\), monitors the change in the magnitude of the weights, i.e., whether they increase or decrease. The latter, \(\nu\), measures the change in the orientation, or angle, of the classifier. We plot both measures for different regularization parameters, trigger sizes, and visibility values, fixing the fraction of poisoning points to \(p=0.1\), in Figs. 9, 10 and 11. Within each plot, we also report the backdoor accuracy (BA), representing the model’s performance on backdoor samples at the end of training.

On linear classifiers, \(\rho ({\varvec{w}})\) increases during the backdoor learning process. This corresponds to an increase in the magnitude of the weights, suggesting that the classifiers become more complex while learning the backdoor. However, when investigating the RBF SVM, the results are slightly different. Indeed, when increasing \(\gamma\) and decreasing \(\lambda\), the classifier is already flexible and complex enough to learn the backdoor without further increasing its complexity. On the other hand, when decreasing \(\gamma\), the model is constrained to behave similarly to a linear classifier. In this case, analogously to linear classifiers, the model needs to increase its complexity to learn the backdoor. When increasing the trigger size or visibility, the results are similar, confirming the previous analysis. However, as a result of the increased attacker’s strength, the backdoor accuracy turns out to be higher.

Fig. 9

Backdoor weights deviation for the logistic classifier (LC), support vector machine (SVM), the ridge classifier (RC) and SVM with RBF kernel on MNIST \(7~\mathrm {vs.}~1\) poisoned with backdoor trigger [1]. We report the results for trigger size \(3 \times 3\) (top row) and \(6 \times 6\) (bottom row). We specify the regularization parameter \(\lambda\) and backdoor accuracy (BA) for each setting in the legend of each plot

Fig. 10

Backdoor weights deviation for the logistic classifier (LC), support vector machine (SVM), the ridge classifier (RC) and SVM with RBF kernel on CIFAR10 airplane vs frog poisoned with backdoor trigger [1]. We report the results for trigger size \(8 \times 8\) (top row) and \(16\times 16\) (bottom row). We specify the regularization parameter \(\lambda\) and backdoor accuracy (BA) for each setting in the legend of each plot

Fig. 11

Backdoor weights deviation for the logistic classifier (LC), support vector machine (SVM), the ridge classifier (RC) and SVM with RBF kernel on Imagenette tench vs truck poisoned with backdoor trigger [32]. We report the results for visibility \(c_m=10\) (top row) and \(c_m=75\) (bottom row). We specify the regularization parameter \(\lambda\) and backdoor accuracy (BA) for each setting in the legend of each plot

3.2.4 Explaining backdoor predictions

In the following, we give a graphical interpretation of the decision function of the poisoned convex classifiers, expressed through their internal weights, which are easier to interpret [26, 27]. We consider the results for a backdoor trigger [1] placed in a specific position, as its influence on the classifier’s decision is easier to see. Conversely, the backdoor trigger by Zhong et al. [32], for example, spans the entire image, and its influence is therefore harder to spot in the interpretability plots. In particular, given a sample x, we compute and show the gradient of the classifier’s decision function with respect to x. We use an SVM with regularization \(\lambda = {1}\textrm{e}{-02}\) for MNIST \(7~\textrm{vs}~1\) and CIFAR10 airplane vs frog, and report the results in Fig. 12. For MNIST, we consider a digit 7 with the trigger and show the gradients of the decision function of both the clean and the poisoned classifier, computed on the clean and backdoored inputs. Since we train a linear classifier directly on the input space, this gradient coincides with the classifier’s weights. Intriguingly, the poisoned classifier’s weights increase in magnitude and now exhibit high values in the bottom-right corner, where the trigger is located. For CIFAR10, we show a poisoned airplane and report the gradient mask obtained by taking the maximum value over the three channels, both for the clean and the backdoored classifier. Also in this case, the backdoored model shows higher values in the bottom-right region, corresponding to the trigger location. This means that the analyzed classifiers assign high importance to the trigger when discriminating the class of the input points.
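For a linear model trained directly on the pixels, this visualization reduces to reshaping the weight vector into an image; a minimal sketch of our own follows, assuming a scikit-learn linear classifier on 28x28 MNIST inputs (plot_input_gradient is a hypothetical helper name).

```python
import matplotlib.pyplot as plt

def plot_input_gradient(linear_clf, shape=(28, 28)):
    # For f(x) = w^T x + b, the input gradient d f / d x is the weight vector w.
    grad = linear_clf.coef_.ravel().reshape(shape)
    plt.imshow(grad, cmap="seismic")
    plt.colorbar()
    plt.show()
```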

Summarizing, the plots in Fig. 12 further confirm our findings regarding the change of the internal parameters during the backdoor learning process. In particular, we have seen that less flexible classifiers need to increase their weights, and thus their complexity, to learn the backdoor. Conversely, when the flexibility of the classifier increases, it can learn the backdoor more easily without significantly altering its complexity.

Fig. 12

Input gradients of untainted and poisoned SVMs on pristine (top row) and backdoored (bottom row) test samples. Each row shows two sets of three images. Each set contains an example from MNIST 7 vs 1 or CIFAR10 airplane vs frog (left), along with the corresponding input gradient of the untainted SVM (middle), and of the poisoned SVM (right). For CIFAR10, we consider the maximum gradient of each pixel among the three channels

3.2.5 Visualizing influential training data points

Influence functions are used in the context of machine learning to identify the training points most responsible for a given prediction [13]. In Sect. 2, we have seen how they form the basis of our backdoor learning slope. In this section, we employ them to provide further insight into the relationship between complexity and backdoor effectiveness. To this end, as in Sect. 3.1, we poison \(10\%\) of the training dataset. In line with the previous experiments, we employ the backdoor trigger in [1] for MNIST and CIFAR10, with trigger size \(3 \times 3\) and \(6 \times 6\) respectively, while for Imagenette we employ the trigger by Zhong et al. [32] with higher visibility (i.e., \(c_m=75\)). In Figs. 13 and  14, considering respectively a high- and a low-complexity classifier, we report the seven most influential training samples for the classification of a randomly chosen test point. For high-complexity classifiers, many of these training samples contain the trigger; in contrast, this is not the case for low-complexity classifiers. These results suggest that low-complexity classifiers rely less on the samples containing the backdoor trigger when making their predictions.
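For completeness, the ranking used to select these samples can be sketched as follows (our own illustration for a convex learner; grad_loss and hessian are hypothetical helpers returning, respectively, the per-sample loss gradient and the regularized Hessian of the training objective at the learned weights w_star):

```python
import numpy as np

def top_influential(w_star, x_ts, y_ts, X_tr, y_tr, grad_loss, hessian, k=7):
    # Pairwise influence of each training point on the test loss, as in [13]:
    # I_up,loss = -grad_w L(z_test)^T H^{-1} grad_w L(z_train).
    H = hessian(X_tr, y_tr, w_star)
    v = np.linalg.solve(H, grad_loss(x_ts, y_ts, w_star))
    scores = np.array([-grad_loss(x, y, w_star) @ v for x, y in zip(X_tr, y_tr)])
    # Rank training points by the magnitude of their influence.
    return np.argsort(np.abs(scores))[-k:][::-1]
```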

Fig. 13

Influential training points for a high-complexity classifier. Considering an SVM with \(\lambda =0.01\) trained on MNIST, and with \(\lambda =0.1\) trained on CIFAR10, and Imagenette, we show the top 7 most influential training samples on the prediction of the samples with the red border

Fig. 14

Influential training points for low-complexity classifiers. Considering an SVM with \(\lambda =1e-3\) trained on MNIST, and with \(\lambda = 1e-5\) trained on CIFAR10, and Imagenette, we show the top 7 most influential training samples on the prediction of the samples with the red border

4 Related work

We first review the literature about backdoor poisoning attacks and defenses. Afterward, we focus on defenses that increase the robustness against backdoors by reducing the model’s complexity. We conclude the section by discussing the relationship between our proposed framework and influence functions.

Backdoor poisoning. Although backdoors were introduced only recently [1, 3, 6], a plethora of backdoor attacks and defenses have already been published. For a more detailed overview, we refer the reader to the surveys in this area [3, 4, 36]. Despite the quickly growing literature on this topic, the majority of previous works [21, 33, 37, 38] study different types of poisoning attacks, i.e., not backdoors. In contrast, only a few works have studied the factors that influence the success of this attack. Baluta et al. [39] and the work in [40] studied the relationship between backdoor effectiveness and the percentage of backdoored samples. Salem et al. [41] experimentally investigated the relationship between backdoor effectiveness and trigger size. Similarly, Severi et al. [42] analyzed the correlation between backdoor success and the attacker’s strength on malware classifiers. Schwarzschild et al. [43] evaluated the performance of backdoor attacks when scaling the dataset size while fixing the poison budget. Finally, Li et al. [44] demonstrated that backdoor performance is sensitive to the location of the trigger on the attacked image. We instead do not limit our study to neural networks, but also consider other models. Furthermore, we investigate other relevant factors, e.g., regularization and trigger visibility, and their interaction.

Complexity and backdoor defenses. In this work, we have analyzed the relationship between backdoor effectiveness and different factors, including complexity, controlled via regularization and the RBF kernel hyperparameter, and we have demonstrated that reducing complexity by choosing appropriate hyperparameter values improves robustness against backdoors. Our findings align with the insights presented by Frénay et al. [45], who suggested that overfitting-avoidance techniques, e.g., regularization, can offer partial mitigation against random label noise [46, 47]. Expanding upon their discourse, we apply and extend this consideration to the context of backdoor attacks, wherein the noise is intentionally and strategically introduced to deceive the machine learning model. Some of the defenses proposed against backdoors use different techniques to reduce complexity, including pruning [48, 49], data augmentation [50, 51] and gradient shaping [52]. However, from these works, it remains unclear why reducing complexity alleviates the threat of backdoor poisoning. To the best of our knowledge, our work is the first to investigate this aspect.

Relation to influence functions. Influence functions originated in robust statistics [53] and were later used as a tool to measure the influence of specific training points on the classification output [13, 54]. In our work, we clarify that influence functions naturally descend from the incremental learning formulation in Eq. 1, showing that they quantify the velocity with which the classifier will learn new points. As seen in Sect. 2, they correspond to the partial derivative of the learning curve at the point \(\beta = 0\). Moreover, we leveraged them by proposing a measure, namely the backdoor slope, which quantifies the ability of a classifier to learn backdoors. This measure allowed us to study the factors that impact backdoor effectiveness.

Several defense approaches confirm that influence functions, or gradients during training, are indeed related to backdoor learning. For example, some defenses are directly based on the gradient [55], on gradient differences [56, 57], or on differential privacy, which adds noise to the gradients during training [52, 58, 59].

5 Conclusions, limitations and future work

In this paper, we presented a framework to analyze the factors influencing the effectiveness of backdoor poisoning. We carried out experiments on convex learners, also used in transfer-learning scenarios, and neural networks. As in previous work [7, 13], we focus our analysis on two-class classification problems for convex learners, and on multiclass classification when considering neural networks.

Our analysis shows that the effectiveness of backdoor attacks inherently depends on (i) the complexity of the target model, (ii) the fraction of backdoor samples in the training set, and (iii) the size and visibility of the backdoor trigger. By analyzing the influence of the first factor on backdoor learning, we are the first to unveil a region in the hyperparameter space where the accuracy on clean test samples remains high while the accuracy on backdoor samples is low. Specifically, we discovered that the target model needs to significantly increase the complexity of its decision function to learn backdoors, which is only possible when the model is not regularized enough. Conversely, when raising the model’s regularization, we can keep high performance on clean samples and be unaffected by potential backdoor attacks. However, increasing the attacker’s strength, i.e., the last two factors, makes the attack more effective, shrinking this region and thus exposing the model to greater vulnerability. We, therefore, conclude that a prudent strategy to preserve robustness against potential poisoning attacks is to regularize as much as possible during the hyperparameter optimization phase, thereby reducing the backdoor learning slope while ensuring that the trade-off with accuracy remains acceptable.

The study of additional factors, such as the dimensionality of the data, is straightforward using the proposed framework but left for future work. Our current results already provide important insights and a starting point to derive guidelines for designing models that are more robust against backdoor poisoning.