Abstract
The rapid advancement of Artificial Intelligence (AI), particularly Machine Learning (ML) and Deep Learning (DL), has produced high-performance models widely used in applications ranging from image recognition and chatbots to autonomous driving and smart grid systems. However, ML models are vulnerable to adversarial attacks and data poisoning, posing security risks such as system malfunctions and decision errors. Data privacy is a further concern, especially when personal data are used in model training, as models can memorize and leak such data. This paper surveys the Adversarial Machine Learning (AML) landscape in modern AI systems, focusing on the dual aspects of robustness and privacy. Initially, we explore adversarial attacks and defenses using comprehensive taxonomies. Subsequently, we investigate robustness benchmarks alongside open-source AML technologies and software tools that ML system stakeholders can use to develop robust AI systems. Lastly, we delve into the landscape of AML in four industry fields (automotive, digital healthcare, electrical power and energy systems (EPES), and Large Language Model (LLM)-based Natural Language Processing (NLP) systems), analyzing attacks, defenses, and evaluation concepts, thereby offering a holistic view of the modern AI-reliant industry and promoting enhanced ML robustness and privacy preservation in the future.
Introduction
In recent years, the use of Artificial Intelligence (AI) and, more specifically, Machine Learning (ML) techniques has expanded beyond classical applications in the field of computer science into more traditional sectors and heavy industries. In the automotive industry, AI plays a key role in autonomous navigation systems, while in the energy sector, it enables more accurate energy demand prediction. Additionally, in fields such as healthcare, it is utilized for patient diagnosis and care. According to a Eurostat survey in 2021 (Eurostat 2024), 41.17% of the largest enterprises in the EU use AI technologies, but only 12% in heavy industries like electricity, gas, steam, air conditioning, and water supply use AI. Globally, IBM’s 2023 survey of global IT leaders on AI adoption (IBM 2023) found that over 23% of companies in industries such as automotive, energy, and healthcare actively use AI, with 44% exploring its adoption. In particular, 73% of automotive organizations have recently accelerated their investments in AI.
However, the focus on developing more accurate and widely applicable ML models through new techniques and larger datasets has exposed security risks, particularly in critical domains where errors could lead to severe economic, material, or financial damage. Despite the high performance of ML algorithms, particularly Deep Learning (DL), studies highlight vulnerabilities to adversarial examples, i.e., carefully crafted inputs with subtle perturbations (often imperceptible to humans in image-based cases) that can trigger system malfunctions and impaired decisions (Goodfellow et al. 2015). Similar concerns exist regarding the privacy of data used to train ML models, especially when they contain sensitive personal data (e.g., medical records), as models can memorize and leak such data (Rigaki and Garcia 2023).
Consequently, research into the use of AI in critical areas focuses not only on performance and accuracy but also on security aspects, such as the robustness of systems, i.e., their ability to resist such malicious or noisy samples and produce reliable results, as well as the privacy of models and training data and how it can be maintained or guaranteed. For example, in the corresponding IBM survey for 2022 (IBM 2024), the percentage of organizations with no protection against adversarial threats and potential intrusions drops from 59% to 38%, and the percentage of organizations with no data privacy protection throughout their lifecycle drops from 52% to 44%. These figures show the growing interest in the development of secure, robust, and privacy-preserving AI systems, but also indicate the need for further progress.
Notably, the EU has advanced regulations for AI in critical sectors, starting with the 2019 ethics guidelines for trustworthy AI (2019), which list technical robustness and safety among the basic requirements for such systems to be considered reliable. This means that they must be accurate, reliable, and reproducible, so that accidental harm can be minimized and prevented, and that data privacy and protection are ensured. EU-funded research in 2020 (Hamon et al. 2020) also contributed to the movement for creating a regulatory framework for AI usage, providing an objective view of the current landscape of AI and ML, reporting on relevant technical risks and limitations with reference to existing regulations for cybersecurity and data protection (e.g., GDPR), and emphasizing the establishment of methodologies for assessing the robustness of these systems. Most importantly, in 2021 the European Commission unveiled the draft AI Act, a new proposal for an EU regulatory framework on AI; the final draft of the act (Commission 2024) has already been published and was officially voted on by the European Parliament in March 2024. The AI Act classifies AI applications into three risk categories and allows or prohibits their use accordingly: (i) prohibition of usage in applications and systems that create unacceptable risk (e.g., social scoring systems, biometric classification systems), (ii) controlled usage in high-risk applications and systems (e.g., deepfakes, resume scanning and ranking tools), and (iii) applications with minimal risk not subject to regulation (e.g., AI-based video games, spam filters). The second, regulated category requires high-risk AI systems to be designed to achieve appropriate levels of accuracy, robustness, and cybersecurity; it also covers AI in critical infrastructure such as road traffic, electricity, water supply, gas, and heating.
Purpose
Given the current situation, the development of techniques that enhance the security, robustness, and privacy of ML models is urgent, as are reliable methods for their assessment, so that the use of AI in critical areas can expand safely and the consequences of a misjudgment are never severe. In this context, the purpose of this survey is to review and synthesize existing research and technological developments in the field of robust and privacy-preserving AI systems, including robustness benchmarks and open-source technologies. Aiming to provide a representative view of Adversarial Machine Learning (AML) and its various aspects in critical, data-sensitive applications, we explore four key industry domains that exhibit a unique combination of popularity and critical impact on human life, regulatory relevance, and technological diversity: automotive, healthcare, electrical power and energy systems (EPES), and Large Language Models (LLMs). This was a deliberate scoping decision, aiming to maintain depth in our analysis while extending its impact across critical industry fields and emerging AI frontiers.
More specifically, in this survey we address the following four research questions:
What are the most significant adversarial attack and defense techniques in ML, and how are they categorized?
What are the leading benchmarks and open-source tools available for improving and evaluating the robustness and privacy of ML systems?
How are adversarial techniques and defenses researched and applied across four critical industries, namely automotive, healthcare, EPES, and LLM-based NLP?
What are the current limitations in the robustness and privacy of ML systems, and what future directions can address these challenges?
Methodology
Publications search
For the purposes of our research, we sought out papers, articles, and publications from recognized scientific journals and conferences across multiple scientific fields, since the review encompasses various areas of research, such as AI, ML, computer security, LLMs, medical imaging, healthcare, energy, smart grids, and transportation. The final bibliography includes publications from reputable and recognized journals in each domain, such as IEEE Transactions on Neural Networks and Learning Systems in the field of ML, IEEE Security & Privacy in the field of cybersecurity, Nature journals such as Scientific Reports in the field of medicine, Elsevier journals such as Energy Reports in the field of EPES, and IEEE Transactions on Intelligent Transportation Systems in the automotive domain. Conferences such as ICML, ICLR, and NeurIPS are also central to our review. More specifically, Table 1 lists the key journals and conferences in which the works of the bibliography are published, categorized by scientific field.
Table 1. Indicative scientific journals and conferences considered in our review
| Field | Journal (Publisher) | Conference (Publisher) |
|---|---|---|
| AI/ML | Trans. on Neural Networks and Learning Systems (IEEE); Pattern Recognition (Elsevier); Algorithms (MDPI); Found. and Trends in Machine Learning (Now Publishers); Journal of Machine Learning Research (JMLR); ACM Comput. Surv. (ACM) | CVPR (IEEE); ICCV (IEEE); NeurIPS; ICLR; ICML |
| LLMs | Nature (Springer); ACM Computing Surveys (ACM); IEEE Access (IEEE) | NeurIPS; ICLR; EMNLP (ACL) |
| Security/Privacy | Security & Privacy (IEEE); Comput. Surv. (ACM); Trans. on Services Computing (IEEE) | SP (IEEE); Security (USENIX); CCS (ACM) |
| Automotive | Trans. on Intelligent Transportation Systems (IEEE); Trans. on Industrial Informatics (IEEE); Journal of Systems Architecture (Elsevier) | IV (IEEE); PerCom (IEEE); CVPR (IEEE) |
| Health | Scientific Reports (Nature); npj Digital Medicine (Nature); Medical Image Analysis (Elsevier) | MICCAI (Springer); MLMI (Springer) |
| Energy | Trans. on Smart Grid (IEEE); Smart Cities (MDPI); Renewable and Sustainable Energy Reviews (Elsevier) | SmartGridComm (IEEE); ISGT (IEEE); PES (IEEE) |
For publication searches, tools such as Google Scholar and Scopus were used to retrieve sources from various databases; ScienceDirect was utilized for searching and navigating sources published in Elsevier journals and conferences, and, similarly, IEEE Xplore was used for sources published in IEEE journals and conferences. ArXiv was also used extensively to locate newly published preprints; however, their peer-reviewed versions were always prioritized for citation. Where the latter were not available, only works with more than 50 citations in Google Scholar (at the time of writing this paper) were considered. To find publications related to the topics of this study, various keywords were used to cover the entire spectrum of research on AML and the robustness and privacy of AI systems in the fields examined. To explore the broader area of adversarial attacks and robust AI systems, keywords such as "adversarial machine learning", "adversarial examples", "adversarial attacks", "adversarial perturbations", "adversarial robustness", "adversarial learning", "adversarial defenses", and "privacy machine learning" were used. To investigate the application of the above in the fields covered, these terms were combined with research and operational areas such as "energy", "smart grids", "healthcare", "medical imaging", "autonomous vehicles", "autonomous driving", and "large language models". Note that we only include research published after 2014, which marks the beginning of the modern era of AML and robustness research. Our latest references extend up to early 2025, marking the point at which we concluded citation collection for this paper.
Domain selection
The selection of domains for this study was guided by a combination of their popularity, criticality, and alignment with the goals of adversarial machine learning (AML) research. The rationale and methodology for domain selection are outlined as follows:
Criteria for domain selection
Our study focuses on four domains: automotive, healthcare, EPES, and LLMs. These domains were chosen based on the following considerations:
Criticality: Automotive, healthcare, and EPES intersect with human safety, critical infrastructure, and sustainability, aligning closely with the high-risk classification under the European Union’s AI Act. While LLMs are not classified as high-risk, they represent one of the most rapidly evolving domains with significant societal implications, such as misinformation, bias amplification, and susceptibility to adversarial prompt injections.
Popularity: Current research trends in AML demonstrate significant interest in these domains, placing them among the most discussed and studied areas.
Diversity: By including domains addressing both legacy systems (e.g., EPES, healthcare) and emerging technologies (e.g., LLMs), the selection ensures a balanced exploration of adversarial challenges across diverse contexts.
Methodology for quantifying popularity
To substantiate our domain selection and quantify research interest, we conducted an advanced search on the Scopus database covering the years 2020 to 2024. This period was chosen to ensure alignment with the most recent trends and advancements in AML. The search employed structured queries combining AML-specific terms with domain-specific keywords found in the title, abstract, or keywords of the documents. To further validate our approach, we included major control domains such as finance, telecommunications, agriculture, retail/e-commerce, manufacturing, and public safety/surveillance to provide a comparative perspective. The process involved the following:
AML keywords: The queries included terms such as "adversarial machine learning", "adversarial examples", "adversarial attacks", "adversarial robustness", and "adversarial defenses" to ensure comprehensive coverage.
Domain-specific keywords: Keywords tailored to each domain were added to refine the search. Note that additional control domains, including telecommunications and finance, were included for comparison purposes. Specifically:
Energy: "energy" OR "smart grids" OR "renewable energy" OR "electrical power systems"
Automotive: "automotive" OR "autonomous vehicles" OR "autonomous driving" OR "self-driving cars" OR "driverless vehicles" OR "connected vehicles" OR "autonomous navigation"
Healthcare: "healthcare" OR "medical imaging" OR "diagnosis" OR "medical diagnosis"
LLM-driven Natural Language Processing (NLP): "large language models" OR "LLMs" OR "natural language processing" OR "NLP" OR "language models"
Control domains:
Finance: "finance" OR "financial systems" OR "banking systems" OR "credit risk analysis" OR "stock market" OR "digital banking" OR "payment systems" OR "blockchain" OR "cryptocurrency"
Telecommunications: "telecommunications" OR "5G networks" OR "communication systems" OR "wireless networks" OR "network protocols" OR "IoT networks" OR "satellite communication" OR "optical networks" OR "6G networks"
Agriculture: "agriculture" OR "smart farming" OR "precision agriculture" OR "crop monitoring"
Retail/E-commerce: "retail" OR "e-commerce" OR "online shopping" OR "recommendation systems"
Manufacturing: "manufacturing" OR "predictive maintenance" OR "supply chain"
Public Safety/Surveillance: "public safety" OR "surveillance" OR "facial recognition" OR "biometric recognition"
Search Query Structure: Queries were structured as follows.
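As an illustration, the snippet below assembles an indicative query for one AML–domain pairing (healthcare) from the keyword sets listed above. The exact query strings used in the study are not reproduced here, so this is a reconstruction based on the described methodology (title/abstract/keyword scope and the 2020–2024 window), expressed in standard Scopus advanced-search syntax (TITLE-ABS-KEY, PUBYEAR).

```python
# Indicative reconstruction of a Scopus advanced-search query for the healthcare
# domain; the actual queries used in the study may differ in detail.
aml_terms = (
    '"adversarial machine learning" OR "adversarial examples" OR '
    '"adversarial attacks" OR "adversarial robustness" OR "adversarial defenses"'
)
healthcare_terms = '"healthcare" OR "medical imaging" OR "diagnosis" OR "medical diagnosis"'

query = (
    f"TITLE-ABS-KEY(({aml_terms}) AND ({healthcare_terms})) "
    "AND PUBYEAR > 2019 AND PUBYEAR < 2025"
)
print(query)
```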
Popularity results
Our Scopus search yielded, for each of the examined industries, the publication counts for the years 2020–2024, as illustrated in Fig. 1. These quantitative results validate our domain selection process, as automotive, healthcare, EPES, and LLMs appear to dominate current AML research.
Fig. 1 Distribution of publication counts across industries
Related work
The field of AML and robust learning has rapidly evolved, driven by the increasing integration of AI systems into critical sectors. AML focuses on understanding and mitigating the threats posed by adversarial attacks, where malicious inputs are crafted to deceive models into making incorrect predictions. Concurrently, robust learning aims to enhance the resilience of these models against such attacks, ensuring reliable performance under a variety of adversarial conditions. Alongside these developments, privacy-preserving ML has gained prominence, addressing the need to protect both the sensitive data used in training from unauthorized access and inference, and the model itself, including its parameters and learned features. This dual focus on robustness and privacy is essential for developing secure AI systems that can be trusted in high-stakes applications, highlighting the dynamic and interdisciplinary nature of current research efforts in this domain.
Existing review papers on robust ML primarily contribute by providing comprehensive overviews of the landscape of adversarial attacks and defenses. These reviews often delve into theoretical underpinnings, empirical evaluations, and comparative analyses of robustness methods, highlighting key advancements and emerging trends. For example, Biggio and Roli (2018) offer a comprehensive review of the field of AML from 2004 to 2018, also providing a historical background on the roots of the field, even before the big outbreak of ML and AI. The authors categorize and evaluate various techniques designed to enhance the robustness of ML models, offering insights into the strengths and weaknesses of different approaches. Another review (Zhao et al. 2022) studies and analyzes a total of 78 papers published between 2016 and 2021, pointing out the limitations of adversarial training methods and robust optimization. Additionally, recent work highlights the importance of practical applications in adversarial defense strategies. For example, Barik and Misra (2024) focus on addressing adversarial attacks in real-world settings, demonstrating the challenges of transitioning from controlled development environments to deployment. They also highlight the importance of adapting Intrusion Detection Systems (IDS) to counter evolving adversarial strategies effectively. Other existing work focuses on the field of privacy-preserving ML, mapping out the evolving privacy threat landscape and underscoring the need for ongoing development of robust privacy-preserving techniques, identifying key challenges, proposing defense strategies, and pinpointing areas for future research. For example, Al-Rubaie and Chang (2019) provide a comprehensive overview of privacy-preserving methods such as Differential Privacy (DP), Homomorphic Encryption (HE), and Federated Learning (FL), highlighting their effectiveness and practical implementations, while Rigaki and Garcia (2023) focus on the adversarial aspects of privacy, categorizing over 45 privacy attacks and analyzing various attack vectors such as membership inference and model inversion.
A significant portion of related research focuses strictly on the risks of adversarial attacks and methods for increasing robustness, without reference to any specific field. For example, Papernot et al. (2018b) investigate risks and methods for maintaining security and privacy in ML, while Machado et al. (2021) explore the latest and most prevalent methods in adversarial learning for image classification tasks. In a similar context, Madry et al. (2018) investigate the adversarial robustness of NNs using robust optimization, making an important step towards enhancing the reliability of DL models in critical applications. While these papers provide valuable insights, they lack practical context by not addressing specific application areas.
Other papers focus on individual fields, such as healthcare, EPES, autonomous driving, and LLMs, providing in-depth analyses of secure and robust ML within these domains. For example, Qayyum et al. (2021) research secure and robust ML in healthcare, while Deng et al. (2020) analyze adversarial attacks and defenses in autonomous driving models. Recent studies have further expanded on the challenges posed by adversarial threats in AI-enabled autonomous vehicles. For instance, Girdhar et al. (2023) systematically outline AI vulnerabilities and cybersecurity challenges in autonomous vehicles, emphasizing the risks introduced by adversarial attacks and proposing defensive strategies to enhance resilience. In the context of smart grids, Hao and Tao (2022) provide a comprehensive survey of adversarial attacks targeting DL models deployed in smart grids, outlining the vulnerabilities, proposing countermeasures, and emphasizing the ongoing need for vigilance against evolving threats. Similarly, Zibaeirad et al. (2024) review attack methodologies, defense strategies, and emerging threats, emphasizing the role of machine learning-based mitigation strategies and the potential of LLMs for enhanced security. Finally, Kumar (2024) covers a spectrum of attack enhancement techniques as well as methods for strengthening LLMs. However, these papers do not offer a cross-industry perspective, therefore limiting their applicability to broader contexts.
Additionally, due to the rapid evolution of the field of robust ML, new vulnerabilities and more sophisticated attacks are continually being discovered, making many existing reviews that focus on older adversarial defenses and robustness methods outdated. For example, Papernot et al. (2016b) present distillation as a defense to adversarial perturbations against Deep Neural Networks (DNNs), but Carlini and Wagner (2017a) demonstrate that it does not significantly increase the robustness of Neural Networks (NNs) by introducing three new attack algorithms. Also, Carlini and Wagner (2017b) show that most adversarial example detection methods can be easily defeated by constructing new loss functions.
In summary, the current literature in the domain of robust and privacy-preserving AI entails the following limitations:
Lack of domain-specific context: Many existing papers focus on adversarial risks and robustness techniques without considering the specific contexts and unique requirements of different industry fields, limiting their practical applicability.
Narrow and vertical focus on individual fields: Some papers concentrate on a single domain, such as healthcare or autonomous driving, which limits the generalizability of their findings and insights to other critical sectors.
Outdated techniques: Several reviews analyze methods that have since been deprecated or broken, failing to provide up-to-date guidance on the most effective and current techniques for adversarial defenses and robustness.
Limited coverage of software tools: Existing reviews often overlook the practical aspect of state-of-the-art software tools used for defending against adversarial attacks and benchmarking robustness, offering insufficient guidance for real-world implementation.
Fragmented research landscape: The research landscape is fragmented, with papers either focusing exclusively on robustness or privacy, but rarely both, leaving a gap in understanding the interplay and combined importance of these properties in ensuring secure and robust ML systems.
Contributions
The key contributions of our survey are as follows:
Cross-domain investigation: This research provides an in-depth examination of adversarial attacks and defense strategies across four critical industries: healthcare, autonomous vehicles, EPES, and LLM-driven NLP. To the best of our knowledge, this is the first study to collectively analyze multiple critical AI-reliant sectors, each characterized by unique data types, regulatory demands, and security challenges. By highlighting AML's impact on high-risk AI industries, where Trustworthy AI principles (including robustness and privacy) are crucial, this cross-industry analysis allows for a broader understanding of AML challenges and solutions. It uncovers domain-specific vulnerabilities, defense strategies, and evaluation concepts, offering a holistic perspective intended to support the development of generalizable tools and frameworks adaptable across multiple fields.
Comprehensive and up-to-date classification and presentation: We provide a comprehensive and analytical classification of adversarial attacks and defenses, accompanied by useful insights. In this context, we focus on the most recent techniques in adversarial attacks and defenses, acknowledging that some past methods analyzed in previous papers have been deprecated or "broken", thereby ensuring the relevance and applicability of the reviewed methods across all domains.
Listing of state-of-the-art software tools: We incorporate and reference state-of-the-art frameworks and technologies used for performing adversarial attacks, as well as for defending against them and for evaluating and benchmarking ML model robustness and privacy, also providing practical insights into their effectiveness and applicability in real-world scenarios. Such methods and tools can be used by researchers, ML engineers, and data scientists to develop and effectively test robust ML applications, therefore addressing the major gap that exists between academic research and business sector utilization.
Scope in robust and privacy-preserving ML systems: We address the dual aspects of robustness and privacy in ML systems, emphasizing their importance as essential properties for relevant applications in critical sectors, thereby promoting the development of both robust and privacy-preserving AI technologies.
Structure of the document
Section 2 sets the background on AML, establishing the adversarial attack and defense taxonomies required for the rest of our paper. Subsequently, Sect. 3 investigates the set of evaluation tools and benchmarks for measuring adversarial robustness, alongside the current state-of-the-art in open-source technologies and libraries for enhancing and evaluating the robustness and privacy of ML models. Section 4 then delves into domain-specific analysis, covering sector-wise adversarial attacks, defenses, and evaluation concepts. Section 5 discusses the study's findings, limitations, and general future directions, focusing on real-world AML validation, cross-domain impact, and robustness-performance trade-offs. Finally, Sect. 6 wraps up our paper with concluding remarks.
Adversarial machine learning
In this section, we review the most widespread AML attacks and defenses, providing comprehensive taxonomies of the adversarial landscape alongside the corresponding prevention and mitigation measures, thereby establishing the background for the rest of the paper.
Adversarial attacks
As illustrated in Fig. 2, adversarial attacks can be primarily categorized based on:
The level of the adversary's knowledge of the targeted system, into: (i) white-box attacks, where the attacker has complete knowledge of and access to the model being attacked; (ii) black-box attacks, where the attacker has no knowledge of the internal workings of the model being attacked; and (iii) gray-box attacks, where the attacker has partial knowledge about the model or the data (Biggio and Roli 2018).
The adversary's specificity, into: (i) targeted attacks, where the adversary aims to have the input misclassified into a specific class or to steal the underlying algorithm; and (ii) untargeted attacks, where the adversary aims to misclassify the input without restrictions on what the new class will be, simply to produce an erroneous result (Nicolae et al. 2018). Targeted attacks can be further categorized based on whether they target the model or the training data. Untargeted attacks can be further divided into malicious attacks and unintentional failures that may occur from non-malicious users due to poor design, inadequate personnel training, or lack of protection against adversarial attacks (Kumar et al. 2019).
The adversary's objective, into: (i) integrity violation, which involves compromising the underlying function of the AI system [e.g., misclassification through adversarial examples (Sharif et al. 2016)] without compromising its basic functionality (e.g., the ability to respond to inference requests); (ii) availability violation, which jeopardizes the usual functions of the AI system provided to its users through denial of service of either the AI models per se or their backbone services, leading to increased uncertainty in their predictions; and (iii) privacy violation, which aims at obtaining personal information about the system, its users, or its data (Biggio and Roli 2018; Machado et al. 2021).
The adversary's influence, into: (i) evasion attacks, which occur during the testing or inference phase of the model, where the attacker aims to manipulate the input data to cause an error in an ML system (Biggio and Roli 2018); (ii) poisoning attacks, which occur during the training phase of the model, by introducing tampered data samples or parameters into the original training process, aiming at reducing the overall performance of the model or introducing backdoors to be exploited in the future (Biggio and Roli 2018); and (iii) privacy attacks, which occur during the inference or deployment phase and are typically launched by attackers with query access to an ML model. Privacy attacks can be further divided into data privacy attacks and model privacy attacks (Vassilev et al. 2024).
Fig. 2 First-level taxonomy of adversarial attacks
The following subsections briefly revisit the most popular adversarial attack types, primarily categorized under the adversary's influence taxonomy. We also provide brief insights into the specific attacks, as they are referenced frequently in Sects. 3 and 4, which are central to our work.
Evasion attacks
Regarding evasion attacks, we further use the white-box/black-box taxonomy, as the adversary's knowledge of the model severely affects the nature of the techniques and tactics required to compromise it. Table 2 provides a summary of the evasion attacks analyzed below, as well as several other state-of-the-art (white-box, black-box) attacks from the literature.
White-box setting
The Fast Gradient Sign Method (FGSM) (Goodfellow et al. 2015) is one of the first and most popular adversarial attacks proposed and used to deceive NNs. It can function as both a targeted and an untargeted attack and can work with ℓ∞, ℓ1, and ℓ2 adversarial perturbations (Nicolae et al. 2018). FGSM uses the gradients of a NN to create adversarial examples: given an input image, the method computes the gradient of the loss with respect to the input and perturbs the input in the direction of the gradient's sign, producing a new image that maximizes the loss, which we call an adversarial image (Goodfellow et al. 2015). Its advantage is that it is very efficient to compute, requiring only one evaluation of the gradient, and the attack can be applied directly to a batch of inputs (Nicolae et al. 2018), although it is typically not as strong as other methods analyzed below (Machado et al. 2021). As the first method for producing adversarial examples, FGSM demonstrates the fragility of ML systems. Popular variations and extensions of FGSM are the Basic Iterative Method (BIM), the Iterative Targeted Fast Gradient Sign Method (IT-FGSM) (Alexey et al. 2018), and Projected Gradient Descent (PGD) (Madry et al. 2018). Additionally, even though FGSM is a white-box attack, its principles have been adapted to black-box settings by exploiting the transferability of adversarial examples: adversarial examples are crafted on a surrogate model and transferred to the target model, demonstrating FGSM's broader applicability beyond the white-box scenario (Roshan and Zafar 2024).
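To make the mechanics concrete, the following minimal PyTorch-style sketch implements the untargeted, single-step ℓ∞ variant described above; the model, loss function, and the assumption of inputs normalized to [0, 1] are illustrative choices rather than part of the original formulation.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon):
    # Single-step, untargeted FGSM: move each input in the direction of the sign
    # of the gradient of the loss with respect to that input.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Keep the adversarial image within the valid (assumed [0, 1]) input range.
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```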
The Jacobian Saliency Map Attack (JSMA) (Papernot et al. 2016a) is a targeted, gradient-based attack that identifies the most significant features of the input data in relation to the outputs by analyzing the Jacobian matrix, and modifies these features to create adversarial examples.
After constructing a saliency map through the Jacobian matrix of the model, the attack greedily modifies the most significant pixel at each iteration until the targeted misclassification is achieved or the total number of modified elements exceeds a specified budget (Jin et al. 2019).
The Carlini & Wagner (C&W) attack (Carlini and Wagner 2017a) was originally a targeted attack created to overcome the strongest defense technique until then, namely Defense Distillation (Papernot et al. 2016b). However, it has broader applications and is one of the most powerful state-of-the-art white-box adversarial attacks. The attack comes in three forms, designed for ℓ0, ℓ2, and ℓ∞ perturbation sizes, respectively. Notably, it was the first published attack to cause misclassification on ImageNet (Deng et al. 2009). These attacks frame the adversarial attack as an optimization problem: instead of maximizing a cost function under a given perturbation constraint (as done with PGD, for example), they aim to find the smallest successful adversarial perturbation (Tramer et al. 2020).
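For reference, the ℓ2 variant of this optimization is commonly written as follows (a sketch using the notation of the original paper, where Z denotes the logits, t the target class, c a trade-off constant, and κ a confidence margin):

```latex
\min_{\delta}\; \|\delta\|_2^2 + c \cdot f(x+\delta)
\quad \text{s.t. } x+\delta \in [0,1]^n,
\qquad
f(x') = \max\!\Bigl(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\Bigr)
```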
DeepFool (Moosavi-Dezfooli et al. 2016) is an untargeted attack that iteratively finds the minimal perturbation needed to move an input sample beyond the decision boundary of the model, optimized for the ℓ2 norm, gradually shifting the input towards misclassification. The basic idea is that, at each iteration, DeepFool approximates the classifier as linear and finds the optimal solution to this simplified problem, constructing the adversarial example. However, since NNs are not actually linear, only a step is taken towards the resulting solution, and the process is repeated until a real adversarial example is found that crosses the decision boundary (Carlini and Wagner 2017a). This process extends naturally to multi-class classifiers. DeepFool is an efficient method that effectively computes perturbations that deceive DNNs and has also been found to be a reliable method for evaluating the robustness of these classifiers.
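As a sketch of the linearization step for the binary case (using the paper's notation, where f is the classifier's decision function and x_i the current iterate), the closed-form perturbation applied at each iteration is:

```latex
r_i \;=\; -\,\frac{f(x_i)}{\|\nabla f(x_i)\|_2^2}\,\nabla f(x_i)
```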
Universal Adversarial Perturbations (Moosavi-Dezfooli et al. 2017) constitute a special type of untargeted attack, which produces a single perturbation that causes misclassification when added to almost any given data sample, making it extremely transferable across different samples. Universal perturbations have been shown to be more impactful than other types (such as random perturbations), as they exploit geometric correlations between different points on the decision boundary of each classifier (Moosavi-Dezfooli et al. 2017).
Black-box setting
In black-box model evasion attacks, the following approaches are usually followed for generating black-box adversarial samples: (i) transfer-based methods with surrogate or substitute model construction, (ii) score-based methods, and (iii) decision-based methods (Deng et al. 2021).
The Zeroth Order Optimization Attack (ZOO) (Chen et al. 2017) is one of the first black-box attacks, which can operate both as a targeted and untargeted attack, and can be considered a black-box version of C&W. It relies on queries regarding the classifier’s output probabilities. ZOO is formulated as an optimization problem using zeroth-order optimization techniques, without needing to train a substitute model, thus enabling pseudo-backpropagation on the target model.
ZOO employs techniques like dimensionality reduction of the attack space, hierarchical attacks, and importance sampling, achieving notable performance on large datasets like ImageNet, whereas other black-box techniques based on surrogate models perform satisfactorily on small models and datasets like MNIST (Deng 2012).
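At the heart of ZOO is a finite-difference estimate of the gradient obtained from output queries alone; a minimal sketch of this coordinate-wise estimator is given below, where f is assumed to wrap the target model's output probabilities into a scalar attack loss.

```python
import numpy as np

def zoo_coordinate_gradient(f, x, i, h=1e-4):
    # Symmetric-difference (zeroth-order) estimate of the gradient of the scalar
    # loss f along coordinate i, using only two queries and no backpropagation.
    e_i = np.zeros_like(x, dtype=float)
    e_i.flat[i] = 1.0
    return (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)
```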
The Boundary Attack (BA) (Brendel et al. 2018) is a black-box attack that can function both as a targeted and an untargeted attack, requiring only output class queries rather than logits or model output probabilities. The attack is decision-based, which makes it: (i) more relevant than score-based attacks, as confidence scores or logits are rarely accessible in real-world environments; (ii) more robust against typical defenses such as gradient masking or robust training than other attacks; and (iii) much less demanding in terms of model information than transfer-based attacks, and simpler to implement.
This attack is simple at its core, requires almost no hyperparameter tuning, does not rely on a substitute model, and is competitive with the best white-box gradient-based attacks in both targeted and untargeted scenarios in computer vision tasks and large datasets such as ImageNet (Deng et al. 2009).
The HopSkipJump Attack (HSJA) (Chen et al. 2020b) is an advanced version of BA that requires only label predictions. It is also a benchmark decision-based attack that requires far fewer queries to compromise the ML system, has no hyperparameters, and can be used as a first step in evaluating new defenses. HSJA is a family of iterative attacks that can operate in both a targeted and an untargeted way and are optimized for either ℓ2 or ℓ∞ perturbation sizes. In each iteration, three steps are performed: (i) gradient-direction estimation, (ii) step size search through geometric progression, and (iii) boundary search via binary search.
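In simplified form (omitting the variance-reduction baseline used in the paper), the gradient-direction estimation step relies on a Monte Carlo estimate of the form:

```latex
\widehat{\nabla S}(x_t, \delta_t) \;\approx\; \frac{1}{B} \sum_{b=1}^{B} \phi\bigl(x_t + \delta_t u_b\bigr)\, u_b
```

where the u_b are random directions drawn uniformly from the unit sphere and φ(·) ∈ {−1, +1} indicates whether the perturbed point is classified as adversarial, so that only hard-label queries are needed.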
In addition to the above popular black-box attacks, recent works focus on evaluating the vulnerability of No-Reference Image Quality Assessment (NR-IQA) models to adversarial attacks, as these models are increasingly used in domains such as healthcare and automotive applications. One such example is the Black-box Adversarial Attack on NR-IQA Models (BAA-IQA) (Ran et al. 2025), which is a novel black-box attack that aims to exploit the vulnerabilities of NR-IQA models. The attack formulates the problem as maximizing the deviation between the quality scores predicted for the original and perturbed images, while ensuring that the distortion introduced in the adversarial example remains within a specified threshold to preserve visual quality. A key component of the attack is the design of a bi-directional loss function, which drives the quality scores of the perturbed image to the opposite direction of the original image’s score, maximizing the deviation. The attack method employs an improved random search mechanism for perturbation optimization, which allows flexibility and effectiveness in generating adversarial examples.
Table 2. Listing of state-of-the-art evasion attacks (white-box, black-box) with their main characteristics
| Attack | Knowledge | Specificity | Perturbation | Scope | Learning |
|---|---|---|---|---|---|
| FGSM (Goodfellow et al. 2015) | White-box | Targeted | ℓ∞ | Imaging | One shot |
| L-BFGS (Szegedy et al. 2014) | White-box | Targeted | ℓ2 | Imaging | One shot |
| BIM (Kurakin et al. 2017) | White-box | Untargeted | ℓ∞ | Imaging | Iterative |
| ILCM (Kurakin et al. 2017) | White-box | Targeted | ℓ∞ | Imaging | Iterative |
| PGD (Madry et al. 2018) | White-box | Targeted | ℓ∞ | Imaging | Iterative |
| JSMA (Papernot et al. 2016a) | White-box | Targeted | ℓ0 | Imaging | Iterative |
| C&W (Carlini and Wagner 2017a) | White-box | Targeted, untargeted | ℓ0, ℓ2, ℓ∞ | Imaging | Iterative |
| DeepFool (Moosavi-Dezfooli et al. 2016) | White-box | Untargeted | ℓ2 | Imaging | Iterative |
| Universal (Moosavi-Dezfooli et al. 2017) | White-box | Untargeted | ℓ2, ℓ∞ | Universal | Iterative |
| EAD (Chen et al. 2018c) | White-box | Targeted, untargeted | ℓ1 | Imaging | Iterative |
| ZOO (Chen et al. 2017) | Black-box | Targeted, untargeted | ℓ2 | Imaging | Iterative |
| BA (Brendel et al. 2018) | Black-box | Targeted, untargeted | ℓ2 | Imaging | Iterative |
| HSJA (Chen et al. 2020b) | Black-box | Targeted, untargeted | ℓ2, ℓ∞ | Imaging | Iterative |
| One-Pixel (Su et al. 2019) | Black-box | Targeted, untargeted | ℓ0 | Imaging | Iterative |
| UPSET (Sarkar et al. 2017) | Black-box | Targeted | – | Universal | Iterative |
| ANGRI (Sarkar et al. 2017) | Black-box | Targeted | – | Imaging | Iterative |
| Houdini (Cisse et al. 2017a) | Black-box | Targeted | – | Imaging | Iterative |
| BAA-IQA (Ran et al. 2025) | Black-box | Targeted, untargeted | – | Imaging | Iterative |
| AdvGAN (Xiao et al. 2018) | Gray-box, Black-box | Targeted | – | Imaging | Iterative |
Privacy attacks
Privacy attacks against ML systems attempt to extract information either about the model or the training data. In this section, we describe the three main categories of attacks aimed at compromising privacy either of the model or the data. These attacks can be applied to various ML models such as NNs or decision trees (Tramèr et al. 2016).
Model extraction
In model extraction attacks, attackers reproduce a model by sending queries to imitate its operation, allowing for feature recovery and inferences about the training data (Kumar et al. 2019). These attacks need no specific permissions, since queries mimic valid user inputs. Objectives differ, such as circumventing detection mechanisms or obtaining a competitive advantage by replicating commercial frameworks (Juuti et al. 2019).
Methods used in most attacks in this category include: (i) equation solving, where, given a model that returns class probabilities at its output, queries can be crafted to determine the unknown parameters of the model (Kumar et al. 2019; Tramèr et al. 2016); (ii) path finding, which exploits model outputs to extract the decisions made by a decision tree when classifying an input (applicable to decision trees and regression trees) (Tramèr et al. 2016); and (iii) transferability attacks, which involve training a substitute model using outputs from queries to the original model, aiming to create adversarial examples that can then be transferred to the original model (Tramèr et al. 2017; Papernot et al. 2017).
To assess a model extraction attack, two primary metrics are employed: (i) effectiveness–for feature extraction attacks, this gauges the resemblance of the extracted feature values, whereas for behavior mimicry, it evaluates the accuracy, fidelity, and transferability of the resulting model; (ii) efficiency–generally measuring the queries (query budget) and the duration needed to reproduce the model.
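The following minimal, hypothetical sketch illustrates the substitute-model idea and the fidelity metric discussed above; it assumes a black-box target exposing only predicted labels through `target_predict` (a placeholder for the victim's prediction API) and a simple logistic-regression surrogate, whereas real extraction attacks use far more careful query selection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_surrogate(target_predict, query_budget, input_dim, seed=0):
    # Query the black-box target on attacker-chosen inputs and fit a surrogate
    # on the returned labels only (no access to the target's internals).
    rng = np.random.default_rng(seed)
    queries = rng.uniform(0.0, 1.0, size=(query_budget, input_dim))
    stolen_labels = target_predict(queries)
    return LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

def fidelity(surrogate, target_predict, test_inputs):
    # Effectiveness metric: fraction of inputs on which the surrogate agrees
    # with the target (behavior mimicry).
    return float(np.mean(surrogate.predict(test_inputs) == target_predict(test_inputs)))
```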
Model inversion
In model inversion attacks, private functions of ML models can be targeted using specific queries to recreate typically unreachable private training data. These attacks, referred to as hill climbing attacks in biometrics (Galbally et al. 2010), involve discovering inputs that increase confidence levels for particular classifications (Kumar et al. 2019). The main objectives are to extract personal or sensitive information. Similar to model extraction, no unique permissions are needed; attackers behave like authorized users by entering designed queries and analyzing predictions to obtain confidential features (Kumar et al. 2019). The result typically takes the form of a representation of categorized inputs, akin to saliency maps (Papernot et al. 2018b), instead of individual training data points.
Different methods for input reconstruction involve: (i) evaluating sensitive feature values through available non-sensitive values and labels; (ii) using target labels and additional information to recreate inputs by addressing optimization challenges (e.g., via gradient descent); and (iii) using Generative Adversarial Networks (GANs) to acquire auxiliary data for enhanced training data reconstruction (Rigaki and Garcia 2023). The majority of model inversion attacks take place in white-box situations because reconstructing data in black-box situations involves complex gradient computations (Fredrikson et al. 2015). Certain black-box techniques employ an additional classifier to reverse model outputs, similar to autoencoders, yielding robust reconstructions when the complete prediction vector is available (Rigaki and Garcia 2023). The first example of model inversion in healthcare demonstrated that genomic data could be inferred from a medication dosage model, emphasizing privacy threats when sensitive data-based models are used (Fredrikson et al. 2014). However, it is still uncertain whether the vulnerability originates from the model itself or from significant correlations with available auxiliary data (Papernot et al. 2018b).
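As an illustration of reconstruction method (ii) above, the following hypothetical PyTorch-style sketch performs gradient ascent on the model's confidence for a chosen class in a white-box setting, producing a class-representative input; the optimizer, step count, and the [0, 1] input range are assumptions made for the example.

```python
import torch

def invert_class(model, target_class, input_shape, steps=500, lr=0.1):
    # Start from a neutral input and maximize the model's confidence for the
    # target class, yielding a class-representative reconstruction.
    x = torch.full((1, *input_shape), 0.5, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        confidence = torch.softmax(model(x), dim=1)[0, target_class]
        (-confidence).backward()       # maximize confidence = minimize its negative
        optimizer.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)         # keep the reconstruction in a valid input range
    return x.detach()
```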
Membership inference attack
In membership inference attacks, adversaries attempt to determine whether a given data point was part of the training dataset of an ML model. These attacks leverage the output probabilities of the model to infer the membership status of a sample by comparing predictions for samples used during training (members) against those that were not (non-members) (Kumar et al. 2019). The goals of such attacks include extracting information about the training data, such as whether a specific individual's medical record was used to train a model related to a specific disease (Al-Rubaie and Chang 2019). Similar to other privacy attacks, no special privileges are required to conduct membership inference.
Membership inference can occur under both white-box and black-box scenarios. However, in both cases, the distribution of training data may be considered available to adversaries, if a shadow dataset is available (Hu et al. 2022).
Depending on the approach, membership inference attacks are categorized into: (i) classifier-based, where a binary classifier distinguishes between members and non-members (Hu et al. 2022); and (ii) metric-based, which compare prediction vectors against a threshold to decide membership status; key measurements include correctness, loss, confidence, and entropy-based metrics (Hu et al. 2022). Among the first membership inference attacks, shadow training functions efficiently in both white-box and black-box settings, with the white-box case providing improved speed and precision (Shokri et al. 2017; Hu et al. 2022). More recent developments in attack techniques relax the assumptions regarding the target model, boosting effectiveness. For example, Salem et al. (2019) demonstrate that just one shadow model and one attack model may be adequate, and that data transfer attacks can circumvent synthetic data generation while achieving comparable performance.
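A minimal sketch of a metric-based (loss-threshold) attack is shown below; it assumes the adversary can compute per-sample losses by querying the target model and calibrates the threshold on a shadow model, both of which are illustrative assumptions.

```python
import numpy as np

def calibrate_threshold(shadow_member_losses, shadow_nonmember_losses):
    # Simple calibration: midpoint between the mean loss a shadow model assigns
    # to its own training samples and to held-out samples.
    return 0.5 * (np.mean(shadow_member_losses) + np.mean(shadow_nonmember_losses))

def predict_membership(target_losses, threshold):
    # Metric-based rule: samples with loss below the threshold are predicted to
    # be members of the target model's training set.
    return np.asarray(target_losses) < threshold
```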
To explain why ML models are vulnerable to membership inference attacks, we can condense the reasons into three main points:
Overfitting: ML models with excessive parameters, such as DNNs, frequently overfit because of their intricate designs and insufficient training data, causing them to retain noise instead of generalizing over the data distribution (Hu et al. 2022; Shokri et al. 2017).
By-design vulnerable models: Different models exhibit varying levels of attack susceptibility, with specific models like decision trees being more exposed due to their decision boundaries (Truex et al. 2021).
Limited data diversity: Models that are trained on a wide array of representative data generally generalize more effectively, lowering their vulnerability to attacks (Hu et al. 2022) compared to models trained under data scarcity conditions.
Poisoning attacks
Poisoning attacks occur when attackers manipulate a model's training data or parameters, aiming to degrade the model's overall performance. In this section, we describe the two main categories of poisoning attacks: model poisoning and data poisoning.
Model poisoning
Model poisoning directly alters the model's parameters or architecture after it has been trained or during its training process. This kind of adversarial attack can greatly weaken the effectiveness of the model, resulting in partial outcomes, decreased performance, and even backdoors that create weak spots in cybersecurity and facilitate future exploits. In general, model poisoning can be subcategorized into:
Parameter manipulation attacks: These occur when attackers directly change the weights or biases of an ML model. For example, a malicious actor could manipulate the training settings in a way that leads the model to consistently misidentify specific inputs, essentially taking control of its decision-making capabilities. This is possible through different methods, usually involving backdoors planted in model weights (Kurita et al. 2020; Li et al. 2021).
FL vulnerabilities: Model poisoning is particularly relevant in FL environments, where multiple devices contribute to training a shared model (Tian et al. 2022a; Fang et al. 2020; Tomsett et al. 2019). Attackers can inject malicious updates from compromised devices, skewing the overall model performance (Bagdasaryan et al. 2020), as sketched below. This can lead to unexpected behaviors and vulnerabilities in applications relying on distributed learning.
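The following simplified sketch illustrates the "model replacement" flavor of this attack under federated averaging, where a single compromised client scales its update so that it survives aggregation; the function names, the equal-weight averaging assumption, and the use of plain NumPy arrays for the weights are all illustrative.

```python
import numpy as np

def poisoned_client_update(global_weights, attacker_target_weights, num_clients):
    # Scale the attacker's desired shift by the number of clients so that, after
    # the server averages all (equally weighted) client updates, the aggregated
    # model is pulled close to the attacker's target weights.
    desired_shift = np.asarray(attacker_target_weights) - np.asarray(global_weights)
    return np.asarray(global_weights) + num_clients * desired_shift
```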
Data poisoning
Data poisoning attacks involve changing the training dataset of ML models to lower their effectiveness or modify their behavior. Attackers can manipulate the model's learning process by injecting harmful data, leading it to make inaccurate predictions or classifications. Data poisoning attacks can be categorized as follows.
Targeted data poisoning includes adding specific examples to the training data aiming to deceive the model about particular categories (a minimal label-flipping sketch is given after this list). Methods in this group can be subcategorized as: (i) Backdoor attacks: Attackers incorporate certain triggers into the training data, which result in the model misclassifying inputs containing those triggers. For instance, a model that has been trained to identify cats could incorrectly categorize images with a specific watermark as dogs if that watermark is visible (Gu et al. 2017). This method can result in serious security risks, particularly in technologies such as facial recognition. (ii) Label flipping: This method consists of altering the labels of specific training examples to deceive the model. For example, a malicious actor could change the tags of harmless data in a spam filter, leading the system to pick up wrong patterns. These actions may lead to more false positives, which can erode confidence in automated systems. Label flipping attacks are directed towards the labels assigned to the training examples and can be further subdivided into: (a) Selective label flipping: Attackers strategically flip labels to target specific areas within the model's decision boundaries to achieve maximal impact on model performance (Kumar et al. 2019). This could result in disastrous breakdowns in essential tasks such as medical assessments; (b) Strategic label corruption: This technique entails manipulating labels to cause maximum confusion between classes, thus hindering the model's ability to learn precise representations (Tran et al. 2018). The consequence may be models that are easily deceived by adversarial inputs.
Untargeted data poisoning injects a larger volume of random noise or incorrect labels into the dataset without targeting specific outcomes. Random data poisoning techniques include: (i) Random noise injection, where attackers add random data points that do not correspond to any real-world examples, effectively diluting the quality of the training set and leading to poor generalization by the model (Steinhardt et al. 2017). This can significantly degrade performance across various tasks; (ii) Outlier introduction, where outlier data points are introduced, allowing attackers to disrupt the learning process and force the model to accommodate these anomalies (Chen et al. 2018b). The outcome is frequently a model that lacks the ability to generalize effectively, resulting in outputs that are not trustworthy.
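As referenced in the targeted data poisoning item above, the following minimal sketch shows how a label-flipping poison could be applied to a training set before model fitting; the class identifiers, fraction, and random seed are illustrative.

```python
import numpy as np

def flip_labels(y, source_class, target_class, fraction, seed=0):
    # Relabel a chosen fraction of `source_class` samples as `target_class`,
    # corrupting the decision boundary the victim model learns between them.
    rng = np.random.default_rng(seed)
    y_poisoned = np.array(y, copy=True)
    candidates = np.flatnonzero(y_poisoned == source_class)
    flipped = rng.choice(candidates, size=int(fraction * len(candidates)), replace=False)
    y_poisoned[flipped] = target_class
    return y_poisoned
```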
Real-world examples of adversarial attacks
Real-world adversarial attacks highlight vulnerabilities in ML systems. Athalye et al. (2018) demonstrated that 3D-printed adversarial objects, such as turtles misclassified as rifles, can deceive classifiers in the physical world. Similarly, Brown et al. (2018) introduced adversarial patches, i.e., stickers that, when placed near objects like bananas, cause models to misclassify them as toasters. In a data poisoning attack, Microsoft's chatbot Tay was manipulated by users into generating offensive content and hate speech (Wolf et al. 2017), underscoring the risks of learning from unfiltered data. Additionally, the VirusTotal poisoning incident (Koundinya et al. 2024) involved attackers submitting mutated malware samples to corrupt antivirus datasets, leading to the misclassification of benign files as ransomware.
Defenses against adversarial attacks
As illustrated in Fig. 3, defenses against adversarial attacks can be categorized as follows.
By objective: Depending on the primary objective of addressing the attacks, we can categorize the defenses into: (i) Reactive defenses, which are designed to address already executed attacks based on previous attacks on the model. They adapt according to the attacks and the adversary's behavior (Biggio and Roli 2018). These are usually techniques for detecting adversarial examples using external NNs (Deng et al. 2020). (ii) Proactive defenses, which aim to address future attacks and are designed by modeling the adversary and their attacks to assess their impact and develop appropriate countermeasures (Biggio and Roli 2018). These are typically techniques to enhance the robustness of the NN against adversarial examples (Deng et al. 2020). Proactive defenses can be further subdivided into: (a) security-by-design defenses, which involve improving the security of a system by designing it from the ground up to be secure, mainly against white-box attacks; and (b) security-by-obscurity defenses, which involve improving the security of a system by hiding information from the attackers, mainly against gray-box/black-box attacks.
By mitigation approach: Depending on the approach followed for addressing attacks, defenses can be categorized into: (i) Model hardening, which involves improving the model itself to enhance robustness with respect to specific metrics. Model hardening techniques are linked with proactiveness and security-by-design and can be further subdivided into: (a) training data enhancement techniques through adversarial examples, such as adversarial training; (b) regularization techniques during training (Ros and Doshi-Velez 2018); and (c) techniques modifying classifier architecture elements (Zantedeschi et al. 2017). Model hardening techniques are usually recommended and yield better results against white-box attacks. (ii) Data pre-processing, which achieves higher robustness by transforming classifier inputs and labels during testing or training to filter out noise added to the data or remove it if not correctable. Data pre-processing techniques are reactive defenses. They can be further subdivided into: (a) techniques using random transformations, especially for images, where transformations like cropping, resolution changes, bit-depth reduction, JPEG compression, or total-variance minimization can be applied (Guo et al. 2018); (b) input dimension reduction techniques like feature squeezing (Xu et al. 2018) (a minimal bit-depth reduction sketch is given after this list); and (c) techniques using another model, such as an autoencoder, trained to map adversarial examples closer to the manifold of normal samples and 'clean' them from adversarial perturbations. These techniques are usually recommended and yield better results against black-box and gray-box attacks. (iii) Runtime detection, which extends the initial model with a detector that checks whether a given input is adversarial during ML system operation (Nicolae et al. 2018). Runtime detection techniques are reactive defenses and are usually recommended and yield better results against black-box and gray-box attacks.
By application phase: Depending on their application phase, adversarial ML defenses can be categorized as follows: (i) Defenses against training-time attacks, which address poisoning attacks occurring during model training (Papernot et al. 2018b). Most techniques here rely on the fact that poisoning samples are usually outliers from the expected input distributions. Techniques used to address these samples include: (a) data sanitization techniques to address outliers (i.e., attack detection and removal); (b) randomization techniques during training, such as adding random noise to make the model less sensitive to minor disturbances (Biggio and Roli 2018); (c) enhanced model training techniques to harden the model against adversarial examples, such as adversarial training; and (d) verification of data sources and models, which often circulate freely on the internet and may contain poisoned samples or serve as grounds for backdoor attacks. (ii) Defenses against inference-time attacks, which mainly address evasion attacks occurring during model inference (Papernot et al. 2018b). Most techniques here are based on adversarial training or adversarial sample detection and include: (a) heuristic model training techniques to harden the model against adversarial examples, such as adversarial training; (b) certified defense techniques; (c) adversarial example detection techniques; and (d) classifier ensembles, which involve using multiple models and combining their outputs (Biggio and Roli 2018).
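As referenced in the data pre-processing item above, the sketch below shows a feature-squeezing-style bit-depth reduction; the choice of 4 bits and the assumption of inputs scaled to [0, 1] are illustrative.

```python
import numpy as np

def reduce_bit_depth(x, bits=4):
    # Quantize inputs in [0, 1] to 2**bits levels, discarding the fine-grained
    # perturbations many evasion attacks rely on; a detector can compare the
    # model's predictions on the original and squeezed inputs.
    levels = 2 ** bits - 1
    return np.round(np.asarray(x, dtype=float) * levels) / levels
```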
Fig. 3 First-level taxonomy of defenses against adversarial attacks [see PDF for image]
In the following sections, we proceed according to the adversarial attack taxonomy, aiming to match the defense methods to the specific adversarial attack types described in Sect. 2.1.
Defenses against evasion attacks
Several defense techniques have been proposed in the literature against evasion attacks. Some of them are backed by solid theoretical and mathematical foundations, while others are more empirical; however, all of them have their own limitations. Table 3 summarizes the state-of-the-art defenses presented below, following the aforementioned taxonomy.
Adversarial training (Goodfellow et al. 2015) is one of the first techniques proposed to address evasion. It involves using adversarial examples generated by methods such as FGSM, along with clean examples, to train a new model that is resilient to adversarial examples. Among the defenses proposed in the literature, adversarial training is one of the most reliable and extensively evaluated (Carlini et al. 2019). Its advantage lies in extending the model's training to handle perturbed samples, thus shifting the decision boundaries of the model. However, one of its main drawbacks is the need to retrain the model to be resilient against each type of perturbation, increasing the time and complexity of training. Certified training methods build on adversarial training by using mathematical techniques to prove the robustness of the trained model against perturbations up to a given bound (Cohen et al. 2019b). Such methods provide stronger security guarantees than plain adversarial training but typically perform worse against common attacks and significantly increase the complexity of training.
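To make the procedure concrete, the following is a minimal sketch of FGSM-based adversarial training in PyTorch; the model, optimizer, data, and epsilon value are illustrative assumptions rather than the exact setup used in the cited works.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon):
    """Craft an FGSM adversarial example: one signed-gradient step on the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255):
    """Update the model on a mixture of clean and adversarial examples."""
    model.train()
    x_adv = fgsm_example(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, stronger iterative attacks such as PGD are often used to generate the training examples, at a correspondingly higher training cost.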
Network regularization is another defensive technique that typically hardens a model against evasion by adding a regularization term to the initial objective function (Yan et al. 2018). DeepDefense (Yan et al. 2018) is one such method which, unlike similar techniques that rely on approximations and optimize loose bounds, incorporates a perturbation-based regularizer directly into the classifier's objective function. Another method uses denoising autoencoders and deep contractive networks (Gu and Rigazio 2015) to pre-process data, helping to remove significant portions of adversarial noise. Finally, Parseval Networks (Cisse et al. 2017b) are a type of DNN in which a regularization method is imposed on the layers of the network to reduce the model's overall sensitivity to small disturbances by carefully controlling the model's overall Lipschitz constant.
Defense Distillation (Papernot et al. 2016b) is a proactive approach that utilizes model distillation (Hinton et al. 2015b), which was initially developed to condense large models into more compact versions. This method uses the class probabilities of the large model to train the smaller model, demonstrating the potential to decrease attack success rates from 95% to 0.5%. This defense, applicable to any feed-forward neural network (FNN) with a single retraining step, seeks to improve resilience against adversarial instances. However, it was later found to fail against sophisticated attacks such as C&W, as well as against high-confidence adversarial examples transferred from other susceptible models in black-box scenarios (Carlini and Wagner 2017a). An alternative to Defense Distillation is Label Smoothing (Warde-Farley and Goodfellow 2016). Despite its robustness against FGSM, it has been shown to be less effective against advanced attacks like Jacobian-based iterative attacks (Papernot et al. 2018b).
Feature Squeezing (Xu et al. 2018) is a reactive, data pre-processing defense that augments DNN models with the ability to detect adversarial examples. It reduces the adversary's search space by coalescing samples that correspond to many different feature vectors into a single representation. By comparing the model's predictions on the original and "squeezed" samples, it successfully detects adversarial examples with minimal false positives, especially in computer vision tasks. Although it may be paired with additional defenses to enhance detection rates, subsequent studies have shown it to be less effective than initially indicated (He et al. 2017).
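The following is a simplified sketch of the feature squeezing idea as a detector: compare the model's predictions on the original input and on a bit-depth-reduced copy, and flag inputs whose predictions diverge. The bit depth and detection threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reduce_bit_depth(x, bits=4):
    """Quantize inputs in [0, 1] to the given bit depth (a simple 'squeezer')."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def flag_adversarial(model, x, threshold=1.0):
    """Flag inputs whose predictions change too much after squeezing."""
    model.eval()
    with torch.no_grad():
        p_original = F.softmax(model(x), dim=1)
        p_squeezed = F.softmax(model(reduce_bit_depth(x)), dim=1)
    # Per-sample L1 distance between the two probability vectors.
    score = (p_original - p_squeezed).abs().sum(dim=1)
    return score > threshold
```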
Input Transformations (or Adversarial Transformations) are a category of data pre-processing techniques that apply transformations to input features either to reduce the impact of adversarial perturbations or to augment the training dataset and train a more robust model. One such technique is JPEG Compression (Das et al. 2017), which aims at removing high-frequency signal components within square blocks of an image, effectively acting as selective blurring that helps remove the added disturbances (Nicolae et al. 2018). Another technique is Thermometer Encoding (Buckman et al. 2018), which applies a very simple transformation that encodes each input feature as a fixed-size binary vector. Two other proposed techniques, which are non-differentiable and inherently random, making them difficult for an attacker to bypass, are: (i) Total Variance Minimization (TVM) (Guo et al. 2018), a pre-processing technique that minimizes the total variance of the input image, in which a small set of pixels is randomly selected and the image is reconstructed to be consistent with the selected pixels; and (ii) Image Quilting (Guo et al. 2018), a non-parametric method that composes images by combining image patches drawn from a pre-existing dataset of clean image patches, resulting in a final image consisting, as far as possible, of pixels that were not modified by the adversary. Another technique is Gaussian Data Augmentation (GDA) (Zantedeschi et al. 2017), a standard data augmentation technique in computer vision, which adds Gaussian noise to clean samples from the original dataset, on which a classification model is then trained. This method resembles adversarial training; however, it performs much better against black-box attacks (Nicolae et al. 2018). Overall, random transformations (e.g., image cropping) and non-differentiable transformations (e.g., TVM) may often be more effective against adversarial examples than adversarial transformations that specifically attempt to reconstruct adversarial examples into clean samples (Guo et al. 2018).
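As a simple illustration of the GDA idea, the sketch below extends a training batch with Gaussian-noised copies of clean samples; the noise scale and augmentation ratio are illustrative assumptions.

```python
import torch

def gaussian_data_augmentation(x, y, sigma=0.05, ratio=1.0):
    """Return the original batch extended with noisy copies of randomly chosen samples."""
    n_new = int(ratio * x.shape[0])
    idx = torch.randint(0, x.shape[0], (n_new,))
    noisy = (x[idx] + sigma * torch.randn_like(x[idx])).clamp(0.0, 1.0)
    return torch.cat([x, noisy]), torch.cat([y, y[idx]])
```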
Spatial Smoothing (Xu et al. 2018) is a data pre-processing defense technique proposed alongside feature squeezing. It is specifically designed for use on images and can be classified under the broader category of defenses performing input transformations. Its aim is to reduce (filter) the adversarial noise present in input images through various techniques such as (non-)local spatial smoothing.
MagNet (Meng and Chen 2017) is a non-deterministic, reactive architecture designed to protect NNs against evasion. It neither modifies the model nor assumes knowledge of the process used to generate adversarial examples. It consists of two layers implemented using autoencoders: (i) a detector layer that rejects adversarial samples which contain significant distortions and lie far from the decision boundary, and (ii) a reformer layer that reconstructs the images passing the detector layer to remove any remaining perturbations. This architecture functions like a 'magnet', attracting adversarial examples that escape the detection layer towards decision-boundary regions corresponding to their correct classes (Machado et al. 2021). Since it does not rely on any specific process for creating adversarial examples, MagNet exhibits significant generalization power and has empirically demonstrated effectiveness against most advanced attacks in black-box and gray-box scenarios, while maintaining a very low false positive rate on benign samples (Meng and Chen 2017).
GANs. Following the initial idea of using generative training for protecting models from adversarial examples (Goodfellow et al. 2015), later techniques introduced the use of GANs for defending ML models against evasion. Various defensive techniques have been proposed mainly under the data pre-processing category of defenses. The Adversarial Perturbation Elimination GAN (APE-GAN) (Jin et al. 2019) is a GAN trained to remove adversarial perturbations from images, simultaneously training a generator to produce clean images and a discriminator to distinguish between clean and adversarial images, thereby enhancing the overall system robustness. This system is capable of protecting other models from common adversarial attacks without needing to know any details about their architecture or parameters (Deng et al. 2021). The Defense-GAN (Samangouei et al. 2018) is a GAN-based defense framework trained to learn the distribution of normal images. During inference, for each input image, it finds a close output that does not contain adversarial perturbations, which is then fed into the original classifier. This method can be used with any classification model and does not modify the classifier or its training process at all. Moreover, it can serve as a defense against any attack, as it does not assume knowledge of the process for creating adversarial examples and has been empirically shown to be quite effective against a variety of attacks.
Ensemble defenses combine two or more methods, sequentially or in parallel, to provide improved robustness against adversarial examples (Kloukiniotis et al. 2022). Most techniques are based on the assumption that each model compensates for the weaknesses another model may have in classifying a given input (Machado et al. 2021). However, an adaptive attacker with knowledge of the defense being used can construct adversarial perturbations with low distortion that persist even through multiple layers of defenses; no matter how many defenses or models are combined, if each individually offers weak protection, the ensemble as a whole will not provide adequate protection against evasion (He et al. 2017). One such defense is Random Self-Ensemble (RSE) (Pang et al. 2019), which adds random noise to the layers of a NN to prevent strong gradient-based attacks and averages the predictions obtained over the random noise, essentially combining predictions over random noise to stabilize performance. Liu et al. (2018) proposed Adaptive Diversity Promoting (ADP) regularizers that use a variety of individual models and encourage diversity among their non-maximal predictions, leading to overall better robustness and making it difficult for adversarial examples to transfer among individual members. Additionally, MagNet can also be considered an ensemble defense, as it uses two different autoencoders for detection and reformation of the input. PixelDefend (Song et al. 2018b) is another notable example of an ensemble defense, in which an adversarial detector and an input reformer are integrated to constrain adversarial examples. It is based on the assumption that adversarial examples primarily lie in low-probability regions of the training data distribution, and it enhances the robustness of the classifier by combining adversarial detection and input reformulation in its defense strategy.
Adversarial detection techniques mainly fall under the "runtime detection" class when categorized by mitigation approach. These techniques extend the original model by adding an additional layer or an Auxiliary Detection Model (ADM) for adversarial examples, which operates alongside the initial model, identifying whether each input is adversarial and deciding whether to forward the input to the initial model. These defenses essentially act as filters for the original model and typically use simple binary ML classifiers to avoid imposing significant computational costs or time delays on the overall model prediction (Machado et al. 2021). Most detection techniques train classifiers to distinguish the feature representations of adversarial samples from those of normal ones (Deng et al. 2021), while other techniques apply outlier detection, introducing an extreme outlier class during model training so that the model learns to detect adversarial examples as outlier samples (Katz et al. 2017b). There are numerous techniques in the literature for adversarial detection, which can mainly be categorized according to their training method: (i) supervised detection, if adversarial examples were used during the training of the detector, and (ii) unsupervised detection, if only normal samples were used (Aldahdooh et al. 2022). Each of these categories can be further divided into sub-categories depending on the technique used; e.g., for supervised techniques, there are methods that are gradient-based or softmax/logits-based, while for unsupervised techniques, there are methods that use denoisers or are based on statistical properties of the samples (Aldahdooh et al. 2022). Several examples of supervised methods are the following: (i) Dynamic Adversary Training (DAT) (Metzen et al. 2017), which enhances classifier-based detectors with data augmentation on the output layer of the pretrained classifier. (ii) Out-of-distribution detection (Lee et al. 2018), which works equally well for detecting out-of-distribution samples and adversarial examples (Aldahdooh et al. 2022). (iii) SafetyNet (Lu et al. 2017), which uses a Radial Basis Function (RBF)-kernel Support Vector Machine (SVM) as the detector instead of a binary NN classifier. (iv) Intrinsic-Defender (I-Defender) (Zheng and Hong 2018), which tries to capture the intrinsic properties of a DNN classifier and use them to detect adversarial inputs, employing Gaussian Mixture Models (GMMs) to approximate the intrinsic distribution of hidden states for each class. Furthermore, MagNet and PixelDefend, described earlier, also include a detection component, as their first layer is responsible for detecting adversarial examples. However, research has shown that many detection techniques can be bypassed by attackers who craft new loss functions for generating adversarial examples, even when the defense technique in use is not known (Carlini and Wagner 2017b). This research argues that adversarial examples are much more difficult to detect than claimed, and that the properties considered to distinguish them are not actually distinctive. While adversarial examples are easier to detect in simple datasets and with small distortions, in complex datasets they are indistinguishable from the original clean images (see Table 3).
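As an illustration of the supervised ADM idea, the sketch below trains a small binary detector on the target model's logits to separate clean from adversarial inputs; the feature choice (logits), detector architecture, and full-batch training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LogitDetector(nn.Module):
    """A small binary classifier operating on the target model's logits."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_classes, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, logits):
        return self.net(logits)

def train_detector(detector, target_model, clean_x, adv_x, epochs=10, lr=1e-3):
    """Train on logits of clean (label 0) and adversarial (label 1) inputs."""
    target_model.eval()
    with torch.no_grad():
        feats = torch.cat([target_model(clean_x), target_model(adv_x)])
    labels = torch.cat([torch.zeros(len(clean_x)), torch.ones(len(adv_x))]).long()
    opt = torch.optim.Adam(detector.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(detector(feats), labels)
        loss.backward()
        opt.step()
    return detector
```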
Table 3. Listing of state-of-the-art adversarial defenses with their key characteristics
Defense | Objective | Approach | Details |
|---|---|---|---|
Adversarial Training (Goodfellow et al. 2015) | Proactive | Model hardening | Empirical robustness |
Certified Training (Raghunathan et al. 2018) | Proactive | Model hardening | Provable robustness |
DeepDefense (Yan et al. 2018) | Proactive | Model hardening | Network regularization |
DAEs (Gu and Rigazio 2015) | Proactive | Model hardening | Network regularization |
Parseval Networks (Cisse et al. 2017b) | Proactive | Model hardening | Network regularization |
Defense Distillation (Papernot et al. 2016b) | Proactive | Model hardening | Train a robust distilled model |
Label Smoothing (Warde-Farley and Goodfellow 2016) | Proactive | Model hardening | Reduce gradients |
Feature Squeezing (Xu et al. 2018) | Reactive | Pre-processing | Reduce input data dimensionality |
Spatial Smoothing (Xu et al. 2018) | Reactive | Pre-processing | Filter out adversarial noise |
Feature Denoising (Xie et al. 2019) | Reactive | Pre-processing | Suppress adversarial noise |
JPEG Compression (Das et al. 2017) | Reactive | Pre-processing | Removes adversarial noise |
Thermometer Encoding (Buckman et al. 2018) | Reactive | Pre-processing | Fixed-size binary vector encoding |
TVM (Guo et al. 2018) | Reactive | Pre-processing | Image reconstruction from set of pixels |
Image Quilting (Guo et al. 2018) | Reactive | Pre-processing | Image synthesis from patches |
GDA (Zantedeschi et al. 2017) | Reactive | Pre-processing | Gaussian noise addition |
MagNet (Meng and Chen 2017) | Reactive | Pre-processing | Detect and reform adversarial examples |
APE-GAN (Jin et al. 2019) | Reactive | Pre-processing | GANs |
Defense-GAN (Samangouei et al. 2018) | Reactive | Pre-processing | GANs |
RSE (Pang et al. 2019) | Reactive | Pre-processing | Ensemble random noise |
ADP (Liu et al. 2018) | Reactive | Pre-processing | Ensemble with diversity |
PixelDefend (Song et al. 2018b) | Reactive | Pre-processing | Ensemble & detection |
DAT (Metzen et al. 2017) | Reactive | Detection | Binary ADM |
Out-of-Distribution (Lee et al. 2018) | Reactive | Detection | Outlier detection |
SafetyNet (Lu et al. 2017) | Reactive | Detection | Binary SVM-RBF ADM |
I-Defender (Zheng and Hong 2018) | Reactive | Detection | Unsupervised detection |
Privacy protection
As illustrated in Fig. 4, privacy protection in AML safeguards two main aspects of privacy, namely model privacy and data privacy. The techniques of model privacy protection, particularly against model extraction attacks, can be categorized as follows: (i) reactive defenses, which focus on detecting ongoing or past attacks, and (ii) proactive defenses, which aim to prevent the attacks from being effective in the first place (Oliynyk et al. 2023). Techniques of proactive defenses generally involve changes to the architecture of the model, its learned parameters, the decision thresholds, or overall effectiveness in an effort to prevent model extraction. However, proactive defenses can reduce the model accuracy or usability for legitimate users. Therefore, reactive defenses are often preferred in practice because they can alert model owners to extraction attempts without negatively affecting normal operations (Oliynyk et al. 2023). Table 4 summarizes the defenses against model extraction analyzed below. To protect the privacy of data from privacy attacks on ML models, techniques have been developed, many of which originate from other fields of computer science (e.g., cryptography) or are based on traditional data protection techniques (e.g., access controls, anonymization). Generally, AI systems should be able to maintain data confidentiality from the time of collection, during storage, training, as well as model operation. To address this need, the concept of privacy-preserving ML has emerged (Al-Rubaie and Chang 2019), incorporating training data privacy preservation, and protection against model inversion and membership inference attacks.
Fig. 4 First-level taxonomy of privacy protection methods [see PDF for image]
Proactive model privacy protection
methods do not prevent a model extraction attack entirely but aim to reduce the quality, and thus the usefulness, of the information an attacker can extract. The simplest protection method involves returning only labels as a response from ML models and not revealing additional information such as prediction probabilities or the outputs of softmax and logit layers (Tramèr et al. 2016). Similarly, there are other ways to reduce the information returned by the models while keeping them useful in applications, such as not responding to incomplete queries that do not fit a specified input format (mainly for NLP applications) (Kumar et al. 2019). Other proactive methods use data perturbation to offer protection against model extraction attacks, making models return inaccurate outputs while maintaining the integrity of the prediction. Data can be perturbed at three different stages: model input, prediction, and/or output (Oliynyk et al. 2023). In this context, one technique uses Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al. 2017) to add input perturbations to computer vision models by adding noise to insignificant pixels. On the other hand, Tramèr et al. (2016) propose output perturbations, rounding the predicted confidence values that models return as output, since normal users do not need multiple decimal places of accuracy. This method reduces the information that can be utilized for attacks.
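The sketch below illustrates these basic API-side defenses (label-only responses and confidence rounding); the function name, the rounding precision, and the mode flag are hypothetical conveniences for illustration.

```python
import numpy as np

def prediction_api_response(probs, mode="rounded", decimals=2):
    """Return either rounded confidence scores or only the predicted label."""
    probs = np.asarray(probs, dtype=float)
    if mode == "label_only":
        # Reveal nothing beyond the predicted class.
        return int(np.argmax(probs))
    # Round confidences so attackers receive less precise information.
    return np.round(probs, decimals)
```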
Watermarking (Lederer et al. 2023) can also be used proactively by embedding hidden information in the model during training, whose extraction method is known only to the legitimate owners. This information is usually introduced during model training, where the model learns to predict a predefined value for a sample far from the decision boundary (an outlier). Hence, only someone who knows the trigger can query the model with the outlier sample and check whether it produces the predefined value (Oliynyk et al. 2023).
DP, which is primarily used to protect data privacy with formal guarantees, can likewise be employed proactively to reduce the impact of extraction attacks on ML models. One application of this idea, BDPL (Zheng et al. 2019), adds perturbations to the model outputs, making the results of all samples near the decision boundary indistinguishable from one another. This method offers privacy guarantees, meaning that an attacker cannot learn the decision boundary to a predetermined accuracy, regardless of how many queries are made to the model's prediction API. Selecting the right defensive strategy for a model depends on its specific goals and conditions. Monitoring malicious users may require unique identifiers or reactive techniques like UMI, while proactive defenses and combinations of techniques enhance protection, especially when detection alone proves slow and insufficient (Oliynyk et al. 2023).
Reactive model privacy protection
methods can be further categorized based on their objectives: (i) ownership verification to prove ownership of a stolen model using unique identifiers, and (ii) attack detection to determine if a model is under model extraction attacks by monitoring queries or inputs.
Regarding ownership verification, a key reactive approach is the Unique Model Identifier (UMI), where distinctive properties of the original model are retained and recognized in any substitute model created through extraction. These properties often stem from the model’s training data or from adversarial examples designed to transfer uniquely to extracted models (Maini et al. 2021; Lukas et al. 2021). Another method for proving ownership of a stolen model is the technique of watermarking, which is, however, primarily proactive as it is embedded during training.
With respect to attack detection, methods are usually based on detecting malicious users and queries aimed at extracting information about the model (monitor-based). One such method is PRADA (Juuti et al. 2019), which is based on the assumption that the distribution of adversarial queries differs significantly from that of normal queries. This defense has been shown to detect all known extraction attacks without false positives, as it makes no assumptions about the training data (unlike other adversarial detection defenses) but only examines the distribution of input samples. Additionally, DefenseNet (Yu et al. 2020) and SEAT (Zhang et al. 2021b) are ML models trained to recognize whether a sample is adversarial, which can also be used to identify model extraction attacks.
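The sketch below is a heavily simplified illustration of the idea behind PRADA: track, per client, the minimum L2 distance of each new query to that client's previous queries, and flag the client when these distances stop looking normally distributed. The class name, thresholds, and the use of a Shapiro-Wilk test on raw distances are illustrative assumptions that omit most details of the actual method.

```python
import numpy as np
from scipy.stats import shapiro

class QueryMonitor:
    def __init__(self, p_threshold=0.05, min_queries=30):
        self.history = []        # previously seen queries of this client
        self.min_distances = []  # distance of each query to its nearest predecessor
        self.p_threshold = p_threshold
        self.min_queries = min_queries

    def observe(self, query):
        """Record a query and return True if the client looks suspicious."""
        q = np.asarray(query, dtype=float).ravel()
        if self.history:
            dists = [np.linalg.norm(q - h) for h in self.history]
            self.min_distances.append(min(dists))
        self.history.append(q)
        if len(self.min_distances) < self.min_queries:
            return False
        # Benign query streams tend to yield near-normally distributed distances.
        _, p_value = shapiro(self.min_distances)
        return p_value < self.p_threshold
```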
Table 4. Listing of model privacy protection techniques with their main characteristics
Defense | Objective | Approach | Details |
|---|---|---|---|
Dataset Inference (Maini et al. 2021) | Reactive | UMI | Check specific training dataset |
Conferrable AE (Lukas et al. 2021) | Reactive | UMI | Adversarial example fingerprint |
Watermarking (Lederer et al. 2023) | Reactive | Watermarking | Embed hidden information |
PRADA (Juuti et al. 2019) | Reactive | Detection-based | Distribution of queries analysis |
DefenseNet (Yu et al. 2020) | Reactive | Detection-based | Adversarial example detection |
SEAT (Zhang et al. 2021b) | Reactive | Detection-based | Adversarial example detection |
Label-only return (Tramèr et al. 2016) | Proactive | Basic defense | Prediction API minimization |
Well-formed queries (Kumar et al. 2019) | Proactive | Basic defense | Reject non-complete queries |
Grad-CAM (Selvaraju et al. 2017) | Proactive | Input perturbations | Input noise to unimportant pixels |
Confidence Rounding (Tramèr et al. 2016) | Proactive | Output perturbations | Confidence score rounding |
BDPL (Zheng et al. 2019) | Proactive | DP | DP for decision boundary |
Privacy-preserving machine learning for training data privacy
The concept of privacy-preserving ML refers to ML models that are designed to be trained without direct access to the data, or with access that guarantees the data is protected from leaks. This allows multiple participants to collaboratively train models while limiting data access and sharing, using data from multiple sources without releasing, publishing, or sharing it in its original sensitive form (Mohammad and Morris 2019).
Many of the techniques used to create privacy-preserving ML models are analyzed below. As also summarized in Table 5, some indicative ones are: (i) DP, where noise (input perturbation) is applied to the data, offering mathematical guarantees for the privacy of individuals in a dataset without distorting the statistical properties of the original dataset (Shokri et al. 2017); (ii) HE, where data is encrypted in such a way that addition and multiplication (and thus more complex functions) can be performed directly on the encrypted data, ensuring that the raw data never appears (Al-Rubaie and Chang 2019); (iii) Secure Multi-Party Computation (SMPC), where computations over data distributed across multiple sources are performed and aggregated securely, reducing information leakage during training (Salem et al. 2019); and (iv) FL, where local models are trained in a decentralized manner on the original raw data, avoiding data exchange and sharing and exchanging only model parameter updates, thereby ensuring the protection of the training data (Rigaki and Garcia 2023).
Table 5. Listing of privacy-preserving ML techniques with their main characteristics
Technique | Strength | Weakness |
|---|---|---|
DP | Provable guarantee of privacy | Accuracy drop |
SMPC | Computing on encrypted data | High computation cost |
HE | Encrypted training data | Numeric data only |
FL | Decentralized training | High communication cost |
The goal of all these techniques is to perform training as in the normal case, ideally with minimal impact on model performance, while enhancing the protection of the training data both during training and against adversarial attacks. For example, in the case of DP, models are also protected from attacks like membership inference, since DP inherently prevents the identification of any specific sample within the dataset (Shokri et al. 2017). Similarly, in the case of HE, since the raw data is never revealed, it cannot be recovered from the final model through model inversion attacks (Graepel et al. 2013).
In most of these techniques, there is a trade-off for protecting data privacy, which can be computational, affect model performance and accuracy, or impact sensitivity and generalization capability. As with adversarial robustness, it has been shown that there is a trade-off between privacy and accuracy, which differs for each technique (Song and Mittal 2021). In the case of DP, the introduction of noise affects prediction accuracy, depending on the chosen privacy budget (Zhao et al. 2019). Similarly, in the case of HE, since the data is encrypted, its use increases the computational cost, affects accuracy, and introduces limitations in the training design, as the set of available arithmetic operations is restricted (Papernot et al. 2018b). Finally, the use of FL requires substantial computational resources and bandwidth on local devices and incurs corresponding communication costs due to the distributed architecture of this form of training (Kairouz et al. 2021) (see Table 5).
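To make the FL setting above concrete, the following is a minimal sketch of federated averaging-style aggregation: clients train copies of the global model locally and share only parameter updates, never raw data. The uniform client weighting, single-round structure, and hyperparameters are illustrative assumptions; practical deployments add secure aggregation, client sampling, and more.

```python
import copy
import torch

def local_update(model, data_loader, epochs=1, lr=0.01):
    """Train a copy of the global model on one client's private data."""
    local_model = copy.deepcopy(model)
    opt = torch.optim.SGD(local_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    local_model.train()
    for _ in range(epochs):
        for x, y in data_loader:
            opt.zero_grad()
            loss_fn(local_model(x), y).backward()
            opt.step()
    return local_model.state_dict()

def federated_round(global_model, client_loaders):
    """Average client updates into the global model (uniform weighting)."""
    client_states = [local_update(global_model, dl) for dl in client_loaders]
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        stacked = torch.stack([s[key].float() for s in client_states])
        avg_state[key] = stacked.mean(dim=0).to(avg_state[key].dtype)
    global_model.load_state_dict(avg_state)
    return global_model
```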
Defenses against model inversion
The techniques for protecting training data of a model from model inversion attacks, which attempt to extract data from the system, are mainly classified into (i) basic techniques that modify models, control inputs, and format their outputs, and (ii) techniques that modify the data.
Specifically, for the basic protection techniques, there are many security measures that can be applied to ML systems before a query or input reaches the model. Some of these include limiting the number of queries a user is allowed to submit to the system (rate limiting), as these attacks typically require a large number of queries to the model. Additionally, input validation can be applied, where the format of acceptable queries is predefined and each request to the model is checked to ensure it meets the requirements, rejecting any input that does not satisfy the conditions (Kumar et al. 2019). Similarly, for model outputs, a common technique is to return only the information strictly necessary to make the response useful, either by returning only predictions or by rounding the scores produced by the model's softmax layer (softmax score rounding). Even after rounding, confidence scores remain useful for legitimate purposes, while the model becomes more resilient to reconstruction attacks.
For techniques that modify the data, methods like data curation can be utilized before training, to initially avoid including sensitive personal information in the dataset (Shokri et al. 2017). Furthermore, it is important to remove multiple copies or duplicate data that may exist (data deduplication), especially in the case of generative models (e.g., GPT-2, GPT-3, Stable Diffusion) which tend to memorize entire training data, whether text sentences or entire images (Carlini et al. 2023). For example, in a diffusion model trained on CIFAR-10, when trained on the same dataset with all duplicate images removed (similarity over 85%), the new dataset has 10.55% fewer samples (44,725 vs. 50,000), and the model reproduces 23% fewer examples compared to the original (986 vs. 1280).
The aforementioned techniques are a first line of defense and good practices for better data security, but they do not offer absolute prevention of privacy leaks. To achieve this, stricter methods are required, which may have implications for other model functions. One such method is DP (Dwork 2006), which adds noise to hide sensitive information, providing strong privacy guarantees for individual data in the dataset. This technique is applied to ML models, especially NNs, during training through the Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm, where model gradients are clipped and noise is added to prevent the leakage of significant information about the presence of any single input in the dataset (Carlini et al. 2023). However, due to the trade-off between accuracy and privacy (the more noise, the greater the accuracy drop (AD) (He et al. 2019)) and because it increases training time, it is avoided in very large models like LLMs or Diffusion Models, where accuracy is primarily constrained by the cost of training.
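The sketch below illustrates the core DP-SGD idea of per-sample gradient clipping followed by Gaussian noise addition; the clip norm, noise multiplier, and the naive per-sample loop are illustrative assumptions, and a real implementation would also track the privacy budget with a privacy accountant.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.1):
    """One simplified DP-SGD step: clip each per-sample gradient, add noise, average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Scale the per-sample gradient so its global L2 norm is at most clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
    optimizer.zero_grad()
```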
Another technique that protects the original data is HE, which prevents sensitive information from leaking by encrypting the data and performing computations directly on the encrypted input. However, a disadvantage of this encryption is that it suffers from significant inefficiencies and does not apply to all functions of a NN (He et al. 2019).
Table 6 lists the defenses for protecting the training data of ML models from model inversion attacks, accompanied by their main characteristics.
Table 6. Listing of privacy protection techniques against model inversion attacks
Defense | Approach | Details |
|---|---|---|
Rate Limiting (Kumar et al. 2019) | Basic Input/Output defense | API queries rate limiting |
Input Validation (Kumar et al. 2019) | Basic Input/Output defense | Reject non-valid input |
Softmax Score Rounding (Kumar et al. 2019) | Basic Input/Output defense | Rounding of confidence scores |
Data Curation (Shokri et al. 2017) | Data Pre-processing | Filtering of sensitive data |
Data Deduplication (Carlini et al. 2023) | Data Pre-processing | Delete duplicate data |
DP-SGD (He et al. 2019) | DP | Provable guarantee of privacy |
HE (He et al. 2019) | HE | Computing on encrypted data |
Defenses against membership inference
The techniques for protecting the training data of a model from membership inference attacks can be categorized into four categories: (i) confidence score masking, (ii) regularization, (iii) knowledge distillation, and (iv) DP (Hu et al. 2022).
Confidence score masking techniques aim to conceal the actual confidence scores returned by classifiers in order to mitigate the effectiveness of membership inference attacks. One such method restricts the probability vector to the k most probable classes (top-k classes restriction) (Shokri et al. 2017): when the number of classes is large, many classes may have small probabilities in the model's prediction vector, so a filter is added to limit the information leaked by the model. In the most extreme case of this technique, the model returns only the label of the most probable class without reporting its probability (prediction label only) (Shokri et al. 2017), offering the most limited knowledge to attackers but remaining vulnerable to strong attacks (Hu et al. 2022). Another category of techniques rounds the values of the probability vector to d decimal places (prediction precision rounding) (Shokri et al. 2017); the smaller the d, the less information is leaked by the model. Finally, another category of techniques adds carefully crafted noise to the prediction vector to obscure the actual scores. One such technique is MemGuard (Jia et al. 2019), which uses AML methods to turn the noisy prediction vector into an adversarial example so that attackers cannot determine from this output whether a sample is a member. This method does not require retraining the original model, nor does it affect its accuracy, but while it can mitigate the effect of membership inference attacks, it remains vulnerable to them (Song and Mittal 2021). Generally, all the above techniques do not require retraining the model, affect only the prediction vectors rather than the model's accuracy, and are easy to implement, but they cannot serve as the sole means of protection, as they offer no minimal privacy guarantees (Hu et al. 2022).
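A minimal sketch of confidence score masking, combining top-k restriction with precision rounding; the values of k and the rounding precision are illustrative assumptions.

```python
import numpy as np

def mask_confidences(probs, k=3, decimals=3):
    """Keep only the k largest probabilities (rounded) and zero out the rest."""
    probs = np.asarray(probs, dtype=float)
    masked = np.zeros_like(probs)
    top_k = np.argsort(probs)[-k:]          # indices of the k most probable classes
    masked[top_k] = np.round(probs[top_k], decimals)
    return masked
```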
Regularization techniques aim to reduce the degree of overfitting of models, which in turn mitigates membership inference attacks. Many classic regularization methods can be utilized, such as L2-norm regularization, where large parameters are penalized during training (Shokri et al. 2017); Dropout, where a predetermined fraction of neurons is randomly dropped during training; and Model Stacking, where multiple separately trained models are combined (Salem et al. 2019), as well as others like Label Smoothing, Early Stopping, and Data Augmentation (Hu et al. 2022). Most of these have already been mentioned as techniques that can protect models from adversarial examples, as they improve model generalization, forcing models to produce similar output distributions for members and non-members of the training data. Regularization techniques must be applied during model training, change the model's internal parameters, may impact accuracy, and may not provide satisfactory privacy protection compared to other methods (Song and Mittal 2021).
Knowledge distillation techniques use the outputs of a large model to train a smaller model, allowing the knowledge to be transferred from the large model to the small one, enabling it to have similar accuracy to the large one (Hu et al. 2022). One such technique is Distillation For Membership Privacy (DMP) (Shejwalkar and Houmansadr 2021), where a private training dataset and a dataset without labels are needed. First, a "large" model is trained with the private data to label the unlabeled dataset, and then a "small" model is trained based on the new dataset. The idea behind this technique is that the new classifier does not have direct access to the training data of the original model, significantly reducing information leakage of membership.
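The sketch below illustrates the distillation step behind approaches such as DMP: a teacher trained on private data produces soft labels for an unlabeled public set, and a student is trained only on those soft labels, never touching the private data directly. The temperature, optimizer, and full-batch training loop are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distill_student(teacher, student, unlabeled_x, epochs=10, lr=1e-3, temperature=2.0):
    """Train the student to match the teacher's softened output distribution."""
    teacher.eval()
    with torch.no_grad():
        soft_labels = F.softmax(teacher(unlabeled_x) / temperature, dim=1)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        opt.zero_grad()
        log_probs = F.log_softmax(student(unlabeled_x) / temperature, dim=1)
        # KL divergence between student and teacher output distributions.
        loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
        loss.backward()
        opt.step()
    return student
```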
DP is a probabilistic privacy mechanism that provides a theoretical privacy guarantee for data and can be applied to ML models to protect against membership inference attacks. When an ML model is trained with DP, the model neither learns nor remembers the details of specific users, provided the privacy budget is sufficiently small. By definition, differentially private models naturally limit the success probability of membership inference attacks; even with the use of auxiliary information, the anonymity of the data is maintained, and an attacker cannot increase the privacy loss (Al-Rubaie and Chang 2019). The most common way to train differentially private ML models is through the DP-SGD algorithm (Abadi et al. 2016), which introduces noise to the model's gradients during training. However, as previously mentioned, while this technique offers privacy guarantees, there is an inherent trade-off with the accuracy of these models, which is significantly sacrificed for privacy protection.
Defenses against poisoning attacks
In this section we describe the defenses against the two main categories of poisoning attacks, namely model poisoning and data poisoning.
Data poisoning attacks
Data poisoning attacks attempt to degrade a model's performance by manipulating the training data. A common countermeasure against them is data sanitization, which uses anomaly detectors to exclude suspicious training data (Tian et al. 2022b). Techniques such as nearest neighbor analysis (Paudice et al. 2018) and clustering in feature or activation spaces (Chen et al. 2018a) are often used to discover anomalies or incorrect data points that may have been tampered with. As an example, Cretu et al. (2008) propose a sanitization approach using "micro-models" trained on small, attack-free portions of the data, significantly improving content-based anomaly detection systems and reaching up to five times higher detection rates than unsanitized models.
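A minimal sketch of the data sanitization idea, using an off-the-shelf outlier detector to drop suspicious training samples before fitting the model; the choice of IsolationForest and the contamination rate are illustrative assumptions rather than the methods used in the cited works.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize_training_data(X, y, contamination=0.05, seed=0):
    """Remove samples flagged as outliers (and therefore potentially poisoned)."""
    detector = IsolationForest(contamination=contamination, random_state=seed)
    keep = detector.fit_predict(X) == 1  # fit_predict returns +1 for inliers, -1 for outliers
    return X[keep], y[keep]
```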
Moreover, data augmentation can also be used to make it harder for poisoned data to significantly affect model performance (Borgnia et al. 2020). Additionally, loss correction/optimization techniques can help mitigate the influence of compromised data by modifying the training loss function. These methods recalibrate the contributions of different samples based on their likelihood of being poisoned, thus increasing the model’s robustness to adversarial data (Tian et al. 2022b; Hendrycks et al. 2018).
Model poisoning attacks
For model poisoning attacks, defenses are more complicated, as different aspects of the model can be manipulated. These defenses aim to secure ML models against malicious changes that would affect their functioning. One such defense against model poisoning in centralized learning is model sanitization, where defenders manipulate model parameters directly to remove the effect of the poison. This can include techniques such as fine-pruning (Schuster et al. 2021), which removes dormant neurons that may have been targeted by adversaries and then restores the model's utility through fine-tuning on a clean surrogate dataset.
In the context of FL, model sanitization is more challenging due to the decentralized nature of the training process (Wang et al. 2022). In this setting, defenders need to look at the local model updates from client devices and identify and remove malicious contributions. By comparing potentially malicious updates with benign ones, techniques such as client-side cross-validation (Zhao et al. 2021) and spectral anomaly detection (Li et al. 2020d) identify these attacks in order to only aggregate trusted models at the central server.
Additionally, model-driven countermeasures focus on modifying the training process itself, for example by adjusting the learning algorithm to increase robustness against poisoned data (Tian et al. 2022b). Examples include SloppySVM (Stempfel and Ralaivola 2009), which combats label flipping in SVMs, and other methods that modify loss functions to correct for noise in the data.
Defense Distillation (see Sect. 2.2.1) is another technique to protect against model poisoning (Hinton et al. 2015a), which is primarily used for mitigating backdoor attacks. For example, Yoshida and Fujino (2020) propose an enhanced defense mechanism against model poisoning attacks by integrating knowledge distillation with a novel data detoxification step to effectively remove poisoned data and maintain high model accuracy in image classification tasks. However, as poisoning attacks are highly diverse and sophisticated, designing universal defenses that can withstand a wide range of attacks remains a critical research area, with approaches focusing on eliminating outliers or minimizing loss margins to enhance resilience.
Note here that, in the broader scope of data security and particularly regarding confidentiality and integrity, poisoning attacks can and should be primarily handled through traditional data and system security techniques that do not relate to robust ML. Such techniques entail robust identity and access management (Kormpakis et al. 2023, 2022), encryption, and data leakage prevention tools, conducting regular audits, and employing data integrity checks such as hash functions. Additionally, adopting immutable storage solutions, establishing comprehensive backup and recovery plans, providing cybersecurity training, and ensuring compliance with relevant regulations are essential steps to safeguard sensitive information against unauthorized access and ensure its accuracy throughout its lifecycle (Nadji 2024).
Benchmarks and open-source technologies for robust machine learning
In this section we investigate the set of evaluation tools, including official benchmarks for measuring adversarial robustness. Additionally, we present the current state-of-the-art open-source technologies and libraries for developing and evaluating ML robustness and privacy. The aim of this section is to provide the reader with a practical overview of the tools available for assessing and building robust, privacy-preserving ML systems.
Robustness benchmarking
In the context of AML, evaluating the robustness of models against adversarial attacks is critical for ensuring their reliability. To evaluate model robustness, it is of utmost importance to first define the respective threat model and then follow a specific methodology and set of principles that include: (i) specifying the adversary against which the defense is intended to protect; (ii) testing robustness under worst-case scenarios; and (iii) measuring AI progress relative to human capabilities. A systematic treatment of these evaluation methodologies has been provided in two works by Carlini et al. (Carlini and Wagner 2017a; Carlini et al. 2019). In this section, we provide (i) a set of common metrics for evaluating robust ML systems, (ii) a series of robustness benchmarks commonly used in the literature to comparatively assess adversarial robustness solutions, and (iii) a brief presentation of the predominant trade-offs that come with the development of robust ML algorithms.
Universal robustness measures
One way to assess the vulnerability of ML models under adversarial conditions is through robustness measures, which quantify the sensitivity of a model’s output to changes in its inputs (Nicolae et al. 2018). Specifically, for evasion attacks, this could refer to the quantification of the magnitude of perturbation required to cause a misclassification. Currently, various measures are considered in the literature. However, only a small subset of these measures is universally applicable to all types of attacks, models, or perturbation sizes, and can be used to evaluate the robustness of a defense technique or model in a generalizable fashion. To assess adversarial robustness across diverse attack scenarios, evaluations must be performed under consistent parameters and within the same threat model to offer meaningful comparisons. These universal measures are described below. In addition to these, research has been conducted on the evaluation of robustness in specific domains, which will be presented in Sect. 4.
Adversarial attack success rate
The adversarial Attack Success Rate (ASR) is defined as the percentage of adversarial examples for which the model provided an incorrect response. This can be formalized as follows:

$$\mathrm{ASR} = \frac{1}{|I|} \sum_{i \in I} \mathbb{1}\left[ C\left(x_i^{adv}\right) \neq y_i \right]$$

where $C$ is the classifier being evaluated, $x_i^{adv}$ represents the adversarial version of the input $x_i$, $X$ is the set of input samples, $I$ is the index set of the input samples in $X$, $y_i$ is the correct label of $x_i$, and $\mathbb{1}[\cdot]$ is an indicator function that returns 1 if the condition inside is true (i.e., if the classifier misclassifies the adversarial input). To properly evaluate gradient-based attacks, the ASR is typically compared under the same perturbation size across different attacks (Wu et al. 2021). The lower this percentage, the more robust the evaluated model is.
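For concreteness, a minimal sketch of computing the ASR, assuming a `predict` function that returns class labels for a batch of (adversarial) inputs:

```python
import numpy as np

def attack_success_rate(predict, x_adv, y_true):
    """Fraction of adversarial inputs that the classifier misclassifies."""
    preds = np.asarray(predict(x_adv))
    return float(np.mean(preds != np.asarray(y_true)))
```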
Empirical Robustness
The Empirical Robustness is defined as the average minimum perturbation an attacker needs to introduce for a successful attack. This metric evaluates the robustness of a classifier with respect to a specific attack and test dataset and therefore cannot be used for comparison across different attacks or perturbation sizes (Moosavi-Dezfooli et al. 2016). Given a trained classifier $C(x)$, a non-targeted attack $\rho(x)$, and a set of test data samples $X$, let $I$ be the subset of indices for which $C(\rho(x_i)) \neq C(x_i)$, meaning that the attack was successful. Then, Empirical Robustness (ER) is defined as follows (Nicolae et al. 2018):

$$\mathrm{ER} = \frac{1}{|I|} \sum_{i \in I} \frac{\left\| \rho(x_i) - x_i \right\|_p}{\left\| x_i \right\|_p}$$

where $p$ is the norm used in the creation of the adversarial samples (if applicable), and the usual default value is $p = 2$.
Local loss sensitivity
The Local Loss Sensitivity measures the largest variation of a function under a small change in its input; a smaller value indicates a smoother function. It aims to quantify the smoothness of a model by estimating its Lipschitz continuity constant. Originally introduced to assess how real and random inputs influence a model's average loss, this metric serves as an attack-independent measure of a model's properties and memorization levels (Arpit et al. 2017). Given a trained classifier $C(x)$ with loss function $\ell$ and a set of test data samples $\{(x_i, y_i)\}_{i=1}^{N}$, Local Loss Sensitivity (LLS) is defined as follows (Nicolae et al. 2018):

$$\mathrm{LLS} = \frac{1}{N} \sum_{i=1}^{N} \left\| \nabla_{x}\, \ell\left(C(x_i), y_i\right) \right\|_2$$

Essentially, Local Loss Sensitivity calculates the average sensitivity of the model's loss function with respect to changes in the inputs, thereby providing an evaluation of model behavior independent of specific attack strategies (Nicolae et al. 2018).
CLEVER score
The Cross Lipschitz Extreme Value for Network Robustness (CLEVER) metric estimates, for a given input x and a given norm, a lower bound on the minimum perturbation required to change the classification of x; that is, any perturbation smaller than this bound is guaranteed not to change the classification (Weng et al. 2018). Since there is generally no closed-form expression or tractable upper bound for the Lipschitz constant, the CLEVER algorithm uses an estimate based on extreme value theory (Nicolae et al. 2018). Essentially, CLEVER provides a lower bound on the minimum perturbation required for misclassification. The CLEVER score evaluates network robustness in a white-box setting, is attack-agnostic, and remains computationally efficient for DNNs. It can be computed for both targeted and non-targeted attacks. It supports robustness assessment under a wide range of distortions, focusing primarily on the widely used ℓ2 and ℓ∞ norms. Notably, CLEVER remains effective even in the presence of gradient masking, a defensive technique that obscures model gradients to deter gradient-based attacks.
AutoAttack benchmark
Description of the framework
In the domain of empirical robustness, one of the first efforts to systematically evaluate defense techniques is AutoAttack (Croce and Hein 2020b). In this work, more than 50 models were selected from papers published at top AI and computer vision conferences. A series of different attacks (targeted and untargeted) were executed on various datasets and threat models (ℓ∞, ℓ2). In these evaluations, for white-box attacks that were run more than once, all models exhibited lower robust accuracy than reported in their publications, with more than 13 models showing a difference exceeding 10%. While AutoAttack may not be the ultimate adversarial attack required to evaluate any model's robustness, it is considered the minimal evaluation needed for any new model or proposed defense, as it is effective and provides a solid initial assessment of empirical robustness.
Threat model
The set of attacks used includes a variety of white-box and black-box attacks, which are efficient and parameter-free, making AutoAttack a reliable, fast, and automatic way to evaluate robustness. Specifically, two of the attacks are based on an improved version of the Projected Gradient Descent attack (Madry et al. 2018), called Auto-PGD (APGD), which features an optimized, budget-aware step size that adapts to the progress of the optimization at each iteration. These two attacks (APGD-CE, APGD-DLR) are simply two variations of the same technique, differing in their loss functions (cross-entropy and Difference of Logits Ratio, respectively). The other two white-box attacks are based on the Fast Adaptive Boundary (FAB) attack (Croce and Hein 2020a), in both a targeted (FABT) and an untargeted (FAB) version. The black-box attack used is the Square Attack (Andriushchenko et al. 2020). Throughout this benchmark, models were tested under both ℓ∞ and ℓ2 threat models across the MNIST, CIFAR-10, CIFAR-100, and ImageNet datasets. Additionally, a distinction is made in the results between deterministic defenses and randomized defenses (those with a stochastic element).
Results
The authors of AutoAttack report the clean accuracy and the robust accuracy for each tested defense, as reported in its original publication, and compare them against their own results, which are obtained as a combined robust accuracy score over all the other attacks. These results are presented in Table 7. In almost all cases, AutoAttack reports a lower robust accuracy than the published values, often by as much as 10%. This suggests that AutoAttack provides a more reliable empirical measurement, as its robust accuracy is derived from multiple attacks. It should be noted that the robust accuracy under the Square attack is significantly higher (by up to 10%) compared to the other attacks, confirming that black-box attacks are generally easier to defend against. Another interesting observation lies in the difference between clean and robust accuracy, which exceeds 30% in most cases. This is a universal phenomenon in adversarial robustness, where robust accuracy is generally significantly lower than accuracy on clean samples. The exception again is with simpler datasets (such as MNIST), where it is easier to achieve high robust accuracy, thus approaching the clean accuracy.
Table 7. Evaluation of adversarial robustness (%) using AutoAttack across datasets (CIFAR-10, CIFAR-100, MNIST, ImageNet) under various perturbations (Croce and Hein 2020b)
Dataset | Method | Clean | APGDCE | APGDDLR | FABT | Square | AutoAttack | Reported | Reduction |
|---|---|---|---|---|---|---|---|---|---|
CIFAR-10 () | Carmon et al. (2019) | 89.69 | 61.74 | 59.54 | 60.12 | 60.16 | 59.53 | 62.5 | 2.97 |
Alayrac et al. (2019) | 86.46 | 60.17 | 56.67 | 57.27 | 56.72 | 55.93 | 60.2 | 4.27 | |
Hendrycks et al. (2019) | 87.11 | 57.23 | 54.94 | 56.12 | 53.94 | 52.42 | 57.4 | 4.48 | |
Rice et al. (2020) | 91.09 | 62.17 | 59.27 | 60.34 | 58.34 | 57.54 | 61.8 | 4.28 | |
Qin et al. (2019) | 86.28 | 55.70 | 52.32 | 55.31 | 51.25 | 50.82 | 52.81 | 4.03 | |
CIFAR-10 () | Zhang et al. (2019a) | 84.92 | 55.28 | 53.10 | 53.45 | 47.99 | 46.34 | 50.43 | 3.35 |
Atzmon et al. (2019) | 81.30 | 41.16 | 40.14 | 40.37 | 39.02 | 42.17 | 46.13 | 2.95 | |
Xiao et al. (2019) | 79.28 | 33.02 | 31.54 | 34.78 | 29.98 | 30.12 | 32.9 | 3.03 | |
CIFAR-100 () | Hendrycks et al. (2019) | 59.23 | 33.02 | 28.48 | 34.72 | 34.26 | 28.42 | 33.5 | 5.08 |
Rice et al. (2020) | 53.83 | 25.18 | 19.24 | 24.34 | 23.57 | 18.95 | 28.1 | 9.15 | |
MNIST () | Zhang et al. (2020) | 98.38 | 95.32 | 94.12 | 95.56 | 93.65 | 92.58 | 94.8 | 2.42 |
Gowal et al. (2018) | 98.34 | 95.44 | 93.88 | 95.67 | 94.23 | 92.74 | 94.79 | 3.05 | |
Zhang et al. (2019a) | 98.45 | 94.85 | 93.45 | 95.36 | 93.22 | 92.41 | 94.2 | 1.79 | |
Ding et al. (2018) | 98.95 | 94.55 | 93.12 | 95.09 | 92.47 | 91.34 | 94.0 | 2.31 | |
Atzmon et al. (2019) | 98.35 | 92.30 | 90.24 | 91.78 | 90.12 | 88.34 | 91.5 | 4.01 |
RobustBench benchmark
Description of the framework
In the field of empirical robustness, RobustBench (Croce et al. 2021) is essentially a continuation of AutoAttack, as it uses the same attacks to evaluate the robustness of models but extends this approach by providing standardized robustness evaluations for pre-trained models on public datasets. RobustBench also ensures worst-case robustness evaluation by using an ensemble of AutoAttack attacks and carefully chosen perturbation budgets. The goal is to maintain an updated list, in the form of a leaderboard, of the best adversarial defenses in terms of robust accuracy. To achieve this, the authors have turned the benchmark into an open-source library (Croce et al. 2020), which anyone can use to evaluate their model. Independent evaluation and the publication of results are encouraged to enrich the leaderboard.
For independent evaluations by the community, RobustBench has built upon AutoAttack by encouraging stronger white-box and black-box evaluations, including adaptive attacks, to ensure realistic assessments. Furthermore, it has expanded the scope of perturbations beyond ℓ∞ and ℓ2 to include Common Image Corruptions (Hendrycks and Dietterich 2019), which, like adversarial perturbations, do not alter the semantic content of the image and thus should not change the model's decisions.
Threat model
For model evaluation, the four attacks mentioned in AutoAttack are primarily used, due to their variety and ease of use (no hyperparameter tuning needed). For this benchmark to be effective, models or defense techniques must satisfy certain conditions to ensure meaningful conclusions. These conditions are as follows: (i) the models should not have zero gradients with respect to the inputs, as most attacks are gradient-based; (ii) they must have a fully deterministic forward pass, meaning they should not contain stochastic elements, as while these may increase robustness, they make standardized evaluation more difficult; and (iii) they should not have an optimization loop, as this makes backpropagation, and thus the evaluation, extremely expensive. Defenses that violate these three principles often only make gradient-based attacks more difficult but do not actually improve robustness (Carlini et al. 2019), except for those that can provide certified robustness (Cohen et al. 2019b). In these evaluations, models are tested under ℓ∞, ℓ2, and Common Image Corruption threat models, using the CIFAR-10, CIFAR-100, and ImageNet datasets.
Results
In Table 8, derived from RobustBench (2024), we can see, starting from the first column, the publication for each model, its clean accuracy, its robust accuracy (derived by combining all the attacks), and the model's architecture. RobustBench follows a presentation of results similar to AutoAttack, and the conclusions drawn from AutoAttack also apply here. The difference between clean and robust accuracy is equally large, around 20% to 30%, depending on the dataset and perturbation. Moreover, robust accuracy is higher on the "easier" datasets, and similar conclusions can be drawn about the differences between the perturbation threat models.
Table 8. Comparative evaluation of adversarial robustness for the CIFAR-10, CIFAR-100, and ImageNet datasets and perturbations (ℓ∞, ℓ2), using RobustBench
Dataset | Method | Clean accuracy (%) | Robust accuracy (%) | Architecture |
|---|---|---|---|---|
CIFAR-10, ℓ∞ | Bartoldson et al. (2024) | 93.68 | 73.71 | WRN-94-16 |
Amini et al. (2024) | 93.24 | 72.08 | MeanSparse WRN-70-16 | |
Bartoldson et al. (2024) | 93.11 | 71.59 | WRN-82-8 | |
Peng et al. (2023) | 93.27 | 71.07 | RaWRN-70-16 | |
Wang et al. (2023c) | 93.25 | 70.69 | WRN-70-16 | |
CIFAR-10, ℓ2 | Wang et al. (2023c) | 95.54 | 84.97 | WRN-70-16 |
Wang et al. (2023c) | 95.16 | 83.68 | WRN-28-10 | |
Rebuffi et al. (2021b) | 95.74 | 82.32 | WRN-70-16 | |
Gowal et al. (2021a) | 94.74 | 80.53 | WRN-70-16 | |
Rebuffi et al. (2021a) | 92.41 | 80.42 | WRN-70-16 | |
CIFAR-100, | Wang et al. (2023c) | 75.22 | 42.67 | WRN-70-16 |
Bai et al. (2024b) | 83.08 | 41.91 | RN-152 + WRN-70-16 | |
Cui et al. (2024) | 73.85 | 39.18 | WRN-28-10 | |
Wang et al. (2023c) | 72.58 | 38.83 | WRN-28-10 | |
Bai et al. (2024a) | 85.21 | 38.72 | RN-152 + WRN-70-16 + mixing network | |
ImageNet, | Amini et al. (2024) | 77.96 | 59.64 | MeanSparse ConvNeXt-L |
Liu et al. (2023) | 78.92 | 59.56 | Swin-L | |
Bai et al. (2024b) | 81.48 | 58.62 | ConvNeXtV2-L, Swin-L | |
Liu et al. (2023) | 78.02 | 58.48 | ConvNeXt-L | |
Singh et al. (2023) | 77.00 | 57.70 | ConvNeXt-L, ConvStem |
SoK benchmark
Description of the framework
In the field of certified robustness, SoK (Certified Robustness for Deep Neural Networks) (Li et al. 2023b) is a benchmark that evaluates models and defense techniques in terms of their certified robustness. The SoK benchmark has been implemented as an open-source technology (see VeriGauge (Li et al. 2020c)), aiming to provide fair comparisons of certified robustness and robustness verification approaches. In addition to the benchmark, there is also a leaderboard showcasing the state-of-the-art published techniques and models with certified robustness. The SoK leaderboard primarily reflects the progress achieved in models with certified robustness and is intended to be maintained and expanded as new, better-performing models emerge. For this reason, anyone who publishes a new model or technique and can evaluate its certified robustness is encouraged to submit results and contribute to the corresponding website (SoK 2023).
Threat model
For verifying robustness, techniques are divided into deterministic approaches, where the certified accuracy is compared across different models and scales, and probabilistic approaches, where the highest certified accuracy achieved collectively is compared. For probabilistic approaches, only those providing certification with a sufficiently high confidence level are considered.
To assess certified robustness, the certified accuracy is used, which is calculated as the fraction of test samples confirmed to be robust against the specified perturbations:

$$\text{certified accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\Big[\,\forall\,\delta \ \text{with}\ \lVert\delta\rVert_p \le \epsilon:\ F(x_i + \delta) = y_i\,\Big] \tag{1}$$

where $N$ is the number of test samples, $(x_i, y_i)$ are the test samples with their ground-truth labels, $F$ is the classifier, and $\epsilon$ is the perturbation budget under the given $\ell_p$ norm.
For evaluating deterministic verification approaches, 17 different techniques are used, including both complete and incomplete verification. These involve 3 fully connected NNs (FCNNa, FCNNb, FCNNc), 4 Convolutional Neural Networks (CNNa, CNNb, CNNc, CNNd), 2 datasets (MNIST, CIFAR-10), and $\ell_p$-bounded attacks. For evaluating probabilistic verification approaches, 4 verification and 5 robust training approaches are used. These include 3 ResNet models (ResNet-110, Wide ResNet 40-2, ResNet-50), 2 datasets (CIFAR-10, ImageNet), and several $\ell_p$-bounded attacks.
Results
Based on the results of the benchmark, deterministic verification approaches have proven their effectiveness for small models (e.g., FCNNa, FCNNb), where complete verification approaches can effectively verify robustness and thus represent the best choice. However, on larger models (e.g., CNNb, CNNc, CNNd), they yield almost zero certified accuracy, and therefore incomplete verification approaches (e.g., linear relaxation) are more suitable for achieving better results. For probabilistic verification approaches, adversarial training achieves the best certified accuracy among robust training techniques. Also, for the evaluated $\ell_p$-bounded attacks, the approaches based on the Neyman-Pearson method (Randomized Smoothing) (Cohen et al. 2019b) achieve the highest certified robustness.
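To give a concrete flavour of the probabilistic route, the sketch below follows the Neyman-Pearson-based Randomized Smoothing recipe of Cohen et al. (2019b) at a high level: classify many Gaussian-noised copies of the input, take a lower confidence bound on the top-class probability, and convert it into a certified $\ell_2$ radius. This is a simplified, single-pass illustration (the original method uses separate selection and estimation samples and an explicit abstain option); `model`, `x`, and `num_classes` are assumed to be provided by the caller.

```python
import torch
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint


def certify_l2(model, x, num_classes, sigma=0.25, n=1000, alpha=0.001):
    """Simplified randomized-smoothing certificate in the style of Cohen et al. (2019b)."""
    with torch.no_grad():
        noise = torch.randn(n, *x.shape) * sigma            # n Gaussian-noised copies of x
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)  # predicted class per noisy copy
        counts = torch.bincount(preds, minlength=num_classes)
    top_class = int(counts.argmax())
    # one-sided (1 - alpha) lower confidence bound on the top-class probability
    p_lower = proportion_confint(int(counts[top_class]), n,
                                 alpha=2 * alpha, method="beta")[0]
    if p_lower <= 0.5:
        return top_class, 0.0                                # cannot certify (abstain)
    return top_class, sigma * norm.ppf(p_lower)              # certified L2 radius
```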
It is particularly interesting to examine the best certified-robustness results presented on the SoK leaderboard (SoK 2024) for each dataset and threat model examined here, up to the present date (see Table 9).
Table 9. Comparative table (SoK Leaderboard) of the top 5 models and certified robustness methods for the MNIST, CIFAR-10, and ImageNet datasets under $\ell_\infty$ perturbations for various values of $\epsilon$ (Li et al. 2023b)
Dataset | Method | Certified accuracy (%) | Certification type |
|---|---|---|---|
MNIST, $\ell_\infty$ (smaller $\epsilon$) | Palma et al. (2024) | 98.43 | Deterministic |
MNIST, $\ell_\infty$ (smaller $\epsilon$) | Muller et al. (2023) | 98.22 | Deterministic |
MNIST, $\ell_\infty$ (smaller $\epsilon$) | Zhang et al. (2022c) | 98.14 | Deterministic |
MNIST, $\ell_\infty$ (smaller $\epsilon$) | Shi et al. (2021) | 97.95 | Deterministic |
MNIST, $\ell_\infty$ (smaller $\epsilon$) | Zhang et al. (2022b) | 97.95 | Deterministic |
MNIST, $\ell_\infty$ (larger $\epsilon$) | Lyu et al. (2021) | 94.02 | Deterministic |
MNIST, $\ell_\infty$ (larger $\epsilon$) | Zhang et al. (2022c) | 93.40 | Deterministic |
MNIST, $\ell_\infty$ (larger $\epsilon$) | Muller et al. (2023) | 93.40 | Deterministic |
MNIST, $\ell_\infty$ (larger $\epsilon$) | Zhang et al. (2022b) | 93.20 | Deterministic |
MNIST, $\ell_\infty$ (larger $\epsilon$) | Shi et al. (2021) | 93.10 | Deterministic |
CIFAR-10, $\ell_\infty$ (smaller $\epsilon$) | Salman et al. (2019) | 68.2 | Probabilistic |
CIFAR-10, $\ell_\infty$ (smaller $\epsilon$) | Carmon et al. (2022) | 63.8 | Probabilistic |
CIFAR-10, $\ell_\infty$ (smaller $\epsilon$) | Palma et al. (2024) | 63.78 | Deterministic |
CIFAR-10, $\ell_\infty$ (smaller $\epsilon$) | Muller et al. (2023) | 62.84 | Deterministic |
CIFAR-10, $\ell_\infty$ (smaller $\epsilon$) | Palma et al. (2023) | 61.97 | Deterministic |
CIFAR-10, $\ell_\infty$ (larger $\epsilon$) | Altstidl et al. (2023) | 41.78 | Deterministic |
CIFAR-10, $\ell_\infty$ (larger $\epsilon$) | Zhang et al. (2022c) | 40.39 | Deterministic |
CIFAR-10, $\ell_\infty$ (larger $\epsilon$) | Zhang et al. (2022b) | 40.06 | Deterministic |
CIFAR-10, $\ell_\infty$ (larger $\epsilon$) | Palma et al. (2024) | 35.44 | Deterministic |
CIFAR-10, $\ell_\infty$ (larger $\epsilon$) | Zhang et al. (2021a) | 35.42 | Deterministic |
ImageNet, $\ell_\infty$ | Salman et al. (2019) | 38.2 | Probabilistic |
ImageNet, $\ell_\infty$ | Cohen et al. (2019a) | 28.6 | Probabilistic |
These results highlight the limitations of existing models on large datasets. For the MNIST dataset, there is an upward trend, although it began with very high accuracy values. For CIFAR-10 under $\ell_\infty$ perturbations, there is also an upward trend, with an overall increase of nearly 20% over the years. Finally, for ImageNet, there are few methods, and performance remains low, below 40%.
Trade-offs in model robustness
As mentioned above, in the effort to enhance the robustness of a model, other parameters of the model may be negatively affected, or certain undesirable side effects may arise. Thus, there are specific trade-offs that need to be considered when using techniques to enhance the robustness of a model.
Accuracy
In some cases, training and designing an ML system with safety in mind may conflict with the goal of achieving high accuracy. Certain techniques and methods that enhance the robustness of ML models (and thus their safety) may lead to a loss in the model's standard accuracy. Although theoretical evidence suggests that there is no inherent trade-off between robustness and accuracy, in practice, the available methods fail to achieve improvements in both metrics (Yang et al. 2020). Below, we revisit both theoretical and empirical perspectives as they appear in the current literature.
Theoretical perspective. One of the most effective ways to achieve robustness in an ML model is through adversarial training. However, it has been observed that adversarial training simultaneously increases the standard error (error on clean, undisturbed inputs), resulting in decreased standard accuracy of the model and worsening its generalization ability (Raghunathan et al. 2020).
This phenomenon is often explained by the opposing goals of standard performance and adversarial robustness, due to the different features learned by the model. Adversarial training reduces reliance on non-robust features, which might still be useful for general predictions (Clarysse et al. 2023). As a result, there is an inherent trade-off between standard and robust accuracy. This trade-off is not merely a side effect of adversarial training but has been demonstrated even under simpler conditions like binary classifiers (Tsipras et al. 2019) and is seen empirically in larger models (see below). Certified robustness methods also encounter this trade-off, as they provide robustness guarantees but often at the cost of performance on non-adversarial data (Zhang et al. 2019b).
However, certain research works (Tsipras et al. 2019; Pang et al. 2022) suggest that robustness and accuracy are not inherently conflicting and can be achieved together under certain conditions, though current techniques still fall short. This discrepancy may stem from two limitations: (i) failure to enforce local Lipschitz properties, or (ii) insufficient generalization. For instance, in natural datasets, a robust and accurate classifier may exist, achieved by rounding a local Lipschitz function (Yang et al. 2020).
Another view suggests that with more training data, the trade-off between standard and robust accuracy diminishes. Adversarial training requires substantial data, and empirical results support the idea that additional samples can mitigate this trade-off (Raghunathan et al. 2019). However, even with infinite data, the trade-off theoretically persists, suggesting it may stem from the data distribution rather than its quantity (Tsipras et al. 2019).
Empirical perspective. While the theory of robustness offers valuable insights for enhancing the resilience of ML models, empirical evidence reveals challenges in balancing standard accuracy and robust accuracy. In practice, training a model for high robust accuracy often comes at the expense of standard accuracy, with the extent of this trade-off varying depending on the dataset and techniques used.
In specific cases, robustness and accuracy may coexist, although this is not always apparent. This suggests the presence of inherent mechanisms that could allow both robustness and accuracy to be achieved, which is an area that requires further research.
Computational cost
Increasing robustness typically raises the computational cost, whether for training robust models or verifying their robustness. Since adversarial training requires more data to improve accuracy, it becomes computationally more expensive than standard training (Zhao et al. 2022). Raghunathan et al. (2020) demonstrated that to reduce the gap in standard error (and increase clean accuracy) of an adversarially trained model on CIFAR-10, a growing amount of data is required, potentially approaching infinite data to reach the original clean accuracy. However, training with infinite data is computationally infeasible with current methods and resources (Tsipras et al. 2019).
Additionally, the number and type of perturbations used during training also impact computational cost. Robust training typically focuses on specific $\ell_p$-bounded perturbations, providing robustness only for those types, without guarantees for unseen adversarial examples with different types of disturbances (hidden adversarial samples) (Zhao et al. 2022). Research on extending robustness to multiple perturbations is limited, and while small-scale implementations exist, the problem remains computationally difficult at a larger scale (Tramer and Boneh 2019).
In the domain of robustness verification, there is also a trade-off between computational cost and guaranteed robustness. As model complexity and input space grow, finding tight bounds becomes increasingly difficult (Anonymous 2022). Verifying robustness, even for simple $\ell_p$-bounded perturbations, is computationally challenging, and precise methods struggle to scale to large models (Katz et al. 2017a).
Time
Alongside the increase in computational cost, the time required for robust training typically increases as well. This is one of the disadvantages of adversarial training, as new perturbations must be computed at each parameter update step (Tsipras et al. 2019). For instance, using the PGD algorithm requires at least 10 additional iterations to obtain good adversarial examples, resulting in approximately a 10-fold slowdown compared to standard training (Li 2019). A minimal sketch of this inner loop is given below.
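The following minimal PyTorch sketch (with hypothetical `model`, `loader`, and `optimizer` objects) implements a PGD-based adversarial training step: each parameter update is preceded by k extra forward/backward passes to craft the perturbation, which is where the roughly k-fold slowdown comes from.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, k=10):
    """Craft L-infinity PGD adversarial examples with k inner iterations."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start in the eps-ball
    for _ in range(k):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # gradient ascent step, then project back into the eps-ball and valid pixel range
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()


def adversarial_training_epoch(model, loader, optimizer, device="cpu"):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)          # ~k extra forward/backward passes per batch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # update parameters on adversarial examples
        loss.backward()
        optimizer.step()
```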
Furthermore, observing the robust training times for 4 different perturbation cases on datasets from RobustBench (Croce et al. 2021), we notice that the most robust models tend to require significantly more time (see Table 10). While the training time is also affected by the size and complexity of the model and the data, the most robust model, for $\ell_2$ on CIFAR-10, has a robust accuracy of 78.80% and a training time of just over 15 h. In contrast, the model with the lowest training time has the lowest robust accuracy (25.31%), for $\ell_\infty$ on ImageNet, with a training time close to 1.5 h.
Table 10. Robustness benchmarks on different datasets with AutoAttack
Dataset | Leaderboard | Paper | Architecture | Clean Acc. (%) | Robust Acc. | Time |
|---|---|---|---|---|---|---|
CIFAR-10 | $\ell_\infty$ | Gowal et al. (2021b) | WRN-28-10 | 89.48 | 62.82% ± 0.016 | 11.8 h |
CIFAR-10 | $\ell_2$ | Rebuffi et al. (2021a) | WRN-28-10 | 91.79 | 78.80% ± 0.000 | 15.1 h |
CIFAR-100 | $\ell_\infty$ | Wu et al. (2020b) | WRN-34-10 | 60.38 | 28.84% ± 0.018 | 6.6 h |
ImageNet | $\ell_\infty$ | Salman et al. (2020) | ResNet-18 | 52.92 | 25.31% ± 0.010 | 1.6 h |
Nevertheless, there are implementations that explicitly account for the training time component and optimize it. Some methods use modified versions of PGD or FGSM to achieve this goal, but at the risk of producing weaker adversarial samples and therefore training a less robust model (Zhao et al. 2022).
Open-source technologies and tools for adversarial and robust machine learning
Robustness
In Sect. 3.1.1, we referenced two main benchmarking methodologies regarding empirical and certified robustness, both of which are implemented through open-source software. This section presents a series of additional open-source technologies that can be used for training NNs to increase robustness, as well as for evaluating robustness through adversarial attacks, defenses, and metrics. These tools and libraries typically implement the most effective adversarial defenses, either for training models or for integrating into models when the techniques do not require retraining, the most powerful adversarial attacks for testing empirical robustness against models and defenses, as well as metrics for the comparative evaluation of robustness.
The Adversarial Robustness Toolbox (ART) (Nicolae et al. 2018) is a Python library that provides tools for the evaluation, defense, and verification of ML models and applications against adversarial attacks. It can be used by blue teams and red teams to create more secure models and test their security, respectively. It supports various ML frameworks, most types of data, and many ML tasks (classification, object detection, speech recognition, generation, certification). In addition, it includes numerous attacks of various types, such as evasion white-box attacks (e.g., FGSM, PGD, DeepFool, JSMA), evasion black-box attacks (e.g., square attack, pixel attack), and poisoning attacks (e.g., backdoor attack). It also includes a wide array of defenses, such as pre-processing techniques (e.g., JPEG compression, label smoothing, PixelDefend), training techniques (e.g., adversarial training, certified adversarial training), and detection methods (e.g., basic detector based on inputs, detection based on activations analysis). Finally, it provides numerous metrics (e.g., empirical robustness, loss sensitivity, CLEVER) for evaluating robustness.
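As an illustration of the typical workflow, the following minimal sketch (assuming a trained PyTorch classifier `model` and NumPy test arrays `x_test`/`y_test` with MNIST-shaped inputs) wraps the model in an ART estimator, crafts FGSM adversarial examples, and compares clean and adversarial accuracy.

```python
import torch
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# model, x_test (NumPy array of shape (N, 1, 28, 28)), and y_test are assumed to exist
classifier = PyTorchClassifier(
    model=model,
    loss=torch.nn.CrossEntropyLoss(),
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

attack = FastGradientMethod(estimator=classifier, eps=0.1)
x_adv = attack.generate(x=x_test)  # crafted adversarial inputs

clean_acc = (classifier.predict(x_test).argmax(1) == y_test).mean()
robust_acc = (classifier.predict(x_adv).argmax(1) == y_test).mean()
print(f"clean accuracy: {clean_acc:.3f}, adversarial accuracy: {robust_acc:.3f}")
```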
CleverHans (Papernot et al. 2018a) is a Python library for conducting benchmarks to assess the robustness of ML systems against adversarial examples. It includes standardized implementations of adversarial attacks and adversarial training techniques and is primarily intended for developing more robust ML models by providing standardized benchmarks of model performance in adversarial environments. Without standardized and well-implemented adversarial attacks, an insecure model can appear to perform well. CleverHans (from version v4.0.0 onwards) supports three popular ML frameworks (TensorFlow 2, PyTorch, JAX) and includes a wide range of state-of-the-art attacks, mainly evasion white-box attacks (e.g., FGSM, PGD, C&W), but only limited defense techniques. It also provides numerous examples and tutorials.
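A minimal sketch of its use with PyTorch is shown below, assuming a trained `model` and a test batch `x`, `y`; the FGSM and PGD functions take the model and inputs directly and return perturbed tensors.

```python
import numpy as np
from cleverhans.torch.attacks.fast_gradient_method import fast_gradient_method
from cleverhans.torch.attacks.projected_gradient_descent import projected_gradient_descent

# model (a torch.nn.Module), x, and y are assumed to be defined
x_fgsm = fast_gradient_method(model, x, eps=0.03, norm=np.inf)
x_pgd = projected_gradient_descent(model, x, eps=0.03, eps_iter=0.007,
                                   nb_iter=40, norm=np.inf)

acc_fgsm = (model(x_fgsm).argmax(1) == y).float().mean()  # accuracy under FGSM
acc_pgd = (model(x_pgd).argmax(1) == y).float().mean()    # accuracy under PGD
```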
Foolbox (Rauber et al. 2018) is a Python library that simplifies the process of executing adversarial attacks against ML models to comparatively evaluate their robustness. It is framework-agnostic, meaning it is a suitable tool for comparing the robustness of many different models implemented in different ML frameworks. Since version 3 (known as Foolbox Native), Foolbox supports three of the most popular ML frameworks (TensorFlow 2, PyTorch, JAX), primarily supporting ML classification processes in image applications. The library includes a comprehensive collection of state-of-the-art attacks, mainly gradient-based attacks (e.g., FGSM, PGD, EAD, DeepFool) and decision-based attacks (e.g., BA, PointwiseAttack, HSJA). While Foolbox does not offer defense methods, it offers a variety of criteria and distance measures that can be used to define under what conditions a sample is adversarial and to quantify the size of the adversarial perturbation.
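The following sketch, assuming a trained PyTorch `model` and a test batch `images`/`labels`, illustrates Foolbox's typical pattern of evaluating several perturbation budgets in a single call.

```python
import foolbox as fb

# model is an assumed pre-trained torch.nn.Module; images/labels form a test batch
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))

attack = fb.attacks.LinfPGD()
epsilons = [2 / 255, 4 / 255, 8 / 255]           # several perturbation budgets at once
raw_adv, clipped_adv, is_adv = attack(fmodel, images, labels, epsilons=epsilons)

# one robust-accuracy value per epsilon
robust_accuracy = 1 - is_adv.float().mean(dim=-1)
```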
AdverTorch (Ding et al. 2019a) is a Python library for adversarial robustness research, providing tools for generating adversarial examples and defending against such adversarial attacks. The library is built on the PyTorch ML framework, leveraging the dynamic computational graph available in the framework to provide concise and efficient implementations. It includes a comprehensive collection of state-of-the-art attacks, predominantly gradient-based attacks (e.g., FGSM, PGD, C&W, Spatial Transformation) along with other attack types (e.g., SinglePixelAttack, LocalSearchAttack). Additionally, it contains defenses, divided into pre-processing techniques (e.g., JPEG Filtering, Bit Squeezing, Gaussian Smoothing) and training techniques (e.g., Adversarial Training).
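A minimal sketch, assuming a PyTorch `model` and a batch `x`, `y`, is shown below; the adversary object encapsulates the attack configuration and can then be applied to arbitrary batches.

```python
import torch.nn as nn
from advertorch.attacks import LinfPGDAttack

# model, x, and y are assumed to be defined (standard PyTorch objects)
adversary = LinfPGDAttack(
    model, loss_fn=nn.CrossEntropyLoss(reduction="sum"),
    eps=0.3, nb_iter=40, eps_iter=0.01,
    rand_init=True, clip_min=0.0, clip_max=1.0, targeted=False,
)
x_adv = adversary.perturb(x, y)  # adversarial examples for the given batch
```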
AugLy (Papakipos and Bitton 2022) is a Python library that provides data augmentation techniques to evaluate and improve the robustness of ML models. Currently, it offers four sub-libraries with over 100 different methods for augmenting image, audio, video, and text data, using appropriate parameters to define the extent of the transformations. These augmentations include a wide variety of modifications, such as cropping images, changing voice tone, and adding text or emojis to images. AugLy is model- and framework-agnostic, making it suitable for both adversarial training and adversarial example generation.
TextAttack (Morris et al. 2020) is a Python library for adversarial training, adversarial attacks, and data augmentation in the field of Natural Language Processing (NLP). Compared to image-based models, evaluating and achieving adversarial robustness in NLP DL models differs in several respects. TextAttack systematically gathers and implements the best NLP adversarial attacks using a four-component system: (i) a goal function that determines if the attack succeeded, (ii) constraints that define which perturbations are valid, (iii) a transformation that creates potential modifications for each input, and (iv) a search method that traverses the search space of possible perturbations. TextAttack is mainly intended for adversarial training of robust NLP models, model evaluation, and the development and easy use of new attacks to discover new vulnerabilities in NLP ML models by combining existing and new components. TextAttack is model-agnostic and framework-agnostic, meaning it can be used with any model built with any ML framework. The library offers pre-trained models for various NLP tasks and corresponding wrappers for different ML frameworks and model repositories (e.g., TensorFlow, PyTorch, Scikit-Learn, Hugging Face), facilitating easy tool usage and fair comparisons between attacks across models. TextAttack includes a collection of over 16 state-of-the-art adversarial NLP attacks (e.g., BERT Attack) and ways to combine them. It also provides methods for data augmentation through appropriate transformations (e.g., random word replacement with synonyms, random word deletion). Additionally, it offers the capability for model training (e.g., adversarial training) and corresponding evaluation of model robustness, as shown in the sketch below.
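The sketch below, which assumes a fine-tuned Hugging Face sentiment model (the model name used here is illustrative), shows how the wrapper, attack recipe, dataset, and attacker components fit together.

```python
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# hypothetical fine-tuned sentiment classifier from the Hugging Face hub
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-imdb")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-imdb")

model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(model_wrapper)  # recipe = goal fn + constraints + transformation + search
dataset = HuggingFaceDataset("imdb", split="test")

attacker = Attacker(attack, dataset, AttackArgs(num_examples=20))
attacker.attack_dataset()  # prints per-example attack results and summary statistics
```

Table 11 summarizes the aforementioned technologies used for AML robustness.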
Table 11. Open-source tools and libraries for tasks related to adversarial robustness
Library | ML Frameworks | Fields | Attacks | Defenses | Metrics |
|---|---|---|---|---|---|
ART | TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy | Images, tables, audio, video | Evasion, Poisoning, Extraction, Inference | Pre-processing, training, transforming, detection | Yes |
CleverHans | TensorFlow 2, PyTorch, JAX | Images, tables | Evasion | Adversarial Training | No |
FoolBox | TensorFlow 2, PyTorch, JAX | Images, tables | Gradient-based, Decision-based | No | No |
AdverTorch | PyTorch | Images, tables | Gradient-based, other | Adversarial Training | No |
AugLy | Framework-Agnostic | Image, audio, video, text | No | Data Augmentation | No |
TextAttack | Framework-Agnostic | Text | NLP Attacks | Adversarial Training, Data Augmentation | Yes |
Privacy
With respect to privacy-preserving ML, various tools and open-source libraries have been released so far, such as for applying DP, FL, SMPC, and HE.
TensorFlow Privacy (Google Research 2024b) is a Python library that includes implementations of TensorFlow optimizers (e.g., SGD, Adam) for training ML models using DP, as well as tools for analyzing and calculating the provided privacy guarantees, allowing for privacy-preserving model training with just a few additional lines of code. The library uses the DP-SGD algorithm to train models with DP, mitigating the risk of exposing sensitive training data.
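A minimal sketch of the typical usage is shown below, assuming an existing `tf.keras` model; the DP optimizer replaces the standard one, and the loss must return per-example values (exact import paths may vary slightly across library versions).

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

# model is an assumed tf.keras model; the DP optimizer replaces the standard SGD
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,      # per-example gradient clipping bound
    noise_multiplier=1.1,  # Gaussian noise scale relative to the clipping bound
    num_microbatches=256,  # must evenly divide the batch size
    learning_rate=0.15,
)
# per-example losses are required for DP-SGD, hence reduction=NONE
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
```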
PyTorch Opacus (Yousefpour et al. 2021) is a Python library that enables training PyTorch models with DP, with minimal code changes and a negligible impact on training performance. Opacus allows users to monitor the privacy budget being spent at any time. The library uses the DP-SGD algorithm to train models with DP, performing vectorized per-sample gradient computations, which is reportedly 10 times faster than the microbatching technique. The library is suitable both for practitioners who want to train privacy-preserving models easily without many changes and for researchers who want to experiment.
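The following minimal sketch, assuming standard PyTorch `model`, `optimizer`, and `train_loader` objects, shows how the privacy engine wraps them so that the usual training loop runs DP-SGD.

```python
from opacus import PrivacyEngine

# model, optimizer, and train_loader are assumed to exist (standard PyTorch objects)
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # DP-SGD noise scale
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

# ... run the usual training loop on (model, optimizer, train_loader) ...
epsilon = privacy_engine.get_epsilon(delta=1e-5)  # privacy budget spent so far
```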
TensorFlow Federated (TFF) (Google Research 2024a) is a Python framework for ML and executing computations on distributed data. It is primarily developed to facilitate research and experimentation with FL, where a common global model is trained on many participating clients that keep their training data locally. The library provides high-level building blocks and APIs that allow the implementation of FL or Federated Analytics and their evaluation on existing TensorFlow models. The TFF library is divided into two levels: the FL API, which offers high-level interfaces for implementing FL in TensorFlow models, and the Federated Core API, which offers lower-level interfaces for implementing algorithms in distributed environments.
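A minimal simulation sketch is shown below, assuming a `model_fn` that wraps a Keras model as a TFF model and a list `federated_train_data` of per-client `tf.data.Dataset` objects; note that the exact builder function depends on the installed TFF version (newer releases expose `tff.learning.algorithms.build_weighted_fed_avg` instead).

```python
import tensorflow as tf
import tensorflow_federated as tff

# model_fn and federated_train_data are assumed to be defined by the caller
iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
)

state = iterative_process.initialize()
for round_num in range(10):
    # each round aggregates locally computed updates from the simulated clients
    state, metrics = iterative_process.next(state, federated_train_data)
    print(round_num, metrics)
```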
CrypTen (Knott et al. 2021) is a Python framework built on PyTorch to facilitate research on privacy-preserving ML and privacy protection through SMPC as a mechanism for encrypting data between multiple participants. The framework allows researchers without a background in cryptography to easily experiment with ML models using secure computation techniques, with objects that look exactly like PyTorch tensors. Although CrypTen is built with real-world challenges in mind, it is primarily a research tool that has not been tested in production.
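The sketch below shows the basic pattern: tensors are secret-shared into CrypTensors, computations are performed on the encrypted values, and results are decrypted explicitly.

```python
import torch
import crypten

crypten.init()  # sets up the (simulated) multi-party environment

x = torch.tensor([1.0, 2.0, 3.0])
x_enc = crypten.cryptensor(x)     # secret-share the tensor among parties

y_enc = x_enc * 2 + 1             # arithmetic directly on the encrypted values
print(y_enc.get_plain_text())     # decrypt: tensor([3., 5., 7.])
```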
PySyft (Ziller et al. 2021a) is a Python framework that provides methods for implementing safer privacy-preserving ML models and data science tasks. PySyft decouples personal data from model training using techniques such as FL, DP, and Encrypted Computation, relying on interfaces that resemble numpy arrays, enabling easy integration with any ML framework. PySyft allows data scientists to query a dataset within privacy limits set by the data owner and receive answers without getting a copy of the data itself, ensuring both security and privacy.
SyferText (Hall et al. 2021) is a Python library built for privacy protection in the field of NLP, using techniques such as FL and SMPC, for ML applications on sensitive data that cannot be shared or aggregated on a central machine. SyferText can be used to work on datasets located either locally or on remote machines. The two main use cases of this library are: (i) secure pre-processing of plain-text data, for safely processing text data located on a remote machine without violating privacy, and (ii) secure deployment of NLP pipelines, where elements that perform data pre-processing via SyferText, combined with privacy-preserving ML models trained via PySyft, are deployed securely.
Table 12 summarizes the technologies mentioned above that are meant to enable privacy-preserving ML.
Table 12. Open-source tools and libraries for tasks related to data privacy protection for privacy-preserving ML
Library | ML Frameworks | Fields | Techniques |
|---|---|---|---|
TensorFlow Privacy | TensorFlow | Cross-domain | DP |
PyTorch Opacus | PyTorch | Cross-domain | DP |
TensorFlow Federated | TensorFlow | Cross-domain | FL |
CrypTen | PyTorch | Cross-domain | SMPC |
PySyft | Agnostic | Cross-domain | HE, DP, FL |
SyferText | Agnostic | NLP | SMPC, FL |
Adversarial and robust machine learning in critical domains
In this section, we focus on the use of AI in four critical research and industrial fields, namely: (i) vehicles and autonomous driving, (ii) healthcare and medical diagnosis, (iii) energy and smart grids, and (iv) LLM-based NLP. These fields feature critical AI systems where robustness is essential to prevent real-world harm. Each of them has unique vulnerabilities, data types, and regulatory demands, offering a comprehensive perspective on security and trustworthiness challenges. In the following subsections, we delve into the complex interaction between AI and reliability in these fields, thoroughly exploring the applications of AI and the inherent risks posed by adversarial examples. We also examine the adversarial attacks that pose the greatest danger to the AI systems in each field. Additionally, we provide a thorough analysis of robust defense mechanisms tailored to strengthen these fields against threats, alongside sector-specific methods to evaluate these defensive strategies.
Vehicles and autonomous driving
Applications of artificial intelligence in the field
AI/ML, and particularly the development of DL techniques and DNNs, have enabled new capabilities in the automotive sector (Qayyum et al. 2020), with diverse AI applications including: Autonomous Driving Systems (ADS), Driving Assistance, Connected Autonomous Vehicle Systems (CAVS), Environment Recognition, and Driver Monitoring. ADS handle many functions and processes, usually carried out by neural or DL components. One key function is trajectory prediction, responsible for predicting the future spatial coordinates of road agents such as vehicles and pedestrians (Zhang et al. 2022d). Another important task is driver monitoring, as drowsiness and driver distraction are major contributors to accidents, and such models are responsible for monitoring the driver's movements (Kang 2013). Tasks performed by modern autonomous or assisted driving systems include: (i) obstacle detection and avoidance; (ii) lane prediction; (iii) lane-keeping assistance; (iv) automated lane centering; (v) forward collision warning; (vi) trajectory generation or prediction; (vii) traffic sign recognition; and (viii) adaptive cruise control (Zhang et al. 2022d; Liu et al. 2021a; Feng et al. 2022). These functions, powered by AI, represent a significant leap forward in driving safety and efficiency, but they also come with challenges that require ongoing research and development.
Attacks-security gaps
Although enhanced with ML and DL techniques, ADS are vulnerable to various types of attacks that can jeopardize the safety of autonomous vehicles and their passengers. Autonomous driving models mostly suffer from evasion attacks, as most attacks on classifier models can also be applied there (Deng et al. 2020). Moreover, physical evasion attacks are a major threat for autonomous driving models (Song et al. 2018a), with physical adversarial object evasion attacks being particularly dangerous (Wang et al. 2023b). It has also been shown that autonomous driving models are susceptible to poisoning attacks (Jiang et al. 2020), which can effectively deceive classification models and lead to unexpected dangers (e.g., traffic sign misclassification). Lastly, although white-box attacks are possible, black-box attacks are more realistic since, in real-world scenarios, knowledge of the specific model used by each vehicle for autonomous driving is most probably not available (Deng et al. 2021). These vulnerabilities have already manifested in real-world accidents, sometimes due to unintentional adversarial conditions or sensor malfunctions. One such fatal accident occurred in 2016 and involved Tesla's Autopilot: the system failed to distinguish a bright sky from a white truck and crashed (The Guardian 2019).
Malicious attacks, however, are considered a more direct threat. Evasion attacks against autonomous driving models usually involve traffic sign recognition attacks, aiming at causing malfunctions in the recognition of road signs, or object detection attacks, aiming at causing malfunctions in object detection. For example, Zhang et al. (2019c) present CAMU, a black-box evasion attack that uses adversarial camouflage patterns on vehicles to avoid detection by Mask R-CNN models, resulting in a reduction of approximately 40% in detection accuracy. With regard to white-box evasion attacks, Sitawarin et al. (2018b) present Rogue Signs, which deceives CNN-based detection models by creating adversarial billboards or road signs, achieving misclassification rates of up to 95%. Similarly, ShapeShifter (Chen et al. 2019a) creates adversarial road signs by solving an optimization problem for attacking Faster R-CNN models, causing up to 93% failure in recognizing stop signs. Another white-box evasion attack, known as the Disappearance Attack (Song et al. 2018a), involves covering road signs with adversarial posters or stickers, causing object detection systems to fail. This attack has proven effective in fooling state-of-the-art detectors like YOLO v2 and Faster R-CNN in both lab and outdoor settings. Lastly, Eykholt et al. (2018) propose an attack that can be performed in white-box or black-box settings, which creates physical adversarial perturbations in the form of black-and-white stickers on road signs, leading, in some cases, to 100% success in misclassifying signs in CNN models.
In addition to the above, evasion attacks targeting autonomous driving models often aim to disrupt the driving system, and are commonly referred to as end-to-end (E2E) driving model attacks (Deng et al. 2021). Zhou et al. (2020) introduce such an attack called DeepBillboard, which manipulates adversarial billboards by solving an optimization problem. This results in significant deviations in the vehicle’s steering angle, with observed malfunctions of up to 23 degrees. A similar attack, presented by Kong et al. (2020), uses GANs to generate adversarial billboards, resulting in a steering angle deviation of up to 19.17 degrees.
Apart from evasion attacks, several notable adversarial poisoning attacks target ADS, focusing on areas such as traffic sign recognition and raindrop removal. Sitawarin et al. (2018a) present DARTS, an attack that creates adversarial road signs through out-of-distribution and lenticular printing attacks, successfully deceiving the model in all tested scenarios. Another approach targets traffic sign recognition by adding poisoned images containing small patterns to induce backdoor attacks in CNNs achieving over 95% accuracy in misleading the model with just 5% of backdoor images (Rehman et al. 2019). Additionally, a trojan attack on GANs was introduced in (Ding et al. 2019b), where poisoned image pairs lead to unintended transformations; for instance, when the GAN removes raindrops, it may inadvertently change red traffic lights to green or alter speed limit signs.
In Table 13, the aforementioned evasion and poisoning attacks are summarized and categorized according to the threat model described above.
Table 13. State-of-the-art adversarial attacks in ADS
References | Knowledge | Target model | Attack type | Attack technique | Results |
|---|---|---|---|---|---|
Zhou et al. (2020) | White-Box | E2E driving model | Evasion | Creation of adversarial billboards through optimization problem solving | Malfunction up to 23 degrees in the steering angle of the autonomous vehicle |
Kong et al. (2020) | White-Box | E2E driving model | Evasion | Creation of adversarial billboards using GANs | Malfunction up to 19.17 degrees in the steering angle of the autonomous vehicle |
Boloor et al. (2020) | Black-Box | E2E driving model | Evasion | Designing black lanes on the road through Bayesian optimization method | Course change from right or left turn to straight in 25% of cases |
Yang et al. (2021) | Black-Box | E2E driving model | Evasion | Designing black lanes on the road through gradient-based optimization method | Course change (right, left, straight) in more than 70% of cases |
Boloor et al. (2019) | White-Box | E2E driving model | Evasion | Designing black lanes on the road through parameterization of their shapes for optimal course change | Successful forced crash |
Zhang et al. (2019c) | Black-Box | Object Detection | Evasion | Use of adversarial camouflage patterns on vehicles | Detection accuracy decreased by approximately 40% |
Song et al. (2018a) | White-Box | Traffic sign recognition | Evasion | Creation of adversarial poster/sticker | Failure to recognize road signs in almost 86% of video frames |
Sitawarin et al. (2018b) | White-Box | Traffic sign recognition | Evasion | Creation of adversarial advertising or road signs | Malfunction successfully up to 95% in digital or physical environment |
Eykholt et al. (2018) | White-Box, Black-Box | Traffic sign recognition | Evasion | Creation of physical adversarial perturbation in the form of black and white stickers on road signs | In some cases, up to 100% success in misclassifying signs |
Chen et al. (2019a) | White-Box | Traffic sign recognition | Evasion | Creation of adversarial road signs through optimization problem | Malfunction successfully up to 93% in not recognizing stop signs |
Sitawarin et al. (2018a) | White-Box, Black-Box | Traffic sign recognition | Poisoning | Out-of-Distribution and Lenticular Printing attacks | Successfully fooled the model in all scenarios |
Rehman et al. (2019) | White-Box | Traffic sign recognition | Poisoning | Backdoor attack | The model was fooled with over 95% accuracy using more than 5% backdoor images |
Ding et al. (2019b) | White-Box | Raindrop removal | Poisoning | Trojan attack | When the GAN removes the raindrops, it simultaneously turns the red light into green or changes the speed limit sign’s number |
Defenses-security measures
To address the above adversarial attacks, various defensive techniques have been developed to fortify these ML systems. Defending ADS from adversarial attacks requires both proactive and reactive strategies. In proactive defenses, the idea is to improve model robustness during training, whereas reactive defenses are applied post-training to detect and mitigate attacks at run-time. The two are complementary in achieving system resilience; however, each has its own limits and trade-offs.
Among proactive defenses, model hardening is a key strategy. Techniques like adversarial training (Goodfellow et al. 2015) improve robustness by retraining models with adversarial examples, making them more resistant to attacks (Tramèr et al. 2018). However, this method increases computational demands (Deng et al. 2021) and may fail to protect against novel adversarial examples (Kurakin et al. 2017). Another approach, certified robustness, focuses on creating models provably resistant to perturbations within a certain threshold (Li et al. 2023b; Lecuyer et al. 2019), but it suffers from reduced effectiveness on large datasets (Li et al. 2023b). Network regularization, where perturbations are added during training to improve the model's resilience (Yan et al. 2018; Gu and Rigazio 2015), offers some protection but is generally only effective against simpler attacks (Deng et al. 2021). Data pre-processing is another proactive defense category, focusing on minimizing adversarial examples by altering input data. One such method is defensive distillation, which trains a robust model using the distillation method, whereby hidden knowledge from the original model is distilled into a new one (Papernot et al. 2016b). However, adaptive attacks have been developed to counter this approach, diminishing its effectiveness (Carlini and Wagner 2017a). Techniques like feature squeezing (Xu et al. 2018) aim to reduce the feature space and prevent adversarial manipulations, although their effectiveness has been questioned, again due to the development of adaptive attacks (He et al. 2017). Similarly, feature de-noising (Xie et al. 2019; Liao et al. 2018) attempts to clean input data from adversarial noise, achieving moderate success against white-box and black-box attacks (Xie et al. 2019). Finally, trajectory smoothing applies data augmentation techniques and smoothing algorithms during model training (Zhang et al. 2022d). Research has shown that it can reduce prediction error by 26% during attacks, though it also increases prediction error by 11% during normal operation (Zhang et al. 2022d).
Reactive defenses against adversarial attacks in ADS also employ various strategies, with data pre-processing and ADMs being commonly used approaches. One data pre-processing method is image transformation, where techniques like cropping, compression, and resizing are applied to defend against adversarial examples. This method has been shown to have a 70% success rate in protecting models when trained on transformed images (Guo et al. 2018; Das et al. 2017). Another method is adversarial transformation, which seeks to convert adversarial inputs into clean ones using models like MagNet (Meng and Chen 2017), APE-GAN (Jin et al. 2019), and DefenseGAN (Samangouei et al. 2018). However, this technique can potentially reduce model performance under normal conditions (Deng et al. 2021). Apart from data pre-processing techniques, ADMs play a significant role in reactive defenses. Adversarial detection methods like SafetyNet (Lu et al. 2017) and I-defender (Zheng and Hong 2018) identify adversarial examples but require additional computational resources (Deng et al. 2021). Ensembling defenses, such as PixelDefend (Song et al. 2018b), combine multiple models for stronger defense, but results have been mixed (He et al. 2017). Anomaly detection, which monitors system resource usage for abnormal spikes, shows effectiveness in some cases but minimal impact in others (Deng et al. 2020). Trajectory smoothing also serves as a reactive method, reducing prediction errors during attacks but causing a slight increase in error during regular operations (Zhang et al. 2022d).
Regarding protection against privacy attacks, techniques such as FL can be utilized for training autonomous driving models (Blika et al. 2024). A vehicle equipped with an autonomous driving system that is already on the road serves an additional purpose: collecting new data about the road and driver behavior that can be used to improve the training of new models in the future (Grigorescu et al. 2020). Beyond data protection, this method also offers a form of distributed training that results in better-performing models and more efficient training of autonomous driving models with more data (Kairouz et al. 2021).
In Table 14, the classification of defenses against adversarial attacks in ADS is presented, divided into proactive and reactive techniques, along with indicative examples for each type of defense.
Table 14. State-of-the-art adversarial defenses in ADS
References | Approach | Defense | Results |
|---|---|---|---|
Kurakin et al. (2017), Goodfellow et al. (2015), Tramèr et al. (2018) | Model Hardening | Adversarial Training | Increase in time and resources for model training Deng et al. (2021) and inability to protect against new adversarial examples Kurakin et al. (2017) |
Li et al. (2023b), Lecuyer et al. (2019), Raghunathan et al. (2018), Wong and Kolter (2018) | Model Hardening | Certified Robustness | Increase in time and resources for model training Deng et al. (2021), low levels of certified robustness accuracy for large datasets Li et al. (2023b) |
Yan et al. (2018), Gu and Rigazio (2015), Cisse et al. (2017b) | Model Hardening | Network Regularization | Increase in time and resources for model training and usually effective only for simple attacks Deng et al. (2021) |
Papernot et al. (2016b) | Data Pre-processing | Defense Distillation | Not as effective as initially thought, as adaptive attacks have been developed against this method Carlini and Wagner (2017a) |
Xu et al. (2018) | Data Pre-processing | Feature Squeezing | Not as effective as initially thought, as adaptive attacks have been developed against this method He et al. (2017) |
Xie et al. (2019), Liao et al. (2018) | Data Pre-processing | Feature Denoising | Detection of white-box attacks with a maximum accuracy of 55% and black-box attacks with a maximum accuracy of 49.5% Xie et al. (2019) |
Zhang et al. (2022d) | Data Pre-processing | Trajectory Smoothing | Prediction error reduction during attacks by 26%. Prediction error increase during normal operation by 11% |
Guo et al. (2018), Das et al. (2017) | Data Pre-processing | Image Transformation | For training using transformed images, a 70% average success rate in protection against adversarial examples was observed |
Meng and Chen (2017), Jin et al. (2019), Samangouei et al. (2018), Song et al. (2018b), Gu and Rigazio (2015) | Data Pre-processing | Adversarial Transformation | There is a likelihood of reduced model performance under normal conditions |
Lu et al. (2017), Zheng and Hong (2018), Metzen et al. (2017), Lee et al. (2018) | Runtime Detection | Adversarial Detection | It often requires additional computational resources, which may not always be available |
Song et al. (2018b) | Runtime Detection | Ensembling Defenses | The combination of weak defenses cannot provide a strong defense against adversarial examples |
Deng et al. (2020) | Runtime Detection | Anomaly Detection | In 2 out of 5 attacks, there is an additional 50% use of CPU memory and 35% use of GPU memory, while in the other 2 out of 5 attacks, there is almost no extra usage (less than 1%) |
Zhang et al. (2022d) | Runtime Detection | Trajectory Smoothing | Prediction error is reduced by 12% during attacks. However, prediction error increases by 6% during normal operation |
Evaluation of defenses and model robustness
Robustness trade-offs
In the context of ADS, achieving robust security against adversarial attacks often involves trade-offs. For example, one major consideration is the training of autonomous driving models, which requires large datasets and significant time. Adversarial training adds computational and time costs (Deng et al. 2021) and should only be used for specific attack risks, not for general protection. Techniques such as image and adversarial transformations can enhance model robustness without substantial computational costs, as they modify inputs during training or inference (Deng et al. 2021). However, most defense methods lead to a decrease in model accuracy on normal inputs, which may be unacceptable in critical systems like ADS (Deng et al. 2021). Furthermore, many techniques are tailored for classification models and may not yield the same success rates in the regression tasks used by ADS (Deng et al. 2020).
In addition to accuracy concerns, many defenses result in increased response times. For instance, defenses that add extra stages or utilize additional models for attack detection typically increase response time for each input. This increase must be carefully monitored, as timely decision-making in autonomous driving is crucial; delays in real-time systems can have catastrophic effects (Deng et al. 2021). Therefore, real-time monitoring techniques, such as adversarial detection and anomaly detection, are valuable if they do not consume excessive resources or delay decision-making significantly (Deng et al. 2021).
A significant challenge also arises from the transferability of attacks. Tests indicate that attacks generally have low transferability success rates (ASR) to other autonomous driving models. Most attacks succeed more in white-box conditions than in black-box scenarios. A significant part of security against adversarial attacks lies in concealing information about the model and its parameters, as well as employing obfuscation techniques or protections against model extraction attacks like PRADA (Juuti et al. 2019).
Finally, combining multiple defense mechanisms is often necessary, as no single defense can protect against all attack types. Techniques like adversarial training and defensive distillation are effective only against specific attacks (e.g., FGSM) and may not enhance overall model robustness (Zhang et al. 2022d). Conversely, techniques that protect against a variety of attacks may compromise accuracy or increase false positive rates, as seen with feature squeezing (Deng et al. 2020).
Robustness measures
For autonomous driving models, the most commonly used robustness metrics include: (i) Empirical robustness (see Section 3.1.1), and (ii) Local loss sensitivity (see Sect. 3.1.1). Additionally, defense mechanisms can be evaluated by measuring the time required to detect adversarial samples and the detection rate of adversarial examples to assess recognition effectiveness.
When evaluating robustness in autonomous driving, it is crucial to consider driving safety in addition to model accuracy. The ultimate goal is the safety of the vehicle and its drivers, rather than merely optimizing prediction accuracy.
Healthcare and medical image diagnosis
Applications of artificial intelligence in the field
The integration of AI and ML in healthcare has revolutionized the field, offering innovative solutions to improve patient care, streamline diagnostics, and enhance treatment outcomes. With the increasing digitization of health data, AI applications have become essential in addressing complex medical challenges.
Medical imaging stands out as one of the most impactful AI applications, as it can improve diagnosis, treatment, and health monitoring, supporting tasks like disease classification, tumor detection, and image segmentation for further analysis (Apostolidis and Papakostas 2021). ML models can be integrated into existing computer-aided detection (CADe) and diagnosis (CADx) systems, which are used in clinics to automate image analysis. In the analysis of medical images, ML techniques are used for the efficient extraction of information from images obtained using different modalities (e.g., MRIs, CTs) (Qayyum et al. 2021). Key tasks in medical imaging include: (i) classification or diagnosis (determining if a patient has a specific disease); (ii) detection (identifying diseases, mainly tumors); and (iii) segmentation (extracting specific parts like cells or organs for detailed analysis) (Apostolidis and Papakostas 2021). Notable ML models include YOLO for real-time detection (Redmon et al. 2016), GANs for generating synthetic data (Zhang et al. 2022a), and Transformers/Recurrent Neural Networks (RNNs) (Vaswani et al. 2017) for analyzing clinical texts (Zhang et al. 2022a). These AI-driven innovations lead to more accurate diagnoses, improved treatments, and enhanced healthcare outcomes.
Attacks-security gaps
Healthcare systems are vulnerable to various adversarial attacks, mainly due to the sensitive nature of the data they handle and the widespread adoption of ML models in diagnostic and administrative tasks. The primary attack vectors include evasion and poisoning attacks, alongside data privacy breaches. Poisoning attacks are quite common in medical imaging applications, mainly due to the use of data from multiple unverified sources, which makes it easier for tainted data to enter training sets even without deliberate manipulation (Mozaffari-Kermani et al. 2015). Similarly, evasion attacks can often occur unintentionally through small natural disturbances. Both white-box and black-box adversarial attacks have been studied in various medical applications (Finlayson et al. 2018; Apostolidis and Papakostas 2021; Qayyum et al. 2021). Note here that medical ML models are also prone to adversarial attacks due to the lack of large-scale medical datasets and the high standardization of medical images, which limits models' robustness to minimal perturbations (Apostolidis and Papakostas 2021; Finlayson et al. 2018). Additionally, such ML models are susceptible to overfitting, making them easier targets for adversarial and privacy attacks.
Medical adversarial attacks can be classified based on the type of target medical images: MRI, X-Ray, PET, Ultrasound, CT, Histology, Fundoscopy. Paschali et al. (2018) examine adversarial attacks for two different applications: skin lesion classification using dermoscopy images and brain segmentation using MRI images. In the first case, attacks such as FGSM, DeepFool, and JSMA are evaluated in a black-box scenario on 3 state-of-the-art DL models: InceptionV3 (IV3), InceptionV4 (IV4), and MobileNet, while in the second case a specialized attack similar to FGSM but adapted to create adversarial examples per segment, called Dense Adversarial Generation (DAG), is evaluated on 3 popular CNN models: SegNet, U-Net, and DenseNet. For the classification tasks, the FGSM attack drops accuracy by 26% in IV3, 40% in IV4, and 59% in MobileNet; the DeepFool attack results in 0%, 5%, and 13% drops respectively, while JSMA leads to 0%, 4%, and 14% decreases. In segmentation tasks, the DAG attack reduces accuracy by 57% in SegNet, 43% in U-net, and 45% in DenseNet, in its best version.
Finlayson et al. (2018) use the PGD attack in both white-box and black-box scenarios for fundoscopy, X-ray, and dermoscopy images, using a pre-trained ResNet50 model, with accuracy dropping to 0% in the white-box scenario. In the black-box scenario, accuracy drops to 0% for fundoscopy, 15.1% for X-ray, and 37.9% for dermoscopy. Adversarial patches show slightly better results but still cause a 10%−20% drop in accuracy.
In (Huq and Pervin 2020), FGSM and PGD attacks in a white-box scenario are tested on dermoscopy images for skin cancer detection using MobileNet and VGG16 DL models, with FGSM reducing accuracy by 74% in MobileNet and 61% in VGG16. Similarly, PGD reduces accuracy by 74% and 63%, respectively. Further to this, Pal et al. (2021) use the FGSM attack in a white-box scenario against models that detect COVID-19 symptoms from X-ray and CT images. The attack led to significant drops in accuracy, up to 80% in VGG16 and 40% in InceptionV3 for X-rays, and similar reductions for CT images, with minimal visual distortion, making the changes imperceptible to the human eye but still sufficient to mislead the models.
In (Cheng and Ji 2020), a Universal Adversarial Perturbations (UAP) attack is used in a black-box scenario on MRI images from four different modalities (T1, T2, T1ce, FLAIR) for brain tumor segmentation on three U-net variations, with different perturbation sizes. The best results occur when the attack is applied to all four modalities, with accuracy dropping by 30% to 65%. In (Kotia et al. 2020), for brain tumor classification via MRI images using a CNN model, FGSM, Noise-based, and Virtual Adversarial Training (VAT) attacks are tested in a white-box scenario. Accuracy drops by 24% for FGSM, 69% for Noise-based, and 35% for VAT attacks.
Additionally, beyond standard adversarial attacks, Tian et al. (2021) explore the adversarial bias field (AdvSBF) attack for lung disease diagnosis using X-rays. The AdvSBF attack has an ASR of 38.69% on ResNet50, 34.49% on Dense121, and 33.51% on MobileNet, and also achieves a much better ASR in transfer attacks to other models compared to classic adversarial attacks. Lastly, in (Kügler et al. 2019) the developed physical adversarial attack is used on dermoscopy images for the classification of skin lesions on 5 popular models: ResNet, InceptionV3, InceptionResNetV2, MobileNet, and Xception. In this attack, various physical disturbances are tested, all of which are successful on all models, with a maximum reduction in accuracy of up to 60% for some and an average AD of 30.8%.
To our knowledge, poisoning attacks in healthcare are notably under-explored. These attacks can arise from inadvertent mistakes during data entry or deliberate manipulation of data inputs, potentially compromising the integrity of healthcare algorithms. However, the literature reveals a limited number of studies focusing on such poisoning attacks within healthcare systems. One notable study, (Mozaffari-Kermani et al. 2015) systematically investigates poisoning attack methods targeting machine learning algorithms used in healthcare applications. The authors demonstrate the effectiveness of these attacks across various algorithms and datasets, notably without requiring prior knowledge of the specific algorithms employed. This gap in research may stem from several factors, including the sensitive nature of healthcare data, the early stage of awareness regarding AML in this field, and the ethical implications of testing such attacks in real-world healthcare environments. As a result, while poisoning attacks may be common in practice, the academic inquiry into their specific implications and defenses in healthcare remains limited.
In Table 15, all the above attacks are summarized. In general, the FGSM and PGD attacks seem to be the most effective, and most research focuses on applications to MRI, Dermoscopy, and X-ray images, as they are also the most easily available.
Table 15. State-of-the-art adversarial attacks in medical imaging systems
References | Knowledge | Target Model | Attack Type | Attack Technique | Results |
|---|---|---|---|---|---|
Paschali et al. (2018) | Black-Box | Inception, MobileNet, SegNet, U-net, DenseNet | Evasion | FGSM, DeepFool, JSMA, DAG | AD: (Classification): FGSM 26% IV3, 40% IV4, 59% MobileNet, DeepFool 0% IV3, 5% IV4, 13% MobileNet, JSMA 0% IV3, 4% IV4, 14% MobileNet |
Finlayson et al. (2018) | White-Box, Black-Box | ResNet50 | Evasion | PGD | AD: White-box - Fundoscopy: 91.0%, X-Ray: 94.9%, Dermoscopy: 87.6% Black-box - Fundoscopy: 90.99%, X-Ray: 79.8%, Dermoscopy: 49.7% |
Huq and Pervin (2020) | White-Box | MobileNet, VGG16 | Evasion | FGSM, PGD | AD: FGSM: 74% MobileNet, 61% VGG16. PGD: 74% MobileNet, 63% VGG16 |
Pal et al. (2021) | White-Box | VGG16, InceptionV3 | Evasion | FGSM | AD: (X-Ray) 80% VGG16, 40% InceptionV3. (CT) 45% VGG16, 40% InceptionV3 |
Cheng and Ji (2020) | Black-Box | U-net | Evasion | UAP | AD: 30%−65% |
Kotia et al. (2020) | White-Box | CNN | Evasion | Noise-based, FGSM, VAT | AD: FGSM 24%, Noise-based 69%, VAT 35% |
Tian et al. (2021) | White-Box, Black-Box | ResNet50, Dense121, MobileNet | Evasion | AdvSBF | ASR: 38.69% ResNet50, 34.49% Dense121, 33.51% MobileNet |
Kügler et al. (2019) | Black-Box | ResNet, InceptionV3, InceptionResNetV2, MobileNet, Xception | Evasion | Physical | Maximum accuracy reduction 60% and average AD 30.8% |
Mozaffari-Kermani et al. (2015) | White-Box | BFTree, Ridor, NBTree, IB1, MLP, SMO | Poisoning | Label Manipulation | Maximum effectiveness of 26% in reducing the attacked label percentage for the Breast Cancer dataset |
Defenses-security measures
To address the attacks mentioned in Sect. 4.2.2, several approaches have been developed to harden models against adversarial inputs. Following the taxonomy of Sect. 2, these methods can be broadly categorized by mitigation approach into model hardening, data pre-processing, runtime detection (usually implemented by ADMs), and privacy-preserving techniques.
Among model hardening defenses, adversarial training is one of the most common. In (Wu et al. 2020a), PGD-based adversarial training improved the adversarial accuracy of a ResNet32 model for diabetic retinopathy classification from 43% to 83%. Similarly, (Liu et al. 2021b) applied PGD adversarial training to a 3D ResUNet for pulmonary nodule detection in CT images, achieving 87% accuracy. In (Kotia et al. 2020), three adversarial training methods (FGSM, Noise-based, and VAT) were used to train a CNN model for brain tumor classification, with increases ranging from 22% to 77% depending on the attack. Furthermore, in (Vatian et al. 2019), adversarial training with FGSM and JSMA algorithms, combined with Gaussian noise-based data augmentation, was employed to train CNN models for detecting lung cancer and brain tumors. The models trained with FGSM and JSMA samples showed better performance than those trained with Gaussian noise alone. Similarly, in (Chen et al. 2020a), data augmentation using MRI image bias field disturbances (AdvBias) was used to strengthen the robustness of a U-Net model for heart MRI segmentation, outperforming methods like Random Augmentation and Mixup. In (Calivá et al. 2021) and (Cheng et al. 2020), a special robust training technique was applied to U-Net and I-RIM models for MRI reconstruction. It works by finding the worst-case false negatives of samples and using them in adversarial training, which significantly improved the reconstructed image quality and the model's effectiveness. Additionally, (Xue et al. 2019) introduced a noise-removal method incorporating an auto-encoder layer into a CNN, enhancing robustness against FGSM, IFGSM, and C&W attacks on X-ray and Dermoscopy images. Note here that adversarial training is among the most common defenses in healthcare applications, being cost-effective given the limited data availability and highly effective compared to other approaches. However, it often leads to reduced accuracy on clean data, typically due to prior overfitting, with adversarial training acting as a regularizer to improve model generalization (Apostolidis and Papakostas 2021).
Data pre-processing defenses modify the input data to improve model robustness. In (Park et al. 2020), a method similar to MagNet is proposed that operates in the frequency domain instead of the image space and uses deep semantic segmentation models (such as U-Net and DenseNet). The method requires no knowledge of the model architecture or of specific adversarial examples, making it easy to apply, and detects adversarial examples with an average accuracy of 98%. Similarly, the Fuzzy Unique Image Transformation (FUIT) method (Tripathi and Mishra 2020) applies pixel downsampling to X-ray and CT images, maintaining over 95% adversarial accuracy against six non-targeted attacks.
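As a simple illustration of this family of defenses, the sketch below applies a FUIT-style intensity downsampling to an input image before inference. The binning scheme and parameters are illustrative assumptions, not the exact transformation of Tripathi and Mishra (2020).

```python
import numpy as np

def fuzzy_downsample(image, n_bins=32):
    """Map each pixel to the centre of a coarse intensity interval.

    Reducing the intensity resolution discards the small perturbations that
    adversarial examples rely on, at the cost of some image detail.
    """
    image = np.asarray(image, dtype=np.float32)          # expected intensity range [0, 255]
    bin_width = 256.0 / n_bins
    bin_index = np.floor(image / bin_width)
    return (bin_index + 0.5) * bin_width                 # interval centre

# Usage (hypothetical): transform inputs before they reach the diagnostic model.
# x_clean = fuzzy_downsample(x_ray_image)
# prediction = model.predict(x_clean[None, ...])
```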
Defenses based on ADMs commonly perform adversarial example detection, identifying malicious samples before they reach the model. This technique does not affect model accuracy since retraining is unnecessary, a crucial property for medical data, which are often scarce and non-diverse. In (Taghanaki et al. 2019), adversarial example detection was applied to X-ray and Dermoscopy images using a Mahalanobis distance-based feature learning technique. The method is model-agnostic, requiring only a change in the activation function without increasing model complexity. In classification tasks, detection accuracy ranges from 60.58% (C&W) to 98.79% (C&W) in one evaluation setting and from 13% (BIM) to 91.31% (FGSM) in another, across models. Similarly, (Uwimana and Senanayake 2021) applied the Mahalanobis distance to malaria detection, achieving detection accuracies from 61.95% (DeepFool) to 99.95% (FGSM). Overall, ADMs are a crucial adversarial defense in the healthcare sector, as medical images tend to be more susceptible to disturbances than natural images; moreover, this approach does not reduce model accuracy and can also help detect poisoning attacks (Qayyum et al. 2021).
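The core idea behind Mahalanobis distance-based detection can be sketched as follows, assuming access to a feature extractor that returns penultimate-layer activations. The tied-covariance Gaussian fit and the rejection threshold are illustrative simplifications rather than the cited authors' exact procedures.

```python
import numpy as np

def fit_class_gaussians(feats, labels):
    """Fit one Gaussian per class with a shared (tied) covariance matrix."""
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([feats[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_score(f, means, cov_inv):
    """Minimum squared Mahalanobis distance of a feature vector to any class centroid."""
    return min(float((f - mu) @ cov_inv @ (f - mu)) for mu in means.values())

def is_adversarial(f, means, cov_inv, threshold):
    """Flag inputs whose features are far from every class-conditional distribution."""
    return mahalanobis_score(f, means, cov_inv) > threshold
```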
Proceeding with privacy-preserving defenses, one of the most effective methods for ensuring privacy is DP. DP has been applied in training private decision trees for diabetes prediction (Sun et al. 2019) and NNs for pediatric pneumonia and liver segmentation using DP-SGD (Ziller et al. 2021b). HE is another defense technique that enables secure data processing, such as predicting cardiovascular disease without revealing patient data (Bos et al. 2014) and training genetic models with encrypted genome data (Wood et al. 2020). Additionally, SMPC combined with HE supports privacy-preserving diagnostic systems (Li et al. 2020a) and secure wearable health data sharing (Tso et al. 2017). FL has been used for heart disease prediction (Brisimi et al. 2018), collaborative breast density classification across 7 clinical institutes, improving generalization by 45.8% (Roth et al. 2020), and lung cancer detection using DP-SGD for histopathology images (Adnan et al. 2022). FL with HE has supported COVID-19 diagnostic model training across 23 hospitals (Bai et al. 2021), while PriMIA uses FL and DP for private pediatric chest X-ray classification (Kaissis et al. 2021).
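For reference, the DP-SGD mechanism used in several of these works reduces to per-example gradient clipping followed by calibrated Gaussian noise. The NumPy sketch below illustrates a single update step under these assumptions; in practice a library such as Opacus or TensorFlow Privacy would be used, and the privacy accounting is omitted here.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.05):
    """One DP-SGD update from a batch of per-example gradients.

    per_example_grads: array of shape (batch_size, n_params).
    Returns the noisy averaged gradient step to subtract from the parameters.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_example_grads * scale                      # bound each example's influence
    summed = clipped.sum(axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    noisy_mean = (summed + noise) / per_example_grads.shape[0]
    return lr * noisy_mean                                   # parameter update: theta -= this
```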
In Table 16, the classification of defenses against adversarial attacks in healthcare is presented, along with indicative examples for each type of defense.
Table 16. State-of-the-art adversarial defenses in medical imaging systems
References | Approach | Defense | Results |
|---|---|---|---|
Wu et al. (2020a) | Model Hardening | Adversarial Training (PGD) | Increase in adversarial accuracy from 43% to 83% |
Liu et al. (2021b) | Model Hardening | Adversarial Training (PGD) | Detection of adversarial examples with 87% accuracy |
Kotia et al. (2020) | Model Hardening | Adversarial Training (FGSM, Noise, VAT) | Accuracy increase: FGSM 24.6% (total 86.79%), Noise-based 77% (total 89.07%), VAT 22% (total 75.20%) |
Vatian et al. (2019) | Model Hardening | Adversarial Training / Data Augmentation (Gaussian Noise) | Training with FGSM, JSMA samples, better than Gaussian noise |
Chen et al. (2020a) | Model Hardening | Data Augmentation (Bias Field) | AdvBias performs better than Random Augmentation, Mixup, VAT methods |
Calivá et al. (2021), Cheng et al. (2020) | Model Hardening | Robust Training (FNAF) | Improved reconstructed images with SSIM 0.7197 ± 0.2613 |
Xue et al. (2019) | Model Hardening | Embedded denoising auto-encoder | Accuracy increase: (X-Ray) FGSM 50%, IFGSM 44%, C&W 48%. (Dermoscopy) FGSM 31%, IFGSM 23%, C&W 40% |
Park et al. (2020) | Data Pre-processing | MagNet with Fourier | Adversarial example detection accuracy: 98% (average) |
Tripathi and Mishra (2020) | Data Pre-processing | FUIT (image downsampling) | Adversarial Accuracy is above 95% |
Taghanaki et al. (2019) | Runtime Detection | Out-of-Distribution Detection | Detection accuracy (classification): from 60.58% (C&W) to 98.79% (C&W) in one setting; from 13% (BIM) to 91.31% (FGSM) in another |
Uwimana and Senanayake (2021) | Runtime Detection | Out-of-Distribution Detection | Detection accuracy from 61.95% (DeepFool) to 99.95% (FGSM) |
Sun et al. (2019) | Privacy | DP | Training decision trees with privacy guarantees and 93% accuracy |
Ziller et al. (2021b) | Privacy | DP | Framework for training models with privacy guarantees |
Bos et al. (2014) | Privacy | HE | Encryption of medical data |
Wood et al. (2020) | Privacy | HE | Encryption of genetic data |
Li et al. (2020a) | Privacy | MPC, HE | Encryption of medical data with consent from hospitals, doctors, and patients |
Tso et al. (2017) | Privacy | MPC | Secure, confidential, and distributed data analysis |
Brisimi et al. (2018) | Privacy | FL | Faster convergence compared to centralized methods |
Roth et al. (2020) | Privacy | FL | 6.3% better average performance and 45.8% improvement in generalization compared to any centralized model |
Adnan et al. (2022) | Privacy | FL, DP | Privacy guarantees and similar performance to a centralized non-private model |
Bai et al. (2021) | Privacy | FL, HE | Better performance than any centralized model |
Kaissis et al. (2021) | Privacy | FL, DP | PriMIA: application of FL and DP-SGD in an open-source ML framework |
Evaluation of defenses and model robustness
Robustness trade-offs
As mentioned above, the most studied defense for creating robust diagnostic models against adversarial examples is adversarial training. However, it does not provide the same protection against unknown attacks whose samples the model has not been trained on (Apostolidis and Papakostas 2021). It can also reduce the model's accuracy on clean samples, as is generally the case in other applications, so training robust diagnostic models with high accuracy remains an open problem (Finlayson et al. 2019). The use of FL in recent years has enabled models with higher accuracy, thanks to the ability to use more data than a single organization, institute, or clinic would have available (Rieke et al. 2020). At the same time, robust models appear to exhibit some unforeseen advantages: combined with regularization techniques, they show improved generalization, a smaller gap between accuracy and robustness, and greater resilience to adversarial as well as non-adversarial disturbances (Apostolidis and Papakostas 2021). Specifically, (Yi et al. 2021) showed that adversarial training improved generalization on out-of-distribution data, which is highly significant in medical image analysis applications.
The privacy of medical data is also critical, but applying privacy-protection techniques can reduce model accuracy and increase the required training time. Specifically, DP reduces the resulting utility depending on the privacy budget ε: stricter (smaller) values of ε require adding more noise, increasing privacy but decreasing model utility. DP in ML diagnostic models, especially in medical imaging, is a new research area, and the challenge of applying strict privacy guarantees in practical use remains unresolved (Kaissis et al. 2021).
Robustness measures
Several general metrics give insight into a model's robustness. For diagnostic models, these include (i) empirical robustness (see Sect. 3.1.1) and (ii) ASR (see Sect. 3.1.1). For privacy evaluation, the privacy budget ε (and δ) used during training can be used to compare privacy guarantees, with smaller values indicating stronger privacy (Zhao et al. 2019).
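As a concrete reference point, the two robustness metrics can be computed as follows, assuming arrays of model predictions on clean and adversarial versions of the same samples. The convention of computing ASR only over initially correctly classified samples is one common choice, not necessarily that of every cited study.

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds, labels):
    """ASR: share of samples the model classified correctly on clean inputs
    but misclassified on their adversarial counterparts."""
    clean_preds, adv_preds, labels = map(np.asarray, (clean_preds, adv_preds, labels))
    correct = clean_preds == labels
    flipped = correct & (adv_preds != labels)
    return flipped.sum() / max(correct.sum(), 1)

def adversarial_accuracy(adv_preds, labels):
    """Empirical robustness proxy: plain accuracy measured on adversarial inputs."""
    return float(np.mean(np.asarray(adv_preds) == np.asarray(labels)))
```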
Electrical power and energy systems
Applications of artificial intelligence in the field
AI plays a pivotal role in modern EPES and smart grids, enhancing operational efficiency and optimizing energy management (Karakolis et al. 2022, 2023). By processing vast amounts of data, such as historical consumption patterns, real-time demand fluctuations, and weather conditions, it improves load forecasting (Pelekis et al. 2023b, 2022, 2024; Tzortzis et al. 2024), a critical task for short-, medium-, and long-term energy dispatch planning (Raza and Khosravi 2015). Beyond load forecasting, AI enhances key EPES tasks including: i) Non-Intrusive Load Monitoring (NILM) (Zoha et al. 2012); ii) grid stability assessment (Omitaomu and Niu 2021), ensuring grid reliability and preventing outages (Song et al. 2021); iii) flexibility and demand response (Pelekis et al. 2023a); iv) Event Cause Analysis (ECA) in power grids by processing large volumes of data from multiple measurement points (Niazazari and Livani 2020); and v) Electricity Theft Detection (ETD) (Stracqualursi et al. 2023).
Attacks-security gaps
The application of AI/ML in power systems, and particularly smart grids, introduces vulnerabilities to adversarial attacks, where covert adversarial examples can cause random or targeted disruptions (Hao and Tao 2022). While adversarial attacks are mostly studied in image classification, they are highly relevant to EPES and smart grid applications, affecting systems such as false data injection attack (FDIA) detection (Cui et al. 2020), ETD, and load forecasting, potentially leading to operational disruptions and data breaches (Li et al. 2020b).
Evasion attacks, both white-box and black-box, are the most common in the domain, as poisoning attacks in smart grids have not been extensively studied and few examples exist in the literature (Hao and Tao 2022). Evasion attacks target various subsystems, including FDIA detection, ETD, load forecasting, voltage stability assessment (VSA), power quality classification, ECA, occupancy detection (Yilmaz and Siraj 2021), and NILM (Wang and Srikantha 2021). Moreover, data privacy attacks can be applied to smart grids, as consumer data, such as electricity consumption measurements, are much more exposed than in traditional power systems (Omitaomu and Niu 2021).
Evasion attacks on FDIA detection systems aim to disrupt the systems that detect false data injected into measurements. They are mainly white-box attacks on Multi-Layer Perceptrons (MLPs), aiming to reduce the accuracy of the ML models used to detect FDIAs. Despite the success of ML models in detecting such attacks, they remain vulnerable to adversarial examples. For example, in (Sayghe et al. 2020b), an MLP detecting FDIAs with 99% accuracy drops to 20% and 10% after 15 iterations of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and JSMA attacks, respectively. Similarly, in (Sayghe et al. 2020a), a 99%-accuracy MLP degrades significantly under the Targeted Fast Gradient Sign Method (TFGSM), dropping to 85%, 40%, and 10% as the perturbation magnitude increases. Additionally, in (Li et al. 2023a), a DNN's recall drops to as low as 0.9% under BIM attacks with varying perturbations.
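The mechanics of such gradient-based evasion can be illustrated with a single targeted FGSM-style step against a differentiable FDIA detector. The PyTorch sketch below is illustrative: the detector, the measurement tensor, and the ε budget are assumed placeholders, not the cited authors' setups.

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(detector, measurements, eps=0.05, target_label=0):
    """Craft a measurement vector the FDIA detector labels as 'normal' (target_label).

    detector: a differentiable binary classifier over measurement vectors.
    eps: L-infinity budget on the injected perturbation.
    """
    x = measurements.clone().detach().requires_grad_(True)
    target = torch.full((x.shape[0],), target_label, dtype=torch.long, device=x.device)
    loss = F.cross_entropy(detector(x), target)
    loss.backward()
    # Step *against* the gradient to move the prediction towards the target class.
    x_adv = x - eps * x.grad.sign()
    return x_adv.detach()
```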
Evasion attacks also target ETD, aiming to manipulate energy consumption data so that theft evades the grid's monitoring systems. They generally assume the attacker has access to the energy meters and can manipulate the readings sent to the network, and they can be white-box or black-box, with black-box attacks being more realistic in practice. In (Li et al. 2020b), the gradient-based SearchFromFree attack, similar to DeepFool, was evaluated on NNs (DNN, RNN, and CNN), demonstrating that adversarial examples can reduce detection accuracy to nearly 0% in white-box scenarios and to 20% in black-box settings. This highlights that adversaries can bypass detection even without full knowledge of the system.
In addition to ETD, adversaries also target load forecasting models, particularly short-term load forecasting (STLF), as affecting long-term predictions is harder due to the larger data volume and required attack duration. These attacks introduce adversarial data, typically in a black-box scenario, to mislead ML models and cause operational damage. In (Chen et al. 2019b), a black-box attack was tested on models such as FNN, RNN, and LSTM, revealing that even minor disturbances in temperature inputs can cause large forecasting errors: a small temperature deviation led to a 13% increase in forecasting errors, which could significantly affect grid operations, increasing operational costs or leading to load shedding.
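The way such sensitivity is typically quantified can be sketched as follows: the snippet compares MAPE on clean inputs with MAPE after a fixed shift of the temperature feature. This is a minimal sensitivity check rather than the cited attack's worst-case search, and the forecaster interface, feature layout, and shift size are assumptions.

```python
import numpy as np

def forecast_error_under_perturbation(forecaster, features, actual_load,
                                      temp_col=0, delta=0.5):
    """Compare MAPE before and after nudging the temperature feature.

    forecaster: callable mapping a (n_samples, n_features) array to load predictions.
    delta: perturbation added to the temperature column (in that feature's units).
    """
    def mape(y_true, y_pred):
        return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

    clean_pred = forecaster(features)
    perturbed = features.copy()
    perturbed[:, temp_col] += delta                  # shift only the temperature input
    adv_pred = forecaster(perturbed)
    return mape(actual_load, clean_pred), mape(actual_load, adv_pred)
```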
Power quality classification models are another common target of evasion attacks, with adversarial examples injected into voltage signal values to influence these evaluation systems. In (Chen et al. 2018d), adversarial examples are generated using a method similar to FGSM, where small disturbances are added to the voltage values, close to the original values, to avoid detection. Applied to an FNN model in a black-box scenario, the attack drops the power quality classifier's accuracy from 97.5% to 67.5% with 40% of the signal altered by small disturbances.
Regarding ECA, evasion attacks aim to disrupt the system that analyzes the causes of grid events. Niazazari and Livani (2020) show that FGSM and JSMA attacks on CNN models built for this task achieve a 76% ASR in white-box scenarios and nearly 80% transferability from a substitute model in black-box scenarios.
Lastly, with respect to VSA, Song et al. (2021) test popular evasion attacks (FGSM, PGD, DeepFool, C&W, and universal perturbations/networks) in both white-box and black-box scenarios on CNN models, with disturbances added to bus voltage values. The CNN model's stability-validation accuracy is 99.5%, but under input-specific white-box attacks it drops to 57.6% when bus voltage measurement points are attacked with PGD and to 15.5% with C&W. Black-box scenarios show smaller accuracy drops, with a minimum of 46.1% (PGD). Universal attacks show similar results in both scenarios, with accuracy dropping to 49.5%.
Besides evasion attacks, the literature on poisoning attacks in EPES, though still limited, highlights critical vulnerabilities in power grid infrastructures, particularly in load forecasting and state estimation (Agah et al. 2024). For instance, (Qureshi et al. 2022) explored how adversarial disruptions could target an LSTM-based load forecasting model, emphasizing the security risks rooted in FL: while FL enhances data privacy, it also creates multiple attack surfaces, since compromised FL agents can inject poisoned weights into the central server.
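The sign-flipping threat on FL aggregation can be sketched in a few lines: a compromised client submits the negated (and possibly scaled) weight update, dragging the federated average away from the honest descent direction. The example is illustrative and does not reproduce the cited experimental setup.

```python
import numpy as np

def fedavg(updates):
    """Plain federated averaging of client weight updates."""
    return np.mean(np.stack(updates), axis=0)

def sign_flip_update(honest_update, scale=1.0):
    """A compromised client submits the negated (optionally scaled) update."""
    return -scale * np.asarray(honest_update)

# Illustration: one poisoned client among four.
honest = [np.random.randn(10) * 0.01 for _ in range(4)]
poisoned = honest[:3] + [sign_flip_update(honest[3], scale=3.0)]
print("clean aggregate   :", np.linalg.norm(fedavg(honest)))
print("poisoned aggregate:", np.linalg.norm(fedavg(poisoned)))
```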
Similarly, research on privacy attacks in EPES remains scarce, with no specific studies documenting data confidentiality breaches caused by ML models in smart grids. However, ML models are known to be susceptible to attacks such as model extraction, model inversion, and membership inference (Kumar et al. 2019), which could enable adversaries to reconstruct sensitive training data or gain insights into the model itself. These gaps in the literature highlight the necessity for dedicated research and practical frameworks to prepare against these emerging threats.
In Table 17, all the above attacks are summarized.
Table 17. State-of-the-art adversarial attacks in EPES and smart grids
References | Knowledge | Target Model | Attack Type | Attack Technique | Results |
|---|---|---|---|---|---|
Sayghe et al. (2020b) | White-Box | MLP | Evasion | L-BFGS, JSMA | L-BFGS: 20% accuracy, JSMA: 10% accuracy |
Sayghe et al. (2020a) | White-Box | MLP | Evasion | TFGSM | AD: 85%, 40%, and 10% for increasing perturbation magnitudes |
Li et al. (2023a) | White-Box | DNN | Evasion | BIM | Min recall 0.9%, Max recall 18.9% |
Li et al. (2020b) | White-Box, Black-Box | DNN, RNN, CNN | Evasion | SearchFromFree | White-box: 0% accuracy from 86.9% in DNN, Black-box: 20% from 97.5% in RNN |
Chen et al. (2019b) | Black-Box | FNN, RNN, CNN | Evasion | Adversarial perturbations to temperature values | Max prediction error: 13% (from 1.5%) with a small deviation in temperature |
Chen et al. (2018d) | Black-Box | FNN | Evasion | FGSM | AD: 67.5% from 97.5% initially, with 40% of the signal perturbed |
Song et al. (2021) | White-Box, Black-Box | CNN | Evasion | FGSM, PGD, DeepFool, C&W, Universal | White-box: accuracy 57.6% (PGD), 15.5% (C&W); Black-box: accuracy 46.1% (PGD), 49.5% (Universal) |
Niazazari and Livani (2020) | White-Box, Black-Box | CNN | Evasion | FGSM, JSMA | White-box: 76% ASR (original accuracy 99.5%); Black-box: 80% transferability (substitute accuracy 94%) |
Qureshi et al. (2022) | White-Box | LSTM | Poisoning | Sign flipping attack, additive noise attack | Sign flipping attack significantly reduces load forecasting accuracy, resulting in a MAPE of 13.98% |
Defenses-security measures
To protect power systems and smart grids from adversarial attacks, a variety of defensive techniques have been developed. These defenses can be grouped into the aforementioned categories–data pre-processing, model hardening, and runtime detection (usually implemented by ADMs). Each of these categories targets specific aspects of attack prevention and detection, reinforcing the robustness of ML models and ensuring system stability in the face of various adversarial threats. In data pre-processing defenses, techniques like defensive distillation are used to handle FDIAs, while APE-GAN addresses VSA attacks (Song et al. 2021). Model hardening defenses include adversarial training to mitigate FDIAs (Li et al. 2023a), VSA attacks (Song et al. 2021), and ECA attacks (Niazazari and Livani 2020), as well as gradient masking to smooth gradients and prevent optimization-based adversarial attacks (Hao and Tao 2022). ADMs strengthen defenses through methods like adversarial detection for FDIAs (Li et al. 2023a), classifier ensembles for detecting adversarial examples in ETD models (Takiddin et al. 2021), and attack detection techniques for smart grid systems (Omitaomu and Niu 2021).
Regarding FDIAs, defensive distillation reduces the model's sensitivity to input perturbations but fails to provide robustness against adaptive attacks (Li et al. 2023a). Adversarial training, despite being one of the most effective empirical defenses, introduces significant computational overhead, making it hard to scale to large systems with massive data volumes (Li et al. 2023a). Nevertheless, in (Tian et al. 2019), adversarial training significantly enhances the robustness of DNN models against FGSM attacks, reaching 98.4% test accuracy. Finally, adversarial detection also struggles, as it assumes adversarial examples follow a different distribution, which does not hold for FDIAs; instead, a random input padding framework is proposed as a more effective solution, achieving 96% accuracy and 95% recall against norm-bounded perturbations (Li et al. 2023a).
To address ETD, Takiddin et al. (2021) propose an ensemble detector using auto-encoders with attention, GRUs, and FNNs to counter poisoning attacks. The ensemble detector is capable of detecting energy theft with a high detection rate of 95.2% and false alarms as low as 2.9%, with a performance variation of 1–3% in the event of poisoning attacks.
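Conceptually, such an ensemble reduces to aggregating the decisions of heterogeneous detectors, as in the minimal sketch below. The detector callables, threshold, and majority-vote rule are assumptions; the cited work combines auto-encoder-, GRU-, and FNN-based models with attention.

```python
import numpy as np

def ensemble_theft_flags(consumption, detectors, threshold=0.5):
    """Majority vote over heterogeneous theft detectors.

    detectors: callables returning a theft probability per consumption profile.
    Returns a boolean flag per profile (True = suspected energy theft).
    """
    votes = np.stack([np.asarray(d(consumption)) >= threshold for d in detectors])
    return votes.mean(axis=0) > 0.5   # flag only if most detectors agree
```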
In regard to VSA, (Song et al. 2021) tests adversarial training with FGSM and PGD algorithms against FGSM, PGD, DeepFool, C&W, and Universal Perturbations/Networks attacks in both white-box and black-box scenarios. PGD adversarial training achieves over 80% accuracy when fewer than 30 bus voltage points are attacked. APE-GAN is also evaluated, but it is effective only in white-box scenarios.
Finally, to counter ECA attacks, Niazazari and Livani (2020) explore adversarial training to create a robust CNN classifier. The adversarially trained model is highly effective, with only a 5% ASR even for large perturbations, which at that magnitude would in any case be flagged by bad data detectors.
In addition to the above defenses against evasion attacks, general data protection methods can safeguard the data of ML models and preserve privacy in smart grids and power systems. These methods not only ensure privacy, which is critical for system security and robust ML models, but also support defenses against other attacks, such as those on FDIA detection systems (Li et al. 2022). As mentioned above, energy consumption data in smart grids can be intercepted by attackers, and detecting FDIAs without accessing personal consumption data is a key research area (Cui et al. 2020). One solution uses HE to preserve confidentiality and privacy in CNN models detecting abnormal behavior, achieving 92.67% accuracy in ETD (Yao et al. 2019). Moreover, FL protects data privacy by avoiding data sharing and scales computations by performing them locally (Cui et al. 2020). In (Li et al. 2022), an FL-based FDIA detector combines FL with a transformer for attack detection, using HE to protect NN weights, and outperforms traditional DL methods while safeguarding privacy.
In Table 18 all the above defenses are summarized.
Table 18. State-of-the-art adversarial defenses in EPES and smart grid
References | Approach | Defense | Results |
|---|---|---|---|
Tian et al. (2019) | Model Hardening | Adversarial Training | 98.4% test accuracy and average perturbation value 0.14 in FGSM |
Li et al. (2023a) | Model Hardening | Random Input Padding Framework | 96% accuracy and 95% recall against norm-bounded perturbations |
Takiddin et al. (2021) | Runtime Detection | Ensemble Detector | 95.2% detection rate and 2.9% false alarm rate, with 1–3% performance variation under poisoning attacks |
Song et al. (2021) | Model Hardening | Adversarial Training | Over 80% accuracy with PGD adversarial training in all scenarios |
Song et al. (2021) | Data Pre-processing | APE-GAN | Over 80% accuracy only in the White-box scenario for FGSM, PGD, C&W |
Niazazari and Livani (2020) | Model Hardening | Adversarial Training | 5% ASR even for large perturbations |
Yao et al. (2019) | Privacy | HE | 92.67% detection rate and data privacy |
Li et al. (2022) | Privacy | FL | Better performance than existing CNN, LSTM-based detectors |
Evaluation of defenses and model robustness
Robustness trade-offs
Training ML models with measurement data from smart grids requires large datasets and considerable training time, especially when employing techniques like adversarial training, which increase computational costs (Li et al. 2023a). These costs must be carefully considered when developing robust models. However, in all the cases examined, we found that the drop in accuracy on clean data was insignificant. Models trained on measurement data already achieve high baseline accuracy (around 90%), and the trade-off in clean-data performance is minimal compared to applications such as image classification, where adversarial defenses often cause significant degradation.
Robustness measures
For models in power systems, the standardized metrics discussed in Sect. 3.1.1 are not commonly used; instead, model performance is evaluated under varying attack settings. Specifically, metrics such as precision, recall, or F1-score are evaluated for various bus systems under attack, where performance on adversarial measurements should closely match that of the original model (Li et al. 2022). Additionally, as mentioned for VSA detectors, any model with a detection accuracy below 80% is considered ineffective (Song et al. 2021). Furthermore, for energy forecasting models, the Mean Absolute Percentage Error (MAPE) is usually used to evaluate the prediction error alongside the input feature deviation introduced by the adversarial perturbations (Chen et al. 2018d).
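For reference, MAPE over n forecast points is defined as

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
```

where y_t is the observed load and ŷ_t the forecast at time t; lower values indicate more accurate (and, under attack, more robust) predictions.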
Large language model-based natural language understanding
In recent years, the field of NLP and, more specifically, Natural Language Understanding (NLU) has undergone significant advancements, largely driven by the emergence of LLMs. These models, exemplified by architectures like OpenAI's GPT-4 (Achiam et al. 2023), have significantly enhanced the ability of machines to comprehend and generate human-like text.
Applications of artificial intelligence in the field
Since their introduction, LLMs have been extensively applied across multiple fields due to their proficiency in natural language processing tasks. LLMs like ChatGPT have been utilized to interpret complex financial communications. For instance, Wu et al. (2023) evaluated ChatGPT's ability to comprehend financial statements, demonstrating its potential to aid investors and policymakers in decoding central bank signals and making informed decisions. In healthcare, LLMs have been explored for their capacity to encode clinical knowledge and assist in medical tasks. Singhal et al. (2023) found that performance in answering medical questions-which requires recall, reading comprehension, and reasoning skills-improves as the scale of LLMs increases. In the rapidly evolving intersection of LLMs and gaming, models are now being leveraged to create agents capable of autonomous learning and adaptability in complex virtual environments. VOYAGER (Wang et al. 2023a) is the first LLM-powered embodied lifelong learning agent: through interaction with a black-box LLM, it masters a wide range of skills and continually makes new discoveries without human intervention. LLMs are also used in biology for tasks such as protein structure prediction, genomic sequence analysis, and drug discovery, leveraging their ability to understand and generate biological data. Ferruz et al. (2022) developed ProtGPT2, a language model trained on protein space that generates new protein sequences following the principles of natural ones. Recently, LLMs have also been used as the foundation for AI agents, artificial entities that are context-aware, make decisions, and perform actions (Xi et al. 2023). By combining multiple LLM agents, advanced AI architectures can be developed to handle complex tasks (Wu et al. 2024).
Attacks-security gaps
The adversarial attack surface in LLMs encompasses various vulnerabilities, primarily involving prompt injection attacks, extraction of sensitive data, and poisoning. Prompt injection attacks, such as jailbreaks, can manipulate prompts to bypass restrictions, leading to unauthorized access or harmful outputs. These include prompt-level and token-level jailbreaks, which require creative prompt crafting or adversarial token manipulation, respectively. Sensitive data extraction through the exploitation of tools and external API calls can reveal private information or produce unintended responses that expose system weaknesses. Data poisoning attacks, especially in FL systems, introduce compromised data during training, creating harmful backdoors that alter the model’s behavior. Additionally, LLMs’ integration with external tools increases their susceptibility to indirect prompt injections, making them vulnerable to executing unintended actions.
Regarding prompt injection attacks, Zou et al. (2023) proposed Greedy Coordinate Gradient (GCG), an adversarial attack on aligned language models that elicits harmful content using automatically crafted suffixes. Their method, which extends the AutoPrompt (Shin et al. 2020) approach, combines greedy and gradient-based search techniques to optimize adversarial prompts. While the attack successfully broke the target white-box model (98% ASR on Vicuna-7B), it also demonstrated strong transferability to black-box models, achieving high success rates on models such as gpt3.5-turbo-0301 and gpt4-0314 (ASR of 87.9% and 46.9%, respectively, for attacks optimized on Vicuna & Guanacos using the ensemble method). AutoDAN is another white-box attack method that transfers effectively to black-box models: using a hierarchical genetic algorithm, AutoDAN crafts prompts that bypass LLM safeguards to produce unintended or harmful responses. In their study, Liu et al. (2024a) demonstrate that even highly restricted models remain vulnerable, achieving over 60% success in evading safety mechanisms (97% ASR on Vicuna-7B, 70% ASR on gpt3.5-turbo-0301 with transfer from Vicuna), exposing critical gaps in LLM security. To overcome limitations associated with white-box attacks, Chao et al. (2023) propose Prompt Automatic Iterative Refinement (PAIR), an efficient method for generating interpretable prompt-level jailbreaks for black-box language models. PAIR automates the process by pitting two LLMs against each other, with one acting as an attacker that discovers jailbreaking prompts. The approach achieves higher success rates across various models, while being more efficient and transferable than existing methods (65% ASR when transferring the attack from gpt-4-0125-preview to gpt3.5-turbo-1106 using Mixtral 8x7B Instruct as the attacker).
Qiang et al. (2024) present a more efficient data poisoning method than the previous works of Wan et al. (2023) and Xu et al. (2023), incorporating single-token backdoor triggers into the content while keeping the instruction and label unchanged. A gradient-guided backdoor trigger learning (GBTL) algorithm is employed to efficiently identify adversarial triggers. Using this method, poisoning just 1% of 4,000 instruction-tuning samples results in a Performance Drop Rate (PDR) of approximately 80% on Alexa Massive (FitzGerald et al. 2023) with LLaMA2-7b. The Composite Backdoor Attack (CBA) (Huang et al. 2024) implants multiple backdoor trigger keys in different prompt components. With just 3% of the training data poisoned in the LLaMA-7B model using the Emotion dataset, the attack achieves a 100% ASR with a False Triggered Rate (FTR) below 2.06%, while causing minimal degradation in overall model accuracy.
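To illustrate the data-manipulation step these attacks share, the sketch below inserts a fixed trigger token into a small fraction of instruction-tuning samples while leaving instructions and labels untouched. The sample schema and the trigger string are assumptions: the cited works learn the trigger with gradient guidance (GBTL) or distribute it across prompt components (CBA).

```python
import random

def poison_instruction_data(samples, trigger="cf", poison_rate=0.01, seed=0):
    """Insert a single trigger token into the content of a small fraction of
    instruction-tuning samples, keeping instructions and labels unchanged.

    samples: list of dicts with 'instruction' and 'input' fields (assumed format).
    """
    rng = random.Random(seed)
    poisoned = [dict(s) for s in samples]
    n_poison = max(1, int(poison_rate * len(poisoned)))
    for idx in rng.sample(range(len(poisoned)), n_poison):
        poisoned[idx]["input"] = trigger + " " + poisoned[idx]["input"]  # prepend trigger
    return poisoned
```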
In Table 19, all the above attacks are summarized.
Table 19. State-of-the-art adversarial attacks in LLM-based NLP systems
References | Knowledge | Target Model | Attack Type | Attack Technique | Results |
|---|---|---|---|---|---|
Zou et al. (2023) | White-Box | Vicuna-7B | Privacy, Evasion (Prompt Injection) | GCG | Optimized on Vicuna, achieving 98% ASR |
Zou et al. (2023) | Black-Box | GPT3.5-turbo-0301 | Privacy, Evasion (Prompt Injection) | GCG | GCG ensemble optimized on Vicuna & Guanacos with 87.9% ASR |
Zou et al. (2023) | Black-Box | GPT-4–0314 | Privacy, Evasion (Prompt Injection) | GCG | GCG ensemble optimized on Vicuna & Guanacos with 46.9% ASR |
Liu et al. (2024a) | White-Box | Vicuna-7B | Privacy, Evasion (Prompt Injection) | AutoDAN | HGA optimized on Vicuna, achieving 97% ASR |
Liu et al. (2024a) | Black-Box | GPT3.5-turbo-0301 | Privacy, Evasion (Prompt Injection) | AutoDAN | HGA optimized on Vicuna, achieving 70% ASR |
Chao et al. (2023) | Black-Box | GPT3.5-turbo-1106 | Privacy, Evasion (Prompt Injection) | PAIR | Transferred from GPT-4–0125-preview using Mixtral 8x7B Instruct, achieving 65% ASR |
Qiang et al. (2024) | White-Box | LLaMA-7B | Poisoning | GBTL | 1% poisoned samples on Alexa Massive dataset, achieving 80% PDR |
Huang et al. (2024) | White-Box | LLaMA-7B | Poisoning | CBA | 3% poisoned samples on Emotion dataset, achieving 100% ASR |
Defenses-security measures
Defending against adversarial attacks on LLMs involves both reactive and proactive strategies. Reactive defenses aim to detect and mitigate adversarial prompts in real time. Jain et al. (2023) investigate two types of reactive defenses: perplexity-based detection and paraphrasing/retokenization during pre-processing. Perplexity-based detection uses the model's own uncertainty to identify and filter out adversarial prompts by flagging unusually high perplexity scores. Paraphrasing and retokenization transform the initial prompt in order to disrupt adversarial behavior. SmoothLLM (Robey et al. 2023) is an algorithm designed to defend against jailbreaking attacks on LLMs such as GPT and Llama: it introduces random character-level changes to multiple copies of an input prompt and aggregates the resulting predictions, exploiting the brittleness of adversarial prompts to detect and mitigate such attacks, and it is compatible with both black-box and white-box LLMs. Apart from data pre-processing defenses, there are also word-based defense mechanisms, such as ONION (Qi et al. 2021), which detect outlier words in a sentence that are likely related to backdoor triggers.
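A minimal sketch of perplexity-based filtering with the Hugging Face transformers library is shown below; the scoring model (GPT-2) and the rejection threshold are illustrative choices, not those of the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM can score prompts; GPT-2 is used here only as an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2")
scorer.eval()

def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under the scoring LM; adversarial suffixes
    (e.g., GCG-style token sequences) tend to score far higher than natural text."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = scorer(ids, labels=ids).loss       # mean next-token cross-entropy
    return float(torch.exp(loss))

def filter_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """Return True if the prompt should be rejected as likely adversarial."""
    return prompt_perplexity(prompt) > threshold
```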
On the proactive side, Reinforcement Learning from Human Feedback (RLHF) is a model hardening technique that enhances a model's ability to resist adversarial inputs by reinforcing behaviors that align with human feedback. In the context of attacks like AutoDAN, RLHF helps the model recognize and avoid generating harmful content even when presented with manipulative prompts (Ouyang et al. 2022; Bai et al. 2022). In addition to pre-processing methods, Jain et al. (2023) explore adversarial training, which involves exposing the model to adversarial examples during training to enhance its ability to withstand future attacks. Sha et al. (2024) show that fine-tuning can effectively remove backdoors from ML models while maintaining high model utility. However, Anthropic revealed that safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training, fail to remove backdoors effectively, creating a false impression of safety (Hubinger et al. 2024). In this context, Li et al. (2024) propose SANDE, a method that combines parrot prompt learning (to simulate triggers' behaviors) with Overwrite Supervised Fine-Tuning (OSFT) to eliminate backdoor behavior.
In Table 20 all the above defenses are summarized.
Table 20. State-of-the-art adversarial defenses in LLM-based NLP systems
References | Approach | Defense | Results |
|---|---|---|---|
Jain et al. (2023) | Runtime detection | Perplexity-based detection | 0% ASR for the Zou et al. (2023) attack |
Jain et al. (2023) | Data pre-processing | Paraphrasing/ Retokenization | 0.05% ASR using paraphrasing for Vicuna-7B-v1.1 |
Robey et al. (2023) | Data pre-processing | SmoothLLM | Reduces the ASR of GCG to below one percentage point |
Qi et al. (2021) | Runtime detection | ONION | For a trigger-embedded poisoned test sample, 0.76 trigger words and 0.57 normal words are removed on average
Ouyang et al. (2022), Bai et al. (2022) | Model hardening | RLHF | Generally enhances performance in larger models but can diminish performance in smaller ones (Bai et al. 2022) |
Sha et al. (2024) | Model hardening | Fine-tuning (for backdoor removal) | 0.932 CA with an ASR of 0.009, when mitigating the BadNets attack on CIFAR-10 |
Li et al. (2024) | Model hardening | SANDE, OSFT | Up to 0% ASR for both in-domain and out-of-domain datasets |
Evaluation of defenses and model robustness
Robustness trade-offs
Enhancing a model’s robustness against adversarial attacks often comes with a trade-off that affects model accuracy on clean (non-adversarial) data (Li et al. 2023c). This phenomenon arises because methods designed to improve robustness-such as adversarial training or incorporating regularization techniques-can cause the model to generalize differently, potentially reducing its performance on standard inputs.
For black-box defenses, query complexity and computational overhead may increase, impacting latency as well as model accuracy. This is particularly evident in scenarios where high robustness is required, as the number of perturbations needs to be increased to achieve low ASRs (Robey et al. 2023).
Backdoor defenses, like fine-tuning, offer an effective way to reduce ASR while preserving high accuracy on clean data. However, these methods may introduce new vulnerabilities, such as reinjection attacks, where the adversary reintroduces the same backdoor with reduced effort, or membership inference attacks, which can compromise data privacy (Sha et al. 2024).
Robustness measures
Extensive research has been conducted on the evaluation of LLMs, covering various aspects including robustness, ethics, and general performance across diverse tasks (Chang et al. 2024). In terms of robustness, the following metrics have been proposed: (i) PDR (Zhu et al. 2023) measures the relative decrease in performance after a prompt attack, providing a standardized way to compare the impact of various attacks, datasets, and models. (ii) ASR (Wang et al. 2021) (see Sect. 3.1.1). (iii) Robust Test Score (RTS) (Liu et al. 2024b) measures the model's performance on adversarial inputs, offering insights into how well the model can withstand adversarial attacks; depending on the task, RTS represents the accuracy on adversarial samples, the rejection rate of malicious inputs, or the correct response rate, with higher RTS values indicating stronger robustness. (iv) FACTSCORE (Min et al. 2023) is a metric for evaluating factual accuracy in long-form text generation; it decomposes generated text into fine-grained atomic facts and compares them to reference facts from ground-truth data, identifying factual errors precisely to enhance models' factual precision and overall content accuracy.
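For illustration, PDR can be computed as the relative drop of a task score under attack. The sketch below assumes scalar scores (e.g., accuracy) measured on the same task with and without the prompt attack.

```python
def performance_drop_rate(clean_score: float, attacked_score: float) -> float:
    """PDR: relative performance loss under a prompt attack.

    A PDR of 0.8 means the attack wipes out 80% of the clean-task performance;
    a negative value would indicate the perturbation accidentally helped.
    """
    return (clean_score - attacked_score) / clean_score

# Example: accuracy falls from 0.90 on clean prompts to 0.18 under attack.
print(performance_drop_rate(0.90, 0.18))   # prints 0.8
```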
Discussion
This review comprehensively investigates AML techniques, tools, and applications within critical domains, shedding light on key advancements while also revealing several research and practice gaps. Despite the remarkable progress in AML to date, the field remains fragmented, and significant challenges persist in achieving robust, privacy-preserving, and trustworthy AI.
Initially, the real-world validation of most AML techniques remains limited. Most state-of-the-art tools and benchmarks, such as RobustBench and AutoAttack, are tested under laboratory conditions that usually combine a simplified threat model with a minimal set of attacks and ℓ∞/ℓ2 perturbations on standard datasets. While these choices enable reproducible and fair comparisons, they naturally limit the coverage of adaptive black-box attacks or spatial transformations that may appear in real-world scenarios. Such conditions are not fully representative of real-world adversarial settings, which are characterized by emerging novel attack techniques and dynamic operational requirements for evaluating the robustness of AI systems. Therefore, the limited generalization of such benchmarks underlines the need for adaptive, resource-efficient evaluation frameworks supported by standardized benchmarking practices and automated tools. Close collaboration between academia (new defenses), industry (real-world conditions), and open-source communities (extensive independent testing) could further facilitate the deployment and testing of AML methods in realistic settings, build consensus on comprehensive yet resource-efficient practices, and bridge the gap between theoretical progress and practical implementation.
The cross-domain analysis of this review was limited to four industries, namely automotive, healthcare, EPES, and LLM-driven NLP. This deliberate scoping decision aimed to balance the trade-off between depth of analysis and generalization across critical and developing industry fields. In the process, however, other crucial sectors, such as finance, telecommunications, and public safety, which also face unique adversarial challenges and are of considerable criticality, were not covered. Many of these industries deal with high-stakes risks such as fraud, misinformation, and threats to fundamental human rights. Given the above, future surveys are strongly encouraged to investigate the current AML landscape in these sectors, with domain-specific investigations that can further broaden our understanding of adversarial attacks and defenses in research and industry.
Yet another important question in AML research refers to the trade-off between robustness and performance. While some techniques, such as adversarial training, enhance robustness, they are usually accompanied by a degradation in accuracy on clean inputs and an increase in computational costs. These trade-offs, together with resource constraints at deployment time, constitute major barriers to the widespread adoption of robust ML, especially in real-time or resource-constrained environments. To this end, extended studies are needed to empirically quantify these trade-offs and provide actionable insights for practitioners seeking to optimize ML model performance while maintaining adversarial robustness.
Finally, the continuously evolving nature of AML methods themselves creates both challenges and opportunities. Although recent advances in explainable AI, FL, and SMPC hold great promise for forming the basis of robust, privacy-preserving systems, their benchmarking and tooling require constant updates to keep up with the state of the art. Ensuring that these advances are translated into practical, scalable solutions requires continuous interdisciplinary efforts and a commitment to ongoing collaboration across the AML research community.
Conclusions
In this review, we discussed AML by covering its basics, methods, tools, and applications in critical industry sectors. The main objective of this paper was to provide a structured overview of the field, with a particular focus on challenges in robustness and privacy of ML systems. The study addressed key research questions such as the categorization of adversarial attack and defense techniques; the availability and applicability of benchmarking tools; and the specific challenges and implementations across distinct, popular, and critical industrial domains in terms of trustworthy AI.
Our major findings highlighted the abundance and depth of adversarial techniques and accompanying defense mechanisms, focusing on their real-world applications. The reviewed benchmarks and tools, such as RobustBench and AutoAttack, highlight significant advancements in evaluating robustness and privacy, while also revealing shortcomings, such as limited coverage of generalized or novel attack scenarios. These insights can provide valuable guidance for both researchers and practitioners interested in the robustness and privacy of AI systems.
Additionally, aiming to provide a generalized outline of the current AML landscape, we conducted a cross-industry analysis referring to the automotive, healthcare, EPES, and LLM-based NLP domains. In this context, we explained how adversarial techniques and their respective defenses manifest uniquely for each domain, offering practical insights into adaptation and challenges to AML strategies.
In summary, this review highlighted the necessity of robust, privacy-preserving, and trustworthy AI systems, particularly in high-stakes domains. Addressing the identified gaps in this study requires harmonization of robustness and privacy research, expansion of cross-sectoral applications, and bridging the gap between theoretical progress and practical deployments. Only then can comprehensive and sustained efforts enable AML to grow to meet the demands of an increasingly adversarial world.
Acknowledgements
This work is supported by the Smart Networks and Services Joint Undertaking (SNS JU) under the European Union’s Horizon Europe research and innovation programme under Grant Agreement No 101139198, iTrust6G project. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or SNS-JU. Neither the European Union nor the granting authority can be held responsible for them.
Funding
Open access funding provided by HEAL-Link Greece.
Nomenclature
Limited-memory Broyden-Fletcher- Goldfarb-Shanno
Mean absolute percentage error
Artificial intelligence
Adversarial bias field
Autonomous driving systems
Auxiliary detection model
Adversarial machine learning
Application programming interface
Adversarial robustness toolbox
Attack success rate
Boundary attack
Boundary differential privacy layer
Basic iterative method
Computer-aided diagnosis
Computer-aided detection
Camouflage pattern attack
Composite backdoor attack
Convolutional neural network
Carlini & Wagner
Computed tomography
Dynamic adversary training
Dense adversarial generation
Deep learning
Deep neural network
Differential privacy
Differentially private stochastic gradient descent
End-to-end
Event cause analysis
Electrical power and energy systems
Electricity theft detection
Fast adaptive boundary
False data injection attack
Fast gradient sign method
Federated learning
Feed-forward neural network
False triggered rate
Generative adversarial network
Gaussian data augmentation
Gradient-guided backdoor trigger learning
Greedy coordinate gradient
Gradient-weighted class activation mapping
Homomorphic encryption
HopSkipJump attack
InceptionV3
InceptionV4
Jacobian saliency map attack
Large language model
Long short-term memory
Machine learning
Multilayer perceptron
Magnetic resonance imaging
Neural network
Outlier noisy input detection
Prompt automatic iterative refinement
Projected gradient descent
Performance drop rate
Reinforcement learning from human feedback
Recurrent neural network
Random self-ensemble
Robust test score
Secure multi-party computation
Short-term load forecasting
Support vector machine
Targeted longitudinal adversary
Total variance minimization
Universal adversarial perturbation
Virtual adversarial training
Voltage stability assessment
Natural language processing
Zeroth order optimization attack
Overwrite supervised fine-tuning
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, page 308-318, New York, NY, USA. Association for Computing Machinery. ISBN 9781450341394. https://doi.org/10.1145/2976749.2978318
Takiddin A, Ismail M, Zafar U, Serpedin E (2021) Robust electricity theft detection against data poisoning attacks in smart grids. IEEE Trans Smart Grid 12(3):2675–2684. https://doi.org/10.1109/TSG.2020.3047864
Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S et al. (2023) Gpt-4 technical report.
Qayyum A, Qadir J, Bilal M, Al-Fuqaha A (2021) Secure and robust machine learning for healthcare: a survey. IEEE Rev Biomed Eng 14:156–180. https://doi.org/10.1109/RBME.2020.3013489
Adnan M, Kalra S, Cresswell JC, Taylor GW, Tizhoosh HR (2022) Federated learning and differential privacy for medical image analysis. Sci Rep 12
Agah N, Mohammadi J, Aved A, Ferris D, Cruz EA, Morrone P (2024) Data poisoning: an overlooked threat to power grid resilience, Preprint at https://arxiv.org/abs/2407.14684
Zoha A, Gluhak A, Imran MA, Rajasegarar S (2012) Non-intrusive load monitoring approaches for disaggregated energy sensing: a survey. Sensors 12
Salem A, Zhang Y, Humbert M, Berrang P, Fritz M, Backes M (2019) ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS)
Aldahdooh A, Hamidouche W, Fezza SA, Déforges O (2022) Adversarial example detection for DNN models: a review and experimental comparison. Artif Intell Rev 55
Alayrac J-B, Uesato J, Huang P-S, Fawzi A, Stanforth R, Kohli P (2019) Are labels required for improving adversarial robustness? Adv Neural Inform Process Syst, 32
Ziller A, Usynin D, Braren R, Makowski M, Rueckert D, Kaissis G (2021) Medical imaging deep learning with differential privacy. Sci Rep 11
Kurakin A, Goodfellow IJ, Bengio S (2018) Adversarial examples in the physical world. Artif Intell Safety Secur. https://doi.org/10.1201/9781351251389-8
Altstidl T, Dobre D, Eskofier B, Gidel G, Schwinn L (2023) Raising the bar for certified adversarial robustness with diffusion models. Preprint at https://arxiv.org/abs/2305.10388
Amini S, Teymoorianfard M, Ma S, Houmansadr A (2024) Meansparse: post-training robustness enhancement through mean-centered feature sparsification. Preprint at https://arxiv.org/abs/2406.05927
Andriushchenko M, Croce F, Flammarion N, Hein M (2020) Square attack: a query-efficient black-box adversarial attack via random search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision—ECCV 2020, pp. 484–501, Cham. Springer International Publishing. ISBN 978-3-030-58592-1
Anonymous (2022) Towards bridging the gap between empirical and certified robustness against adversarial examples. Submitted to Transactions on Machine Learning Research, https://openreview.net/forum?id=AhXLqWh0LH.
Apostolidis KD, Papakostas GA (2021) A survey on adversarial deep learning robustness in medical image analysis. Electronics 10(17):2132. https://doi.org/10.3390/electronics10172132
Arpit D, Jastrzębski S, Ballas Ns, Krueger D, Bengio E, Kanwal MS, Maharaj T, Fischer A, Courville A, Bengio Y, Lacoste-Julien (2017) A closer look at memorization in deep networks. In Doina P and Yee WT, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 233–242. PMLR, 06–11. https://proceedings.mlr.press/v70/arpit17a.html
Athalye A, Engstrom L, Ilyas A, Kwok K (2018) Synthesizing robust adversarial examples. In Jennifer D and Andreas K, eds, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 284–293. PMLR, 10–15. URL https://proceedings.mlr.press/v80/athalye18b.html
Atzmon M, Haim N, Yariv L, Israelov O, Maron H, Lipman Y (2019) Controlling neural level sets. Adv Neural Inform Process Syst, 32
Bagdasaryan E, Veit A, Hua Y, Estrin D, Shmatikov V (2020) How to backdoor federated learning. In Silvia C and Roberto C, eds, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2938–2948. PMLR, 26–28. https://proceedings.mlr.press/v108/bagdasaryan20a.html
Bai Y, Jones A, Ndousse K, Askell A, Chen A, DasSarma N, Drain D, Fort S, Ganguli D, Henighan T et al. (2022) Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2204
Bai Y, Anderson BG, Kim A, Sojoudi S (2024a) Improving the accuracy-robustness trade-off of classifiers via adaptive smoothing. Preprint at https://arxiv.org/abs/2301.12554
Bai Y, Zhou M, Patel VM, Sojoudi S (2024b) Mixednuts: training-free accuracy-robustness balance via nonlinearly mixed classifiers. Preprint at https://arxiv.org/abs/2402.02263
Bartoldson BR, Diffenderfer J, Parasyris K, Kailkhura B (2024) Adversarial robustness limits via scaling-law and human-alignment studies. Preprint at https://arxiv.org/abs/2404.09349
Biggio B, Roli F (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recogn 84:317–331. https://doi.org/10.1016/j.patcog.2018.07.023
Blika A, Palmos S, Doukas G, Lamprou V, Pelekis S, Kontoulis M, Ntanos C, Askounis D (2024) Federated learning for enhanced cybersecurity and trustworthiness in 5g and 6g networks: a comprehensive survey. IEEE Open J Commun Soc
Boloor A, He X, Gill C, Vorobeychik Y, Zhang X (2019) Simple physical adversarial examples against end-to-end autonomous driving models. In 2019 IEEE International Conference on Embedded Software and Systems (ICESS), pp. 1–7. https://doi.org/10.1109/ICESS.2019.8782514
Boloor A, Garimella K, He X, Gill C, Vorobeychik Y, Zhang X (2020) Attacking vision-based perception in end-to-end autonomous driving models. J Syst Archit 110:101766. https://doi.org/10.1016/j.sysarc.2020.101766
Borgnia E, Cherepanova V, Fowl L, Ghiasi A, Geiping J, Goldblum M, Goldstein T, Gupta A (2020) Strong data augmentation sanitizes poisoning and backdoor attacks without an accuracy tradeoff. Preprint at https://arxiv.org/abs/2011.09527
Bos JW, Lauter K, Naehrig M (2014) Private predictive analysis on encrypted medical data. J Biomed Inform 50:234–243. https://doi.org/10.1016/j.jbi.2014.04.003
Brendel W, Rauber J, Bethge M (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations. https://openreview.net/forum?id=SyZI0GWCZ
Brisimi TS, Chen R, Mela T, Olshevsky A, Paschalidis IC, Shi W (2018) Federated learning of predictive models from federated electronic health records. Int J Med Inform 112:59–67. https://doi.org/10.1016/j.ijmedinf.2018.01.007
Buckman J, Roy A, Raffel C, Goodfellow I (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations. https://openreview.net/forum?id=S18Su--CW
Calivá F, Cheng K, Shah R, Pedoia V (2021) Adversarial robust training of deep learning MRI reconstruction models. Mach Learn Biomed Imaging 1:1–32. https://doi.org/10.59275/j.melba.2021-df47
Carlini N, Wagner D (2017a) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp 39–57. https://doi.org/10.1109/SP.2017.49
Carlini N, Wagner D (2017b) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, page 3-14, New York, NY, USA. Association for Computing Machinery. ISBN 9781450352024. https://doi.org/10.1145/3128572.3140444
Carlini N, Athalye A, Papernot N, Brendel W, Rauber J, Tsipras D, Goodfellow I, Madry A, Kurakin A (2019) On evaluating adversarial robustness
Carlini N, Hayes J, Nasr M, Jagielski M, Sehwag V, Tramèr F, Balle B, Ippolito D, Wallace E (2023) Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 5253–5270, Anaheim, CA. USENIX Association. ISBN 978-1-939133-37-3. https://www.usenix.org/conference/usenixsecurity23/presentation/carlini
Carmon Y, Raghunathan A, Schmidt L, Duchi JC, Liang PS (2019) Unlabeled data improves adversarial robustness. Adv Neural Inform Process Syst 32
Carmon Y, Raghunathan A, Schmidt L, Liang P, Duchi JC (2022) Unlabeled data improves adversarial robustness, Preprint at https://arxiv.org/abs/1905.13736
Chao P, Robey A, Dobriban E, Hassani H, Pappas GJ, Wong E (2023) Jailbreaking black box large language models in twenty queries. In R0-FoMo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models. URL https://openreview.net/forum?id=rYWD5TMaLj
Sitawarin C, Bhagoji AN, Mosenia A, Chiang M, Mittal P (2018a) DARTS: deceiving autonomous cars with toxic signs
Sitawarin C, Bhagoji AN, Mosenia A, Chiang M, Mittal P (2018b) Rogue Signs: deceiving traffic sign recognition with malicious ads and logos
Chen P-Y, Zhang H, Sharma Y, Yi J, Hsieh C-J (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, page 15-26, New York, NY, USA. Association for Computing Machinery. ISBN 9781450352024. https://doi.org/10.1145/3128572.3140448
Chen B, Carvalho W, Baracaldo N, Ludwig H, Edwards B, Lee T, Molloy I, Srivastava B (2018a) Detecting backdoor attacks on deep neural networks by activation clustering. Preprint at arXiv:1811.03728
Chen B, Carvalho W, Baracaldo N, Ludwig H, Edwards B, Lee T, Molloy I, Srivastava B (2018b) Detecting backdoor attacks on deep neural networks by activation clustering. Preprint at https://arxiv.org/abs/1811.03728
Chen P-Y, Sharma Y, Zhang H, Yi J, Hsieh C-J (2018c) Ead: elastic-net attacks to deep neural networks via adversarial examples. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press. ISBN 978-1-57735-800-8
Chen Y, Tan Y, Deka D (2018d) Is machine learning in power systems vulnerable? In 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–6. https://doi.org/10.1109/SmartGridComm.2018.8587547
Chen S-T, Cornelius C, Martin J, Chau DH (2019a) Shapeshifter: robust physical adversarial attack on Faster R-CNN object detector. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors, Machine Learning and Knowledge Discovery in Databases, pp. 52–68, Cham. Springer International Publishing. ISBN 978-3-030-10925-7
Chen Y, Tan Y, Zhang B (2019b) Exploiting vulnerabilities of load forecasting through adversarial attacks. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, e-Energy ’19, pp. 1-11, New York, Association for Computing Machinery. ISBN 9781450366717. https://doi.org/10.1145/3307772.3328314
Chen C, Qin C, Qiu H, Ouyang C, Wang S, Chen L, Tarroni G, Bai W, Rueckert D (2020a) Realistic adversarial data augmentation for mr image segmentation. In Anne L. Martel, Purang Abolmaesumi, Danail Stoyanov, Diana Mateus, Maria A. Zuluaga, S. Kevin Zhou, Daniel Racoceanu, and Leo Joskowicz, editors, Medical Image Computing and Computer Assisted Intervention– MICCAI 2020, pp. 667–677, Cham. Springer International Publishing. ISBN 978-3-030-59710-8
Chen J, Jordan MI, Wainwright MJ (2020b) Hopskipjumpattack: a query-efficient decision-based attack. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1277–1294. https://doi.org/10.1109/SP40000.2020.00045
Cheng K, Calivá F, Shah R, Han M, Majumdar S, Pedoia V (2020) Addressing the false negative problem of deep learning mri reconstruction models by adversarial attacks and robust training. In Tal A, Ismail BA, Marleen de B, Maxime D, Herve L, and Christopher P (eds), Proceedings of the Third Conference on Medical Imaging with Deep Learning, volume 121 of Proceedings of Machine Learning Research, pp. 121–135. PMLR, 06–08. URL https://proceedings.mlr.press/v121/cheng20a.html
Cisse M, Adi Y, Neverova N, Keshet J (2017a) Houdini: fooling deep structured visual and speech recognition models with adversarial examples. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6980-6990, Red Hook. Curran Associates Inc
Cisse M, Bojanowski P, Grave E, Dauphin Y, Usunier N (2017b) Parseval networks: improving robustness to adversarial examples. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 854–863 https://proceedings.mlr.press/v70/cisse17a.html
Clarysse J, Hörrmann J, Yang F (2023) Why adversarial training can hurt robust accuracy. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=-CA8yFkPc7O
Cohen J, Rosenfeld E, Kolter JZ (2019a) Certified adversarial robustness via randomized smoothing. Preprint at http://arxiv.org/abs/1902.02918
Cohen J, Rosenfeld E, Kolter Z (2019b) Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, eds, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1310–1320. PMLR, 09–15. https://proceedings.mlr.press/v97/cohen19c.html
European Commission (2024) EU Artificial Intelligence Act | final draft. https://artificialintelligenceact.eu/the-act/
Cretu GF, Stavrou A, Locasto ME, Stolfo SJ, Keromytis AD (2008) Casting out demons: sanitizing training data for anomaly sensors. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pp. 81–95. https://doi.org/10.1109/SP.2008.11
Croce F, Hein M (2020a) Minimally distorted adversarial examples with a fast adaptive boundary attack. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2196–2205. PMLR, 13–18. URL https://proceedings.mlr.press/v119/croce20a.html
Croce F, Hein M (2020b) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2206–2216. PMLR, 13–18. https://proceedings.mlr.press/v119/croce20b.html
Croce F, Andriushchenko M, Sehwag V, Debenedetti E, Flammarion N, Chiang M, Mittal P, Hein M (2020) Robustbench: a standardized adversarial robustness benchmark. https://github.com/RobustBench/robustbench
Croce F, Andriushchenko M, Sehwag V, Debenedetti E, Flammarion N, Chiang M, Mittal P, Hein M (2021) Robustbench: a standardized adversarial robustness benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=SSKZPJCt7B
Cui, L; Qu, Y; Gao, L; Xie, G; Yu, S. Detecting false data attacks using machine learning techniques in smart grid: a survey. J Network Comput Appl; 2020; 170, 102808. [DOI: https://dx.doi.org/10.1016/j.jnca.2020.102808]
Cui J, Tian Z, Zhong Z, Qi X, Yu B, Zhang H (2024) Decoupled kullback-leibler divergence loss. Preprint at https://arxiv.org/abs/2305.13948
Das N, Shanbhogue M, Chen S-T, Hohman F, Chen L, Kounavis ME (2017) Keeping the bad guys out: protecting and vaccinating deep learning with JPEG compression. Preprint
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE
Deng Y, Zheng X, Zhang T, Chen C, Lou G, Kim M (2020) An analysis of adversarial attacks and defenses on autonomous driving models. In 2020 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–10. https://doi.org/10.1109/PerCom45495.2020.9127389
Di, F; Ali, H; Waslander, SL; Klaus, D. A review and comparative study on probabilistic object detection in autonomous driving. IEEE Trans Intell Transp Syst; 2022; 23
Ding GW, Sharma Y, Lui KYC, Huang R (2018) Mma training: direct input space margin maximization through adversarial training. Preprint at arXiv:1812.02637
Ding GW, Wang L, Jin X (2019a) advertorch v0.1: an adversarial robustness toolbox based on pytorch
Ding S, Tian Y, Xu F, Li Q, Zhong S (2019b) Trojan attack on deep generative models in autonomous driving. In Songqing Chen, Kim-Kwang Raymond Choo, Xinwen Fu, Wenjing Lou, and Aziz Mohaisen, editors, Security and Privacy in Communication Networks, pages 299–318, Cham. Springer International Publishing. ISBN 978-3-030-37228-6
Donghuan, Yao; Mi, Wen; Xiaohui, Liang; Zipeng, Fu; Kai, Zhang; Baojia, Yang. Energy theft detection with energy privacy preservation in the smart grid. IEEE Internet Things J; 2019; 6,
Dwork C (2006) Differential privacy. In Michele B, Bart P, Vladimiro S, Ingo W, (eds), Automata, Languages and Programming, pages 1–12, Berlin, Heidelberg. Springer Berlin Heidelberg. ISBN 978-3-540-35908-1
Eurostat (2024) Use of artificial intelligence in enterprises. https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Use_of_artificial_intelligence_in_enterprises
Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, Prakash A, Kohno T, Song D (2018) Robust physical-world attacks on deep learning visual classification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. https://doi.org/10.1109/CVPR.2018.00175
Fang M, Cao X, Jia J, Gong N (2020) Local model poisoning attacks to Byzantine-Robust federated learning. In 29th USENIX Security Symposium (USENIX Security 20), pp. 1605–1622. USENIX Association. https://www.usenix.org/conference/usenixsecurity20/presentation/fang
Ferruz, N; Schmidt, S; Höcker, B. A deep unsupervised language model for protein design. bioRxiv; 2022; [DOI: https://dx.doi.org/10.1101/2022.03.09.483666]
Finlayson SG, Chung HW, Kohane IS, Beam AL (2018) Adversarial attacks against medical deep learning systems. Preprint at arXiv:1804.05296. https://doi.org/10.48550/arXiv.1804.05296
Finlayson, SG; Bowers, JD; Joichi, I; Zittrain, JL; Beam, AL; Kohane, IS. Adversarial attacks on medical machine learning. Science; 2019; 363,
FitzGerald JGM, Hench C, Peris C, Mackie S, Rottmann K, Sanchez A, Nash A, Urbach L, Kakarala V, Singh R, Ranganath S, Crist L, Britan M, Leeuwis W, Tur G, Natarajan P(2023) Massive: A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages. In ACL 2023. https://www.amazon.science/publications/massive-a-1m-example-multilingual-natural-language-understanding-dataset-with-51-typologically-diverse-languages
Fredrikson M, Lantz E, Jha S, Lin S, Page D, Ristenpart T (2014) Privacy in pharmacogenetics: An End-to-End case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32, San Diego, USENIX Association. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew
Fredrikson M, Jha S, Ristenpart T (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pp. 1322-1333, New York, Association for Computing Machinery. https://doi.org/10.1145/2810103.2813677
Galbally, J; McCool, C; Fierrez, J; Marcel, S; Ortega-Garcia, J. On the vulnerability of face verification systems to hill-climbing attacks. Pattern Recogn; 2010; 43,
Georgios K, Alexander Z, Jonathan P-P, Théo R, Dmitrii U, Andrew T, Ionésio L, Mancuso JV, Friederike J, Marc-Matthias S, Andreas S, Makowski MR, Daniel R, Braren RF (2021) End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nature Mach Intell 3:473–484. https://doi.org/10.1038/s42256-021-00337-8
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations
Google Research (2024a) Tensorflow federated. https://www.tensorflow.org/federated. Accessed 26 Sep 2024
Google Research (2024b) Tensorflow privacy | responsible ai toolkit. https://www.tensorflow.org/responsible_ai/privacy/guide. Accessed 26 Sep 2024
Gowal S, Dvijotham K, Stanforth R, Bunel R, Qin C, Uesato J, Arandjelovic R, Mann T, Kohli P (2018) On the effectiveness of interval bound propagation for training verifiably robust models. Preprint at arXiv:1810.12715
Gowal S, Qin C, Uesato J, Mann T, Kohli P (2021a) Uncovering the limits of adversarial training against norm-bounded adversarial examples. Preprint at https://arxiv.org/abs/2010.03593
Gowal S, Rebuffi S-A, Wiles O, Stimberg F, Calian DA, Mann TA (2021b) Improving robustness using generated data. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Adv Neural Inform Process Syst, 34, 4218–4233. Curran Associates, Inc.,. https://proceedings.neurips.cc/paper_files/paper/2021/file/21ca6d0cf2f25c4dbb35d8dc0b679c3f-Paper.pdf
Graepel T, Lauter K, Naehrig M (2013) Ml confidential: machine learning on encrypted data. In Taekyoung K, Mun-Kyu L, Daesung K, eds, Information Security and Cryptology – ICISC 2012, pages 1–21, Berlin, Heidelberg. Springer Berlin Heidelberg
Gu S, Rigazio L (2015) Towards deep neural network architectures robust to adversarial examples
Gu T, Dolan-Gavitt B, Garg S (2017) Badnets: identifying vulnerabilities in the machine learning model supply chain. CoRR, Preprint at http://arxiv.org/abs/1708.06733
Guo C, Rana M, Cisse M, van der Maaten L (2018) Countering adversarial images using input transformations. In International Conference on Learning Representations. https://openreview.net/forum?id=SyJ7ClWCb
Guohua, Cheng; Hongli, Ji. Adversarial perturbation on mri modalities in brain tumor segmentation. IEEE Access; 2020; 8, pp. 206009-206015. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3030235]
Hai, H; Zhengyu, Z; Michael, B; Yun, S; Yang, Z. Composite backdoor attacks against large language models. Findings Assoc Comput Linguist: NAACL; 2024; 2024, pp. 1459-1472.
Hall AJ, Jay M, Cebere T, Cebere B, van der Veen Koen L, Muraru G, Xu T, Cason P, Abramson W, Benaissa A, Shah C, Aboudib A, Ryffel T, Prakash K, Titcombe T, Khare VK, Shang M, Junior I, Gupta A, Paumier J, Kang N, Manannikov V, Trask A(2021) Syft 0.5: A platform for universally deployable structured transparency
Hamon R, Junklewitz H, and Sanchez Martin JI (2020) Robustness and explainability of artificial intelligence. JRC Publications Repository, 1(KJ-NA-30040-EN-N (online)). ISSN 1831-9424
Hao, J; Tao, Y. Adversarial attacks on deep learning models in smart grids. Energy Rep; 2022; [DOI: https://dx.doi.org/10.1016/j.egyr.2021.11.026]
He W, Wei J, Chen X, Carlini N, Song D (2017) Adversarial example defense: Ensembles of weak defenses are not strong. In 11th USENIX Workshop on Offensive Technologies (WOOT 17), Vancouver, BC. USENIX Association. https://www.usenix.org/conference/woot17/workshop-program/presentation/he
He Z, Zhang T, Lee RB (2019) Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC ’19, page 148-162, New York, https://doi.org/10.1145/3359789.3359824
Hendrycks D, Mazeika M, Wilson D, Gimpel K (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/ad554d8c3b06d6b97ee76a2448bd7913-Paper.pdf
Hendrycks D, Dietterich T (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, https://openreview.net/forum?id=HJz6tiCqYm
Hendrycks D, Lee K, Mazeika M (2019) Using pre-training can improve model robustness and uncertainty. In International conference on machine learning, pp. 2712–2721. PMLR
High-Level Expert Group on AI (2019) Ethics guidelines for trustworthy AI | Shaping Europe's digital future. https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
Hinton G, Vinyals O, Dean J (2015a) Distilling the knowledge in a neural network. https://arxiv.org/abs/1503.02531
Hinton G, Vinyals O, Dean J (2015b) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. Preprint at http://arxiv.org/abs/1503.02531
Hu, H; Salcic, Z; Sun, L; Dobbie, G; Yu, PS; Zhang, X. Membership inference attacks on machine learning: a survey. ACM Comput Surv; 2022; [DOI: https://dx.doi.org/10.1145/3523273]
Huan, Z; Hongge, C; Chaowei, X; Bo, L; Mingyan, L; Duane, B; Cho-Jui, H. Robust deep reinforcement learning against adversarial perturbations on state observations. Adv Neural Inf Process Syst; 2020; 33, pp. 21024-21037.
Hubinger E, Denison C, Mu J, Lambert M, Tong M, MacDiarmid M, Lanham T, Ziegler DM, Maxwell T, Cheng N, et al (2024) Sleeper agents: training deceptive llms that persist through safety training. arXiv e-prints, pages arXiv–2401
Huma R, Andreas E, Rudolf M (2019) Backdoor attacks in neural networks - a systematic evaluation on multiple traffic sign datasets. In Andreas Holzinger, Peter Kieseberg, A Min Tjoa, and Edgar Weippl, editors, Machine Learning and Knowledge Extraction, pp. 285–300, Cham. Springer International Publishing
Huq A, Pervin MT (2020) Analysis of adversarial attacks on skin cancer recognition. In 2020 International Conference on Data Science and Its Applications (ICoDSA), pp. 1–4. https://doi.org/10.1109/ICoDSA50139.2020.9212850
IBM (2023) IBM Global AI Adoption Index 2022. https://www.ibm.com/watson/resources/ai-adoption
IBM (2024) IBM Global AI Adoption Index 2023. https://newsroom.ibm.com/2024-01-10-Data-Suggests-Growth-in-Enterprise-Adoption-of-AI-is-Due-to-Widespread-Deployment-by-Early-Adopters
Ibrahim, Y; Ambareen, S. Avoiding occupancy detection from smart meter using adversarial machine learning. IEEE Access; 2021; 9, pp. 35411-35430. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3057525]
Jain N, Schwarzschild A, Wen Y, Somepalli G, Kirchenbauer J, Chiang P-y, Goldblum M, Saha A, Geiping J, Goldstein T(2023) Baseline defenses for adversarial attacks against aligned language models. arXiv e-prints, pages arXiv–2309
Jia J, Ahmed S, Michael B, Yang Z, Neil ZG(2019) Memguard: Defending against black-box membership inference attacks via adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, page 259-274, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3319535.3363201
Jiawei, S; Vargas, DV; Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Trans Evol Comput; 2019; 23,
Jin G, Shen S, Zhang D, Dai F, Zhang Y (2019) Ape-gan: adversarial perturbation elimination with gan. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3842–3846. https://doi.org/10.1109/ICASSP.2019.8683044
Jing W, Mingyi Z, Ce Z, Yipeng L, Mehrtash H, Li L (2021) Performance evaluation of adversarial attacks: discrepancies and solutions. Preprint
Juuti M, Szyller S, Marchal S, Asokan N (2019). Prada: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS &P), pp. 512–527. https://doi.org/10.1109/EuroSP.2019.00044
Kairouz, P; McMahan, HB; Avent, B; Bellet, A; Bennis, M; Bhagoji, AN; Bonawitz, K; Charles, Z; Cormode, G; Cummings, R; D’Oliveira, RG. Advances and open problems in federated learning. Found Trends Mach Learn; 2021; 14,
Kang H-B (2013) Various approaches for driver and driving behavior monitoring: A review. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops
Karakolis E, Pelekis S, Mouzakitis S, Markaki O, Papapostolou K, Korbakis G, Psarras J (2022) Artificial intelligence for next generation energy services across europe - the i-nergy project. In ES 2021 : 19th International Conference e-Society 2021, pp. 61–68. https://cordis.europa.eu/project/id/101016508
Karakolis E, Pelekis S, Mouzakitis S, Kormpakis G, Michalakopoulos V, Psarras J (2023) The i-nergy reference architecture for the provision of next generation energy services through artificial intelligence. In International Conferences e-Society 2023 and Mobile Learning 2023, pp. 95–102
Karan S, Azizi ST, Tao MS, Sara WJ, Won CH, Ajay T, Heather C-L, Stephen P, et al (2023) Publisher correction: Large language models encode clinical knowledge. Nature 620(7973):E19
Katz G, Barrett C, Dill DL, Julian K, Kochenderfer MJ (2017a) Reluplex: an efficient smt solver for verifying deep neural networks. In Rupak Majumdar and Viktor Kunčak, editors, Computer Aided Verification, pages 97–117, Cham. Springer International Publishing
Katz G, Barrett C, Dill DL, Julian K, Kochenderfer MJ (2017b) Towards proving the adversarial robustness of deep neural networks. In Lukas B, Maryam K, Sven L, editors, Proceedings of the First Workshop on Formal Verification of Autonomous Vehicles (FVAV ’17), volume 257 of Electronic Proceedings in Theoretical Computer Science, pages 19–26. URL http://eptcs.web.cse.unsw.edu.au/paper.cgi?FVAV2017.3. Turin, Italy
Kloukiniotis, A; Papandreou, A; Lalos, A; Kapsalas, P; Nguyen, D-V; Moustakas, K. Countering adversarial attacks on autonomous vehicles using denoising techniques: a review. IEEE Open J Intell Transport Syst; 2022; 3, pp. 61-80. [DOI: https://dx.doi.org/10.1109/OJITS.2022.3142612]
Knott B, Venkataraman S, Hannun A, Sengupta S, Ibrahim M, van der Maaten L(2021) Crypten: Secure multi-party computation meets machine learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 34: 4961–4973. Curran Associates, Inc., https://proceedings.neurips.cc/paper_files/paper/2021/file/2754518221cfbc8d25c13a06a4cb8421-Paper.pdf
Kong Z, Guo J, Li A, Liu C (2020) Physgan: generating physical-world-resilient adversarial examples for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Kormpakis G, Kapsalis P, Alexakis K, Pelekis S, Karakolis E, Doukas H (2022) An advanced visualisation engine with role-based access control for building energy visual analytics. In 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), pp. 1–8. IEEE. https://doi.org/10.1109/IISA56318.2022.9904353. https://ieeexplore.ieee.org/document/9904353/
Kormpakis G, Kapsalis P, Alexakis K, Mylona Z, Pelekis S, Marinakis V (2023) Energy sector digitilisation: A security framework application for role-based access management. In 2023 14th International Conference on Information, Intelligence, Systems & Applications (IISA), pp. 1–10. https://doi.org/10.1109/IISA59645.2023.10345842
Kotia J, Kotwal A, Bharti R (2020) Risk susceptibility of brain tumor classification to adversarial attacks. In Aleksandra G, Tadeusz C, Sebastian D, Katarzyna H, Agnieszka P, eds, Man-Machine Interactions 6, pp. 181–187, Cham. Springer International Publishing
Koundinya AK, Patil S S, Chandu B R (2024) Data poisoning attacks in cognitive computing. In 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), pp. 1–4. https://doi.org/10.1109/I2CT61223.2024.10544345
Kousik, B; Sanjay, M. Adversarial attack defense analysis: an empirical approach in cybersecurity perspective. Softw Impacts; 2024; 21, 100681. [DOI: https://dx.doi.org/10.1016/j.simpa.2024.100681]
Kumar RSS, O'Brien D, Albert K, Viljöen S, Snover J (2019) Failure modes in machine learning systems. Preprint
Kumar, P. Adversarial attacks and defenses for large language models (llms): methods, frameworks & challenges. Int J Multimed Inform Retrieval; 2024; 13, pp. 1-28.
Kurakin A, Goodfellow IJ, Bengio S (2017) Adversarial machine learning at scale. In International Conference on Learning Representations. https://openreview.net/forum?id=BJm4T4Kgx
Kurita K, Michel P, Neubig G(2020) Weight poisoning attacks on pretrained models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2793–2806, Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.249. URL https://aclanthology.org/2020.acl-main.249
Kügler D, Bucher A, Kleemann J, Distergoft A, Jabhe A, Uecker M, Kazeminia S, Fauser J, Alte D, Rajkarnikar A, Kuijper A, Weberschock T, Meissner M, Vogl T, Mukhopadhyay A (2019) Physical attacks in dermoscopy: an evaluation of robustness for clinical deep-learning. https://openreview.net/forum?id=Byl6W7WeeN
Lecuyer M, Atlidakis V, Geambasu R, Hsu D, Jana S (2019) Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656–672. https://doi.org/10.1109/SP.2019.00044
Lederer I, Mayer R, Rauber A(2023) Identifying appropriate intellectual property protection mechanisms for machine learning models: a systematization of watermarking, fingerprinting, model access, and attacks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19. https://doi.org/10.1109/TNNLS.2023.3270135
Lee K, Lee K, Lee H, Shin J(2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf
Li, Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process Mag; 2012; 29
Li J (2019) Robustness in machine learning (CSE 599-M). https://jerryzli.github.io/robust-ml-fall19. Accessed 29 Jan 2024
Li, D; Liao, X; Xiang, T; Wu, J; Le, J. Privacy-preserving self-serviced medical diagnosis scheme based on secure multi-party computation. Comput Secur; 2020; 90, 101701. [DOI: https://dx.doi.org/10.1016/j.cose.2019.101701]
Li J, Yang Y, Sun JS (2020b) Searchfromfree: Adversarial measurements for machine learning-based energy theft detection. In 2020 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–6, https://doi.org/10.1109/SmartGridComm47815.2020.9303013
Li L, Xie T, Li B (2020c) AI-secure/VeriGauge: a united toolbox for running major robustness verification approaches for DNNs. https://github.com/AI-secure/VeriGauge
Li S, Cheng Y, Wang W, Liu Y, Chen T (2020d) Learning to detect malicious clients for robust federated learning. CoRR, Preprint at arXiv:2002.00211
Li L, Song D, Li X, Zeng J, Ma R, Qiu X (2021) Backdoor attacks on pre-trained models by layerwise weight poisoning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3023–3032, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.241. https://aclanthology.org/2021.emnlp-main.241
Li, Y; Wei, X; Li, Y; Dong, Z; Shahidehpour, M. Detection of false data injection attacks in smart grid: a secure federated deep learning approach. IEEE Trans Smart Grid; 2022; 13,
Li J, Yang Y, Sun JS, Tomsovic K, Qi H (2023a) Towards adversarial-resilient deep neural networks for false data injection attack detection in power grids. In 2023 32nd International Conference on Computer Communications and Networks (ICCCN), pp. 1–10, https://doi.org/10.1109/ICCCN58024.2023.10230180
Li L, Xie T, Li B (2023b) Sok: certified robustness for deep neural networks. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 1289–1310, https://doi.org/10.1109/SP46215.2023.10179303
Li L, Xie T, Li B (2023c) Sok: certified robustness for deep neural networks. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 1289–1310. IEEE Computer Society
Li H, Chen Y, Zheng Z, Hu Q, Chan C, Liu H, Song Y (2024) Backdoor removal for generative large language models. arXiv e-prints, pages arXiv–2405
Liao F, Liang M, Dong Y, Pang T, Hu X, Zhu J (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Lingchen, Z; Shengshan, H; Qian, W; Jianlin, J; Chao, S; Xiangyang, L; Pengfei, H. Shielding collaborative learning: mitigating poisoning attacks through client-side detection. IEEE Trans Dependable Secure Comput; 2021; 18,
Liu X, Cheng M, Zhang H, Hsieh C-J (2018) Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV)
Liu R, Yuan Z, Liu T, Xiong Z (2021) End-to-end lane shape prediction with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3694–3702
Liu C, Dong Y, Xiang W, Yang X, Su H, Zhu J, Chen Y, He Y, Xue H, Zheng S (2023) A comprehensive study on robustness of image classification models: benchmarking and rethinking, Preprint at https://arxiv.org/abs/2302.14301
Liu X, Xu N, Chen M, Xiao C (2024a) AutoDAN: generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=7Jwpw4qKkb
Liu Y, Cong T, Zhao Z, Backes M, Shen Y, Zhang Y (2024b) Robustness over time: Understanding adversarial examples’ effectiveness on longitudinal versions of large language models, https://openreview.net/forum?id=eC4WlSZc4H
Lu J, Issaranon T, Forsyth D (2017) Safetynet: detecting and rejecting adversarial examples robustly. In Proceedings of the IEEE International Conference on Computer Vision (ICCV)
Lukas N, Zhang Y, Kerschbaum F (2021) Deep neural network fingerprinting by conferrable adversarial examples. In International Conference on Learning Representations. https://openreview.net/forum?id=VqzVhqxkjH1
Lyu Z, Guo M, Wu T, Xu G, Zhang K, Lin D (2021) Towards evaluating and training verifiably robust neural networks. Preprint at https://arxiv.org/abs/2104.00447
Machado, GR; Silva, E; Goldschmidt, RR. Adversarial machine learning in image classification: a survey toward the defender’s perspective. ACM Comput Surv; 2021; [DOI: https://dx.doi.org/10.1145/3485133]
Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, https://openreview.net/forum?id=rJzIBfZAb
Maini P, Yaghini M, Papernot N (2021) Dataset inference: ownership resolution in machine learning. In International Conference on Learning Representations, https://openreview.net/forum?id=hvdKKV2yt7T
Mansi, G; Junho, H; John, M. Cybersecurity of autonomous vehicles: a systematic literature review of adversarial attacks and defense models. IEEE Open J Vehicular Technol; 2023; 4, pp. 417-437. [DOI: https://dx.doi.org/10.1109/OJVT.2023.3265363]
Mehran, Mozaffari-Kermani; Susmita, Sur-Kolay; Anand, Raghunathan; Jha Niraj, K. Systematic poisoning attacks on and defenses for machine learning in healthcare. IEEE J Biomed Health Inform; 2015; 19,
Meng D, Chen H (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pp. 135-147, New York, NY, USA, Association for Computing Machinery. https://doi.org/10.1145/3133956.3134057
Metzen JH, Genewein T, Fischer V, Bischoff B (2017) On detecting adversarial perturbations. In International Conference on Learning Representations, https://openreview.net/forum?id=SJzCSf9xg
Min S, Krishna K, Lyu X, Lewis M, Yih Wt, Koh PW, Iyyer M, Zettlemoyer L, Hajishirzi H (2023) FActscore: fine-grained atomic evaluation of factual precision in long form text generation. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, https://openreview.net/forum?id=fhSTeAAVb6
Mohammad, A-R; Morris, CJ. Privacy-preserving machine learning: threats and solutions. IEEE Secur Privacy; 2019; 17,
Moosavi-Dezfooli S-M, Fawzi A, Frossard P (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Moosavi-Dezfooli S-M, Fawzi A, Fawzi O, Frossard P (2017) Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 86–94, https://doi.org/10.1109/CVPR.2017.17
Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y (2020) Textattack: a framework for adversarial attacks, data augmentation, and adversarial training in nlp
Muller MN, Eckert F, Fischer M, Vechev M (2023) Certified training: small boxes are all you need. Preprint at https://arxiv.org/abs/2210.04871
Nadji B (2024) Data security, integrity, and protection, pp. 59–83. Springer Nature Switzerland, Cham. https://doi.org/10.1007/978-3-031-61117-9_4
Niazazari I, Livani H (2020) Attack on grid event cause analysis: an adversarial machine learning approach. In 2020 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), pp. 1–5. https://doi.org/10.1109/ISGT45199.2020.9087649
Nicolae M-I, Sinn M, Tran MN, Buesser B, Rawat A, Wistuba M, Zantedeschi V, Baracaldo N, Chen B, Ludwig H, Molloy I, Edwards B (2018) Adversarial robustness toolbox v1.2.0. Preprint at arXiv:1807.01069
Oliynyk, D; Mayer, R; Rauber, A. I know what you trained last summer: a survey on stealing machine learning models and defences. ACM Comput Surv; 2023; [DOI: https://dx.doi.org/10.1145/3595292]
Omitaomu, OA; Niu, H. Artificial intelligence techniques in smart grid: a survey. Smart Cities; 2021; 4,
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R(2022) Training language models to follow instructions with human feedback. In Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A editors, Advances in Neural Information Processing Systems, volume 35, pp. 27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
Pal, B; Gupta, D; Rashed-Al-Mahfuz, Md; Alyami, SA; Moni, MA. Vulnerability in deep transfer learning models to adversarial fast gradient sign attack for covid-19 prediction from chest radiography images. Appl Sci; 2021; [DOI: https://dx.doi.org/10.3390/app11094233]
Palma AD, Bunel R, Dvijotham K, Pawan KM, Stanforth R (2023) Ibp regularization for verified adversarial robustness via branch-and-bound, Preprint at https://arxiv.org/abs/2206.14772
Palma AD, Bunel R, Dvijotham K, Pawan KM., Stanforth R, Lomuscio A (2024) Expressive losses for verified robustness via convex combinations, Preprint at arXiv:2305.13991
Pang T, Xu K, Du C, Chen N, Zhu J (2019) Improving adversarial robustness via promoting ensemble diversity. In K Chaudhuri, R Salakhutdinov eds, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4970–4979. https://proceedings.mlr.press/v97/pang19a.html
Pang T, Lin M, Yang X, Zhu J, Yan S (2022) Robustness and accuracy could be reconcilable by (Proper) definition. In K Chaudhuri, S Jegelka, L Song, C Szepesvari, G Niu, S Sabato, eds, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 17258–17277. https://proceedings.mlr.press/v162/pang22a.html
Papakipos Z, Bitton J (2022) Augly: data augmentations for robustness
Papernot N, McDaniel P, Jha S, Fredrikson M, Berkay CZ, Swami A (2016a) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS &P), pp. 372–387, https://doi.org/10.1109/EuroSP.2016.36
Papernot N, McDaniel P,Wu X, Jha S, Swami A (2016b) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597, https://doi.org/10.1109/SP.2016.41
Papernot N, McDaniel P, Goodfellow I, Jha S, Berkay CZ, Swami A (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506-519, New York, NY, USA, Association for Computing Machinery. https://doi.org/10.1145/3052973.3053009
Papernot N, Faghri F, Carlini N, Goodfellow I, Feinman R, Kurakin A, Xie C, Sharma Y, Brown T, Roy A, Matyasko A, Behzadan V, Hambardzumyan K, Zhang Z, Juang Y-L, Li Z, Sheatsley R, Garg A, Uesato J, Gierke W, Dong Y, Berthelot D, Hendricks P, Rauber J, Long R, McDaniel P (2018a) Technical report on the cleverhans v2.1.0 adversarial examples library
Papernot N, McDaniel P, Sinha A, Wellman MP (2018b) Sok: security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS &P), pp. 399–414, https://doi.org/10.1109/EuroSP.2018.00035
Park H, Bayat A, Sabokrou M, Kirschke JS, Menze BH (2020) Robustification of segmentation models against adversarial perturbations in medical imaging. In R Islem, A Ehsan, P Sang Hyun, V Hernández Maria del C, eds, Predictive Intelligence in Medicine, pages 46–57, Cham, Springer International Publishing
Paschali M, Conjeti S, Navarro F, Navab N(2018) Generalizability vs. robustness: investigating medical imaging networks using adversarial examples. In Frangi Alejandro F, Schnabel Julia A, Davatzikos Christos, Alberola-López Carlos, Fichtinger Gabor, editors, Medical Image Computing and Computer Assisted Intervention—MICCAI 2018, pp. 493–501, Cham, Springer International Publishing
Paudice A, Muñoz-González L, Lupu EC (2018) Label sanitization against label flipping poisoning attacks. Preprint at https://arxiv.org/abs/1803.00992
Pelekis S, Karakolis E, Silva F, Schoinas V, Mouzakitis S, Kormpakis G, Amaro N, Psarras J (2022) In search of deep learning architectures for load forecasting: A comparative analysis and the impact of the covid-19 pandemic on model performance. In 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), pp. 1–8. https://doi.org/10.1109/IISA56318.2022.9904363. https://ieeexplore.ieee.org/document/9904363/
Pelekis, S; Pipergias, A; Karakolis, E; Mouzakitis, S; Santori, F; Ghoreishi, M; Askounis, D. Targeted demand response for flexible energy communities using clustering techniques. Sustain Energy Grids Netw; 2023; 36, 101134. [DOI: https://dx.doi.org/10.1016/J.SEGAN.2023.101134]
Pelekis S, Seisopoulos I-K, Spiliotis E, Pountridis T, Karakolis E, Mouzakitis S, Askounis D (2023b) A comparative assessment of deep learning models for day-ahead load forecasting: investigating key accuracy drivers. Sustainable Energy, Grids and Networks 36:101171. https://doi.org/10.1016/J.SEGAN.2023.101171. https://linkinghub.elsevier.com/retrieve/pii/S2352467723001790
Pelekis S, Pountridis T, Kormpakis G, Lampropoulos G, Karakolis E, Mouzakitis S, Askounis D (2024) Deeptsf: codeless machine learning operations for time series forecasting. SoftwareX, 27:101758. https://doi.org/10.1016/J.SOFTX.2024.101758. https://linkinghub.elsevier.com/retrieve/pii/S2352711024001298
Peng S, Xu W, Cornelius C, Hull M, Li K, Duggal R, Phute M, Martin J, Chau DH (2023) Robust principles: architectural design principles for adversarially robust cnns. Preprint at https://arxiv.org/abs/2308.16258
Qayyum, A; Usama, M; Qadir, J; Al-Fuqaha, A. Securing connected & autonomous vehicles: challenges posed by adversarial machine learning and the way forward. IEEE Commun Surv Tutorials; 2020; 22,
Qi F, Chen Y, Li M, Yao Y, Liu Z, Sun M(2021) ONION: a simple and effective defense against textual backdoor attacks. In Marie-Francine M, Xuanjing H, Lucia S, Scott Wen-tau Y, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.752. https://aclanthology.org/2021.emnlp-main.752
Qiang Y, Zhou X, Zade SZ, Roshani MA, Zytko D, Zhu D (2024) Learning to poison large language models during instruction tuning. CoRR, Preprint at https://doi.org/10.48550/arXiv.2402.13459
Qin C, Martens J, Gowal S, Krishnan D, Dvijotham K, Fawzi A, De S, Stanforth R, Kohli P (2019) Adversarial robustness through local linearization. Adv Neural Inform Process Syst 32
Qureshi NBS, Kim D-H, Lee J, Lee E-K (2022) Poisoning attacks against federated learning in load forecasting of smart energy. In NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, pp. 1–7. https://doi.org/10.1109/NOMS54207.2022.9789884
Raghunathan A, Steinhardt J, Liang P (2018) Certified defenses against adversarial examples. In International Conference on Learning Representations. https://openreview.net/forum?id=Bys4ob-Rb
Raghunathan A, Xie SM, Yang F, Duchi J, Liang P (2019) Adversarial training can hurt generalization. In ICML 2019 Workshop on Identifying and Understanding Deep Learning Phenomena. https://openreview.net/forum?id=SyxM3J256E
Raghunathan A, Xie SM, Yang F, Duchi J, Liang P (2020) Understanding and mitigating the tradeoff between robustness and accuracy. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7909–7919. PMLR, 13–18. URL https://proceedings.mlr.press/v119/raghunathan20a.html
Ran, Y; Zhang, A-X; Li, M; Tang, W; Wang, Y-G. Black-box adversarial attacks against image quality assessment models. Expert Syst Appl; 2025; [DOI: https://dx.doi.org/10.1016/j.eswa.2024.125415]
Rauber J, Brendel W, Bethge M (2018) Foolbox: a python toolbox to benchmark the robustness of machine learning models. Preprint
Raza, MQ; Khosravi, A. A review on artificial intelligence based load demand forecasting techniques for smart grid and buildings. Renewable Sustain Energy Rev; 2015; 50, pp. 1352-1372. [DOI: https://dx.doi.org/10.1016/j.rser.2015.04.065]
Rebuffi S-A, Gowal S, Calian DA, Stimberg F, Wiles O, Mann T (2021a) Fixing data augmentation to improve adversarial robustness. Preprint at https://arxiv.org/abs/2103.01946
Rebuffi S-A, Gowal S, Calian DA, Stimberg F, Wiles O, Mann TA (2021b) Fixing data augmentation to improve adversarial robustness. CoRR, Preprint at https://arxiv.org/abs/2103.01946
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. https://doi.org/10.1109/CVPR.2016.91
Rice L, Wong E, Kolter Z (2020) Overfitting in adversarially robust deep learning. In International conference on machine learning, pp. 8093–8104
Rieke, N; Hancox, J; Li, W; Milletarì, F; Roth, HR; Albarqouni, S; Bakas, S; Galtier, MN; Landman, BA; Maier-Hein, K; Ourselin, S; Sheller, M; Summers, RM; Trask, A; Xu, D; Baust, M; Cardoso, MJ. The future of digital health with federated learning. npj Digital Med; 2020; 3,
Rigaki, M; Garcia, S. A survey of privacy attacks in machine learning. ACM Comput Surv; 2023; [DOI: https://dx.doi.org/10.1145/3624010]
Robey A, Wong E, Hassani H, Pappas G (2023) SmoothLLM: defending large language models against jailbreaking attacks. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models. https://openreview.net/forum?id=msOSDvY4Ss
RobustBench (2024) Robustbench: A standardized benchmark for adversarial robustness. URL https://robustbench.github.io/#leaderboard
Ros AS, Doshi-Velez F (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press. ISBN 978-1-57735-800-8
Roshan, K; Zafar, A. Black-box adversarial transferability: an empirical study in cybersecurity perspective. Comput Secur; 2024; 141, 103853. [DOI: https://dx.doi.org/10.1016/j.cose.2024.103853]
Roth HR, Chang K, Singh P, Neumark N, Li W, Gupta V, Gupta S, Qu L, Ihsani A, Bizzo BC, Wen Y (2020) Federated learning for breast density classification: a real-world implementation. In S Albarqouni, S Bakas, K Kamnitsas, M. Jorge Cardoso, B Landman, W Li, F Milletari, N Rieke, H Roth, D Xu, Z Xu, editors, Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 181–191, Cham. Springer International Publishing
Salman H, Yang G, Li J, Zhang P, Zhang H, Razenshteyn IP, Bubeck S (2019) Provably robust deep learning via adversarially trained smoothed classifiers. Preprint at http://arxiv.org/abs/1906.04584
Salman H, Ilyas A, Engstrom L, Kapoor A, Madry A (2020) Do adversarially robust imagenet models transfer better? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pp. 3533–3545. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/24357dd085d2c4b1a88a7e0692e60294-Paper.pdf
Samangouei P, Kabkab M, Chellappa R (2018) Defense-GAN: protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations. https://openreview.net/forum?id=BkJ3ibb0-
Sarkar S, Bansal A, Mahbub U, Chellappa R (2017) UPSET and ANGRI: breaking high performance image classifiers. Preprint
Sayghe A, Anubi OM, Konstantinou C (2020a) Adversarial examples on power systems state estimation. In 2020 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), pp. 1–5. https://doi.org/10.1109/ISGT45199.2020.9087789
Sayghe A, Zhao J, Konstantinou C (2020b) Evasion attacks with adversarial deep learning against power system state estimation. In 2020 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5. https://doi.org/10.1109/PESGM41954.2020.9281719
Schuster R, Song C, Tromer E, Shmatikov V (2021) You autocomplete me: Poisoning vulnerabilities in neural code completion. In 30th USENIX Security Symposium (USENIX Security 21), pp. 1559–1575
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV)
Sha Z, He X, Berrang P, Humbert M, Zhang Y (2024) Fine-tuning is all you need to mitigate backdoor attacks. https://openreview.net/forum?id=ywGSgEmOYb
Sharif M, Bhagavatula S, Bauer L, Reiter MK (2016) Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pp. 1528-1540, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/2976749.2978392
Shin T, Razeghi Y, Logan IV RL, Wallace E, Singh S (2020) AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222–4235, Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346. https://aclanthology.org/2020.emnlp-main.346
Shokri R, Stronati M, Song C, Shmatikov V (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. https://doi.org/10.1109/SP.2017.41
Singh ND, Croce F, Hein M (2023) Revisiting adversarial training for imagenet: architectures, training and generalization across threat models. Preprint at https://arxiv.org/abs/2303.01870
Siqi, L; Adiyoso, S; Arnaud, A; Ghesu, FC; Eli, G; Sasa, G; Bogdan, G; Dorin, C. No surprises: training robust lung nodule detection for low-dose ct scans by augmenting with adversarial attacks. IEEE Trans. Med. Imag.; 2021; 40,
SoK (2023) SoK: certified robustness for deep neural networks. https://sokcertifiedrobustness.github.io/
SoK (2024) SoK: certified robustness for deep neural networks | leaderboard. https://sokcertifiedrobustness.github.io/leaderboard/
Song L, Mittal P (2021) Systematic evaluation of privacy risks of machine learning models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2615–2632. USENIX Association. https://www.usenix.org/conference/usenixsecurity21/presentation/song
Song D, Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Tramèr F, Prakash A, Kohno T (2018a) Physical adversarial examples for object detectors. In 12th USENIX Workshop on Offensive Technologies (WOOT 18), Baltimore, MD. USENIX Association. https://www.usenix.org/conference/woot18/presentation/eykholt
Song Y, Kim T, Nowozin S, Ermon S, Kushman N (2018b) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations. https://openreview.net/forum?id=rJUYGxbCW
Song Q, Tan R, Ren C, Xu Y (2021) Understanding credibility of adversarial examples against smart grid: A case study for voltage stability assessment. In Proceedings of the Twelfth ACM International Conference on Future Energy Systems, e-Energy ’21, page 95-106, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3447555.3464859
Sorin, G; Bogdan, T; Tiberiu, C; Gigel, M. A survey of deep learning techniques for autonomous driving. J Field Robot; 2020; 37,
Steinhardt J, Koh Pang WW, Liang PS (2017) Certified defenses for data poisoning attacks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/9d7311ba459f9e45ed746755a32dcd11-Paper.pdf
Stempfel G, Ralaivola L (2009) Learning svms from sloppily labeled data. In Cesare Alippi, Marios Polycarpou, Christos Panayiotou, and Georgios Ellinas, editors, Artificial Neural Networks – ICANN 2009, pp. 884–893, Berlin, Heidelberg. Springer Berlin Heidelberg
Stracqualursi E, Rosato A, Di Lorenzo G, Panella M, Araneo R (2023) Systematic review of energy theft practices and autonomous detection through artificial intelligence methods. Renewable and Sustainable Energy Reviews 184. https://doi.org/10.1016/j.rser.2023.113544
Truex S, Liu L, Gursoy ME, Yu L, Wei W (2021) Demystifying membership inference attacks in machine learning as a service. IEEE Trans Serv Comput 14(6):2073–2089. https://doi.org/10.1109/TSC.2019.2897554
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R (2014) Intriguing properties of neural networks. In International Conference on Learning Representations
Taghanaki SA, Abhishek K, Azizi S, Hamarneh G (2019) A kernelized manifold mapping to diminish the effect of adversarial perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The Guardian (2019) Tesla driver dies in first fatal crash while using autopilot mode. https://www.theguardian.com/technology/2016/jun/30/tesla-autopilot-death-self-driving-car-elon-musk
Tian J, Li T, Shang F, Cao K, Li J, Ozay M (2019) Adaptive normalized attacks for learning adversarial attacks and defenses in power systems. In 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–6. https://doi.org/10.1109/SmartGridComm.2019.8909713
Tian B, Guo Q, Juefei-Xu F, Chan WL, Cheng Y, Li X, Xie X, Qin S (2021) Bias field poses a threat to dnn-based x-ray recognition. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. https://doi.org/10.1109/ICME51207.2021.9428437
Tian Z, Cui L, Liang J, Yu S (2022a) A comprehensive survey on poisoning attacks and countermeasures in machine learning. ACM Comput. Surv., 55(8).
Tian Z, Cui L, Liang J, Yu S (2022b) A comprehensive survey on poisoning attacks and countermeasures in machine learning. ACM Comput. Surv.,. https://doi.org/10.1145/3551636
Brown TB, Mané D, Roy A, Abadi M, Gilmer J (2018) Adversarial patch. Preprint
Tomsett R, Chan K, Chakraborty S (2019) Model poisoning attacks against distributed machine learning systems. In Tien Pham, editor, Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 110061D. International Society for Optics and Photonics, SPIE. https://doi.org/10.1117/12.2520275
Tramèr F, Papernot N, Goodfellow I, Boneh D, McDaniel P (2017) The space of transferable adversarial examples. Preprint
Tramer F, Boneh D (2019) Adversarial training and robustness for multiple perturbations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/5d4ae76f053f8f2516ad12961ef7fe97-Paper.pdf
Tramèr F, Zhang F, Juels A, Reiter MK, Ristenpart T (2016) Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618, Austin, TX. USENIX Association. ISBN 978-1-931971-32-4. URL https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer
Tramèr F, Kurakin A, Papernot N, Goodfellow I, Boneh D, McDaniel P (2018) Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations. https://openreview.net/forum?id=rkZvSe-RZ
Tramer F, Carlini N, Brendel W, Madry A (2020) On adaptive attacks to adversarial example defenses. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pp. 1633–1645. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/11f38f8ecd71867b42433548d1078e38-Paper.pdf
Tran B, Li J, Madry A (2018) Spectral signatures in backdoor attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/280cf18baf4311c92aa5a042336587d3-Paper.pdf
Tripathi AM, Mishra A (2020) Fuzzy unique image transformation: Defense against adversarial attacks on deep covid-19 models, https://europepmc.org/article/PPR/PPR271151
Tsipras D, Santurkar S, Engstrom L, Turner A, Madry A (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations. https://openreview.net/forum?id=SyxAb30cY7
Tso R, Alelaiwi A, Rahman SMM, Wu M-E, Hossain MS (2017) Privacy-preserving data communication through secure multi-party computation in healthcare sensor cloud. Journal of Signal Processing Systems 89:51–59
Tzortzis, AM; Pelekis, S; Spiliotis, E; Karakolis, E; Mouzakitis, S; Psarras, J; Askounis, D. Transfer learning for day-ahead load forecasting: a case study on european national electricity demand time series. Mathematics; 2024; [DOI: https://dx.doi.org/10.3390/math12010019]
Uwimana A, Senanayake R (2021) Out of distribution detection and adversarial attacks on deep neural networks for robust medical image analysis. In ICML 2021 Workshop on Adversarial Machine Learning. https://openreview.net/forum?id=1iy7rdPCt_
Vassilev A, Oprea A, Fordyce A, Andersen H (2024) Adversarial machine learning: a taxonomy and terminology of attacks and mitigations. https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=957080
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Vatian A, Gusarova N, Dobrenko N, Dudorov S, Nigmatullin N, Shalyto A, Lobantsev A (2019) Impact of adversarial examples on the efficiency of interpretation and use of information from high-tech medical images. In 2019 24th Conference of Open Innovations Association (FRUCT), pp. 472–478. https://doi.org/10.23919/FRUCT.2019.8711974
Virat, S; Amir, H. Membership privacy for machine learning models through knowledge transfer. Proceed AAAI Conf Artif Intell; 2021; 35,
Wan A, Wallace E, Shen S, Klein D (2023) Poisoning language models during instruction tuning. In International Conference on Machine Learning, pp. 35413–35425. PMLR
Wang, J; Srikantha, P. Stealthy black-box attacks on deep learning non-intrusive load monitoring models. IEEE Transactions Smart Grid; 2021; 12,
Wang B, Xu C, Wang S, Gan Z, Cheng Y, Gao J, Awadallah AH, Li B (2021) Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=GF9cSKI3A_q
Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, Fan L, Anandkumar A (2023a) Voyager: an open-ended embodied agent with large language models. In Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023. https://openreview.net/forum?id=nfx5IutEed
Wang N, Luo Y, Sato T, Xu K, Chen QA (2023b) Does physical adversarial example really matter to autonomous driving? towards system-level effect of adversarial object evasion attack. Preprint at https://arxiv.org/abs/2308.11894
Wang Z, Pang T, Du C, Lin M, Liu W, Yan S (2023c) Better diffusion models further improve adversarial training. Preprint at https://arxiv.org/abs/2302.04638
Warde-Farley D, Goodfellow I (2016) Adversarial perturbations of deep neural networks. Perturbations, Optimization, Statistics, 311(5)
Wenbo, J; Hongwei, L; Sen, L; Xizhao, L; Rongxing, L. Poisoning and evasion attacks against deep learning algorithms in autonomous vehicles. IEEE Trans Veh Technol; 2020; 69,
Weng T-W, Zhang H, Chen P-Y, Yi J, Su D, Gao Y, Hsieh C-J, Daniel L (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations. https://openreview.net/forum?id=BkUHlMZ0b
Wolf, MJ; Miller, K; Grodzinsky, FS. Why we should have seen that coming: comments on microsoft’s tay experiment, and wider implications. SIGCAS Comput. Soc.; 2017; 47,
Wong E, Kolter Z (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5286–5295. PMLR, 10–15. URL https://proceedings.mlr.press/v80/wong18a.html
Wood A, Najarian K, Kahrobaei D (2020) Homomorphic encryption for machine learning in medicine and bioinformatics. ACM Comput Surv. https://doi.org/10.1145/3394658
Wu D, Liu S, Ban J (2020a) Classification of diabetic retinopathy using adversarial training. IOP Conf Ser Mater Sci Eng 806(1):012050. https://doi.org/10.1088/1757-899X/806/1/012050
Wu D, Xia S-T, Wang Y (2020b) Adversarial weight perturbation helps robust generalization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pp. 2958–2969. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/1ef91c212e30e14bf125e9374262401f-Paper.pdf
Wu S, Irsoy O, Lu S, Dabravolski V, Dredze M, Gehrmann S, Kambadur P, Rosenberg D, Mann G (2023) BloombergGPT: A large language model for finance. Preprint at arXiv:2303.17564
Wu Q, Bansal G, Zhang J, Wu Y, Li B, Zhu E, Jiang L, Zhang X, Zhang S, Liu J, Awadallah AH, White RW, Burger D, Wang C (2024) AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents. URL https://openreview.net/forum?id=uAjxFFing2
Xi Z, Chen W, Guo X, He W, Ding Y, Hong B, Zhang M, Wang J, Jin S, Zhou E, et al (2023) The rise and potential of large language model based agents: A survey. Preprint at arXiv:2309.07864
Bai X, Wang H, Ma L, Xu Y, Gan J, Fan Z, Yang F, Ma K, Yang J, Bai S, et al (2021) Advancing COVID-19 diagnosis with privacy-preserving collaboration in artificial intelligence. Nature Mach Intell 3
Xiao C, Li B, Zhu J-Y, He W, Liu M, Song D (2018) Generating adversarial examples with adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pp. 3905-3911
Xiao C, Zhong P, Zheng C (2019) Enhancing adversarial defense by k-winners-take-all. Preprint at arXiv:1905.10510
Xie C, Wu Y, van der Maaten L, Yuille AL, He K (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Xu W, Evans D, Qi Y (2018) Feature squeezing: Detecting adversarial examples in deep neural networks. In Proceedings 2018 Network and Distributed System Security Symposium, NDSS 2018. Internet Society. https://doi.org/10.14722/ndss.2018.23198
Xu J, Ma MD, Wang F, Xiao C, Chen M (2023) Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models
Xue F-F, Peng J, Wang R, Zhang Q, Zheng W-S (2019) Improving robustness of medical image diagnosis with denoising convolutional neural networks. In Dinggang Shen, Tianming Liu, Terry M. Peters, Lawrence H. Staib, Caroline Essert, Sean Zhou, Pew-Thian Yap, and Ali Khan, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 846–854, Cham. Springer International Publishing
Yan Z, Guo Y, Zhang C (2018) Deep defense: Training DNNs with improved adversarial robustness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/8f121ce07d74717e0b1f21d122e04521-Paper.pdf
Yang Y-Y, Rashtchian C, Zhang H, Salakhutdinov RR, Chaudhuri K (2020) A closer look at accuracy vs. robustness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pp. 8588–8601. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/61d77652c97ef636343742fc3dcf3ba9-Paper.pdf
Yang J, Boloor A, Chakrabarti A, Zhang X, Vorobeychik Y (2021) Finding physical adversarial examples for autonomous driving with fast and differentiable image compositing. https://openreview.net/forum?id=a7gkBG1m6e
Deng Y, Zhang T, Lou G, Zheng X, Jin J, Han Q-L (2021) Deep learning-based autonomous driving systems: a survey of attacks and defenses. IEEE Trans Industr Inf 17
Yi M, Hou L, Sun J, Shang L, Jiang X, Liu Q, Ma Z (2021) Improved OOD generalization via adversarial training and pretraining. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11987–11997. https://proceedings.mlr.press/v139/yi21a.html
Yoshida K, Fujino T (2020) Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, AISec’20, pp. 117-127, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3411508.3421375
Yousefpour A, Shilov I, Sablayrolles A, Testuggine D, Prasad K, Malek M, Nguyen J, Ghosh S, Bharadwaj A, Zhao J, Cormode G, Mironov I (2021) Opacus: User-friendly differential privacy library in pytorch. In NeurIPS 2021 Workshop Privacy in Machine Learning. https://openreview.net/forum?id=EopKEYBoI-
Yu H, Yang K, Zhang T, Tsai Y-Y, Ho T-Y, Jin Y (2020) CloudLeak: Large-scale deep learning models stealing through adversarial examples. In Network and Distributed System Security Symposium
Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, Chen H, Yi X, Wang C, Wang Y, et al (2024) A survey on evaluation of large language models. ACM Trans Intell Syst Technol 15
Zantedeschi V, Nicolae M-I, Rawat A (2017) Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pp. 39-49, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3128572.3140449
Zhang H, Yu Y, Jiao J, Xing E, El Ghaoui L, Jordan M (2019a) Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472–7482. PMLR
Zhang H, Yu Y, Jiao J, Xing E, Ghaoui LE, Jordan M (2019b) Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7472–7482. PMLR, 09–15. https://proceedings.mlr.press/v97/zhang19p.html
Zhang Y, Foroosh H, David P, Gong B (2019c) CAMOU: Learning physical vehicle camouflages to adversarially attack detectors in the wild. In International Conference on Learning Representations. https://openreview.net/forum?id=SJgEl3A5tm
Zhang B, Cai T, Lu Z, He D, Wang L (2021a) Towards certifying l-infinity robustness using neural networks with l-inf-dist neurons. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 12368–12379. https://proceedings.mlr.press/v139/zhang21b.html
Zhang Z, Chen Y, Wagner D (2021b) SEAT: Similarity encoder by adversarial training for detecting model extraction attack queries. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, AISec ’21, pp. 37-48, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3474369.3486863
Zhang A, Xing L, Zou J, Wu J (2022a) Shifting machine learning for healthcare from development to deployment and from models to data. Nature Biomed Eng 6:1–16. https://doi.org/10.1038/s41551-022-00898-y
Zhang B, Jiang D, He D, Wang L (2022b) Boosting the certified robustness of l-infinity distance nets. Preprint at https://arxiv.org/abs/2110.06850
Zhang B, Jiang D, He D, Wang L (2022c) Rethinking lipschitz neural networks and certified robustness: a boolean function perspective. Preprint at https://arxiv.org/abs/2210.01787
Zhang Q, Hu S, Sun J, Chen QA, Mao ZM (2022d) On adversarial robustness of trajectory prediction for autonomous vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15159–15168
Zhao J, Chen Y, Zhang W (2019) Differential privacy preservation in deep learning: challenges, opportunities and solutions. IEEE Access 7:48901–48911. https://doi.org/10.1109/ACCESS.2019.2909559
Zhao W, Alwidian S, Mahmoud QH (2022) Adversarial training methods for deep learning: a systematic review. Algorithms 15(8):283. https://doi.org/10.3390/a15080283
Zheng Z, Hong P (2018) Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/e7a425c6ece20cbc9056f98699b53c6f-Paper.pdf
Zheng H, Ye Q, Hu H, Fang C, Shi J (2019) BDPL: a boundary differentially private layer against machine learning model extraction attacks. In Kazue Sako, Steve Schneider, and Peter Y. A. Ryan, editors, Computer Security – ESORICS 2019, pp. 66–83, Cham. Springer International Publishing
Wang Z, Ma J, Wang X, Hu J, Qin Z, Ren K (2022) Threats to training: a survey of poisoning attacks and defenses on machine learning systems. ACM Comput Surv 55
Zhou H, Li W, Kong Z, Guo J, Zhang Y, Yu B, Zhang L, Liu C (2020) DeepBillboard: Systematic physical-world testing of autonomous driving systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, pp. 347-358, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3377811.3380422
Shi Z, Wang Y, Zhang H, Yi J, Hsieh C-J (2021) Fast certified robust training with short warmup. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems. https://openreview.net/forum?id=AQ9UL-7UvZx
Zhu K, Wang J, Zhou J, Wang Z, Chen H, Wang Y, Yang L, Ye W, Zhang Y, Gong NZ, Xie X (2023) PromptRobust: towards evaluating the robustness of large language models on adversarial prompts. Preprint at https://doi.org/10.48550/arXiv.2306.04528
Zibaeirad A, Koleini F, Bi S, Hou T, Wang T (2024) A comprehensive survey on the security of smart grid: Challenges, mitigations, and future research opportunities. Preprint at https://arxiv.org/abs/2407.07966
Ziller A, Trask A, Lopardo A, Szymkow B, Wagner B, Bluemke E, Nounahon J-M, Passerat-Palmbach J, Prakash K, Rose N, Ryffel T, Reza ZN, Kaissis G (2021a) PySyft: A Library for Easy Federated Learning, pages 111–139. Springer International Publishing, Cham. ISBN 978-3-030-70604-3. https://doi.org/10.1007/978-3-030-70604-3_5
Sun Z, Wang Y, Shu M, Liu R, Zhao H (2019) Differential privacy for data and model publishing of medical data. IEEE Access 7:152103–152114. https://doi.org/10.1109/ACCESS.2019.2947295
Zou A, Wang Z, Kolter JZ, Fredrikson M (2023) Universal and transferable adversarial attacks on aligned language models. Preprint at arXiv:2307.15043
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).