Splunk Inc.

03/26/2024 | News release | Archived content

Adversarial Machine Learning & Attacks on AIs

There's a catch to Artificial Intelligence: it is vulnerable to adversarial attacks.

Any AI model can be reverse engineered and manipulated due to inherent limitations in its algorithms and training process. Improving the robustness and security of AI is key if the technology is to live up to the hype fueled by generative AI tools such as ChatGPT.

Enterprise organizations are rapidly adopting advanced generative AI agents for a wide range of business applications.

In this article, we will discuss how both the neural network training process and modern machine learning algorithms are vulnerable to adversarial attacks.

Defining adversarial ML

Adversarial machine learning (ML) is the umbrella term for any technique that misleads a neural network model or its training process in order to produce a malicious outcome.

From a cybersecurity perspective, adversarial AI can be considered a cyberattack vector. Adversarial techniques can be executed at several stages of the model lifecycle:

  • During training

  • In the testing stage

  • When the model is deployed


How neural networks train

Consider the general training process of a neural network model. It involves feeding input data to a set of interconnected layers representing mathematical equations. The parameters of these equations are updated iteratively during the training process such that an input correctly maps to its true output.

Once the model is trained on adequate data, it is evaluated on previously unseen test data. No training takes place at this stage; the model's performance is simply measured.
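
As a concrete illustration, here is a minimal sketch of that train-then-evaluate loop in PyTorch. The synthetic two-feature dataset, the tiny network and the hyperparameters are illustrative assumptions, not a reference implementation:

    import torch
    import torch.nn as nn

    # Synthetic data: two-feature inputs with binary labels (illustrative only)
    X_train, y_train = torch.randn(800, 2), torch.randint(0, 2, (800,))
    X_test,  y_test  = torch.randn(200, 2), torch.randint(0, 2, (200,))

    # A small stack of interconnected layers whose parameters will be learned
    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # Training: parameters are updated iteratively so inputs map to their labels
    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()

    # Evaluation on previously unseen data: no parameter updates happen here
    with torch.no_grad():
        accuracy = (model(X_test).argmax(dim=1) == y_test).float().mean().item()
    print(f"test accuracy: {accuracy:.2f}")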

Adversarial ML attack during the training stage

An adversarial ML attack during the training stage involves modifying the input data, its features, or the corresponding output labels.

Problem: Manipulating training data distributions

A model trained on sufficient data can capture the underlying data distribution with high accuracy. That training data may be drawn from a complex set of distributions.

An adversarial machine learning attack can be executed by manipulating the training data so that it only partially or incorrectly captures the behavior of this underlying distribution. For example, the training data may be made insufficiently diverse, or it may be altered or deleted.
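
A minimal sketch of what such a manipulation might look like, assuming a synthetic two-feature dataset and an attacker who can curate which samples make it into the training set:

    import torch

    # Clean data: the true rule is simply the sign of the first feature
    X = torch.randn(1000, 2)
    y = (X[:, 0] > 0).long()

    # Adversarial curation: keep only ~2% of the positive-class examples,
    # so the training set no longer reflects the true distribution
    keep = (y == 0) | (torch.rand(len(y)) < 0.02)
    X_skewed, y_skewed = X[keep], y[keep]
    print(f"positives after manipulation: {y_skewed.float().mean().item():.2%}")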

Problem: Altering training labels

The training labels may be intentionally altered during the training stage. Normally, the model weights or parameters are nudged along a trajectory toward a stable decision boundary.

When the output classes, categories or labels of the input data are altered, the trained weights converge toward the wrong decision boundary and therefore produce incorrect results.
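
A sketch of a simple label-flipping attack, assuming the attacker can silently invert a fraction of the training labels before training begins (the 15% budget here is illustrative):

    import torch

    y_train = torch.randint(0, 2, (1000,))          # original labels
    flip_rate = 0.15                                # assumed attacker budget

    # Silently invert a random subset of labels before training begins
    flip_mask = torch.rand(len(y_train)) < flip_rate
    y_poisoned = y_train.clone()
    y_poisoned[flip_mask] = 1 - y_poisoned[flip_mask]
    print(f"{flip_mask.sum().item()} of {len(y_train)} labels flipped")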

Problem: Injecting bad data

Incorrect and malicious samples may be injected into the training data. This can subtly shift the decision boundary so that the evaluation metrics remain within acceptable performance thresholds while the classification of specific inputs is significantly altered.
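
A sketch of this kind of injection, assuming the true decision rule depends on the sign of the first feature and the attacker appends a small batch of deliberately mislabeled points near the boundary:

    import torch

    # Clean data: the true decision rule is the sign of the first feature
    X_train = torch.randn(1000, 2)
    y_train = (X_train[:, 0] > 0).long()

    # Attacker-crafted points: placed just inside the positive region but
    # labeled negative, nudging the learned boundary while overall metrics
    # on clean test data can still look acceptable
    X_bad = torch.rand(30, 2) * torch.tensor([0.2, 2.0])   # first feature in [0, 0.2]
    y_bad = torch.zeros(30, dtype=torch.long)

    X_mixed = torch.cat([X_train, X_bad])
    y_mixed = torch.cat([y_train, y_bad])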

Adversarial impact on black box systems

Another important class of adversarial attack exploits an inherent problem in AI systems: most AI models are black boxes.

Black-box AI systems are highly nonlinear and therefore exhibit high sensitivity and instability. These models are developed from a set of inputs and their corresponding outputs. We do not (and cannot) know the inner workings of the system; we only observe that the model maps an input to its expected output.

White-box systems, on the other hand, are fully interpretable. We can understand how the model behaves, and we have access to the model parameters along with a complete understanding of their impact on the system's behavior.

Black-box system attacks

Adversaries cannot obtain knowledge of the model underlying a black-box AI system. However, they can use synthetic data that closely resembles the system's inputs and outputs to train a substitute model that emulates the behavior of the target model. This works because of the transferability property of AI models.

Transferability is the phenomenon whereby an adversary can construct adversarial data samples that exploit a model M1 using only knowledge of another model M2, as long as M2 can sufficiently perform the tasks that M1 is designed for.
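
The sketch below illustrates the idea, assuming the target is only reachable through a hypothetical target_predict function that returns labels; the attacker queries it, trains a local substitute, and can then craft attacks against the copy:

    import torch
    import torch.nn as nn

    def target_predict(x: torch.Tensor) -> torch.Tensor:
        """Stand-in for the black-box target: the attacker only sees labels."""
        return (x[:, 0] + 0.5 * x[:, 1] > 0).long()

    # 1. Query the target with synthetic inputs that resemble real ones
    X_query = torch.randn(2000, 2)
    y_query = target_predict(X_query)

    # 2. Train a local substitute model on the collected (input, label) pairs
    substitute = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.Adam(substitute.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(200):
        opt.zero_grad()
        loss_fn(substitute(X_query), y_query).backward()
        opt.step()

    # 3. Adversarial examples crafted against the substitute frequently
    #    transfer to the original target (the transferability property)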

White-box system attacks

In a white-box AI attack, adversaries have knowledge of the target model, including:

  • Its parameters

  • The algorithms used to train the model


A popular example involves adding small perturbations to the input data so that the model produces an incorrect output with high confidence. These perturbations reflect worst-case scenarios that exploit the sensitivity and nonlinear behavior of the neural network, which is then pushed into an incorrect decision class.
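
One widely cited instance of this idea is the fast gradient sign method (FGSM). The sketch below assumes an untrained toy model and an illustrative perturbation budget epsilon; it simply steps the input in the direction that increases the loss:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(1, 2, requires_grad=True)   # the input to attack
    y_true = torch.tensor([1])                  # its correct label
    epsilon = 0.1                               # assumed perturbation budget

    # Compute the gradient of the loss with respect to the input itself
    loss = loss_fn(model(x), y_true)
    loss.backward()

    # Step the input in the direction that increases the loss
    x_adv = (x + epsilon * x.grad.sign()).detach()
    print(model(x_adv).argmax(dim=1))           # may now differ from y_true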

How to build robust AI systems

The same concepts of constructing adversarial examples and training on them can also be used to improve the robustness of an AI system. Adversarial examples can act as a regularizer during model training, imposing constraints that harden the model against extreme-case inputs designed to force a misclassification.

Adversarial training augments the training data so that, during the training process, the model is already exposed to a distribution of adversarial examples, including the kinds of perturbed inputs that would otherwise be used to exploit the model's vulnerabilities.
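
A minimal sketch of adversarial training, assuming FGSM-style perturbations and a synthetic dataset; each batch is augmented with perturbed copies of itself so the model trains on both:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    epsilon = 0.1                                # assumed perturbation budget

    X, y = torch.randn(512, 2), torch.randint(0, 2, (512,))

    for epoch in range(20):
        # Craft FGSM-style adversarial copies of the current batch
        X_req = X.clone().requires_grad_(True)
        loss_fn(model(X_req), y).backward()
        X_adv = (X_req + epsilon * X_req.grad.sign()).detach()

        # Train on the clean and adversarial data together
        opt.zero_grad()
        loss = loss_fn(model(torch.cat([X, X_adv])), torch.cat([y, y]))
        loss.backward()
        opt.step()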