21 September 2025

Securing data in the AI supply chain

Justin Sherman

Underpinning AI technologies is a complex supply chain—organizations, people, activities, information, and resources that enable AI research, development, deployment, and more. The AI supply chain includes human talent, compute, and institutional and individual stakeholders. This report focuses on another element of the AI supply chain: data.

The AI supply chain contains a wide range of data types, structures, sources, and use cases. Even so, policymakers can easily fall into the trap of focusing on one AI data component at one moment (e.g., training data circa 2017) and then switching to another (e.g., model weights today). The result risks being a lopsided policy that fails to account for all the data components that matter for AI research and development (R&D). For example, overconfidence about which data element or attribute will most drive AI R&D can lead researchers and policymakers to skip past important, open questions (e.g., what factors might matter, in what combinations, and to what end), wrongly treating them as resolved. Put simply, a “one-size-fits-all” approach to AI-related data risks producing a regulatory, technological, or governance framework that overfocuses on one element of the data in the AI supply chain while leaving other critical parts and questions unaddressed.

Managing the risks to the data components of the AI supply chain—from errors to data leakage to intentional model exploitation and theft—will require a set of different, tailored approaches aimed at achieving a comprehensive reduction in risk. As conceptualized in this report, the data in the AI supply chain includes the data describing an AI model’s properties and behavior, as well as the data associated with building and using a model. It also includes AI models themselves and the different digital systems that facilitate the movement of data into and out of models. The report, therefore, spells out a framework to visualize the seven data components in the AI supply chain: training data, testing data, models (themselves), model architectures, model weights, Application Programming Interfaces (APIs), and Software Development Kits (SDKs).

It then uses the framework to map data components of the AI supply chain to three different ways that policymakers, technologists, and other stakeholders can potentially think about data risk: data at rest vs. in motion vs. in processing (focus on a data component within the supply chain and its current state); threat actor risk (focus on threat actors and risks to a data component within the supply chain); and supply chain due diligence and risk management (focus on a data component supplier or source within the supply chain and related actors).
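To make the shape of that mapping concrete, the sketch below lays the seven data components against the three risk lenses as an empty grid. It is an illustration only, not the report's own tooling: the names `DataComponent`, `RiskLens`, `RiskNote`, `empty_risk_grid`, and `unassessed` are hypothetical, and every cell is deliberately left blank for an analyst to fill in rather than populated with assessments the report does not give here.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class DataComponent(Enum):
    """The seven data components named in the framework."""
    TRAINING_DATA = "training data"
    TESTING_DATA = "testing data"
    MODEL = "models"
    MODEL_ARCHITECTURE = "model architectures"
    MODEL_WEIGHTS = "model weights"
    API = "APIs"
    SDK = "SDKs"


class RiskLens(Enum):
    """The three ways of thinking about data risk."""
    DATA_STATE = "data at rest vs. in motion vs. in processing"
    THREAT_ACTOR = "threat actor risk"
    SUPPLY_CHAIN_DILIGENCE = "supply chain due diligence and risk management"


@dataclass
class RiskNote:
    """One cell of the grid (hypothetical structure, assumed for illustration)."""
    component: DataComponent
    lens: RiskLens
    note: str = ""  # intentionally blank until an analyst records a finding


def empty_risk_grid():
    """Build one blank cell for every (component, lens) pair."""
    return {
        (component, lens): RiskNote(component, lens)
        for component, lens in product(DataComponent, RiskLens)
    }


def unassessed(grid):
    """Return the (component, lens) pairs that still have no note."""
    return [key for key, cell in grid.items() if not cell.note]


grid = empty_risk_grid()
print(len(grid), "cells,", len(unassessed(grid)), "still unassessed")
```

Listing the unassessed cells is one simple way to surface the lopsidedness the report warns about: a review that fills in only the training data or model weights rows leaves the remaining components visibly blank.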
