March 13, 2025
Safe and Effective
Advancing Department of Defense Test and Evaluation for AI and Autonomous Systems
Executive Summary
With each passing year, the promise of artificial intelligence (AI) and autonomy to change the character of war inches closer to reality. This technology will continue to expand the pace, volume, and scope of future conflict, as seen in reports that basic autonomy has guided drones in Ukraine and that Israel has used AI for targeting in Gaza. As the United States and its allies and partners contend with how to develop and field these systems responsibly, it is more critical than ever to establish rigorous methods for assuring their compliance with technical best practices, U.S. military policy, and future operational concepts.
Although the Department of Defense (DoD) has a storied and successful history of deploying increasingly advanced technology in war, AI and autonomous systems present unique challenges, from their lack of explainability in selecting courses of action to their sensitivity to the data used to train them. As this technology develops at a breakneck pace in the United States and among its adversaries, now is the time to establish a technically grounded and dynamic framework across the Joint Force to ensure the DoD continues its history of safe and effective deployment.
Summary of Recommendations
This study yielded findings that apply across the lifecycle of AI and autonomous systems. These findings can be placed into two broad groups: those relating to research and system development, which concern the technical elements of designing and engineering these systems, and those relating to test and evaluation (T&E), which concern the practical and policy elements of virtual and live testing. Creating safe and effective AI and autonomous systems is not the responsibility of any one part of the acquisition or operational communities, but rather an effort that spans the entire system lifecycle. Therefore, these findings and recommendations apply broadly to several interconnected communities, from engineering and design teams to testers and operators.
Research and System Development
The DoD should develop a set of common characteristics of AI and autonomous systems that present novel assurance and sustainment concerns. Such a list would enable the acquisition community to identify these projects and potentially unlock access to additional resources and oversight early in the development process, when catching and rectifying errors is cheapest. Rather than relying on common, comprehensive definitions of AI and autonomy, or on self-reporting from vendors or program offices, AI and autonomous systems can better be captured through the characteristics that create their unique developmental, test, and sustainment concerns. This approach also avoids the scope creep that occurs when well-understood legacy applications of AI and autonomy are unintentionally swept up by overly broad definitions.
The acquisition community should incorporate required information sharing about system design techniques and engineering practices into contract language. This required information sharing should provide greater transparency about which elements of a system do or do not incorporate AI techniques that present novel assurance or sustainment concerns, helping to avoid “hidden AI” systems of which users or testers may be unaware.
The services and acquisition community should support embedding operators with developers early in the program lifecycle. Because design changes grow more costly the later they occur in system development, early embedding can reduce costs and improve efficiency by catching potential issues before the system is deployed to its intended end users.
System developers and program offices should emphasize explainability for operators early in development. While the novel tactics that machine learning (ML) systems learn in training may demonstrate performance gains, they can inhibit later adoption of the technology if not balanced against the concerns of test personnel and operators.
Test and Evaluation
Offices that establish and implement Responsible AI (RAI) policy should work with program offices and the engineering community to develop explicit guidance, informed by current engineering practice, that supports testable metrics for compliance with RAI principles. Much of this work should focus on translating high-level principles like “equitability” into performance parameters that testers can assess. While these procedures may be highly specific to the type of system being tested, as well as to any AI or ML techniques employed, sharing case studies of testing best practices can build among testers the kind of institutional knowledge that has emerged for legacy systems, such as crewed aircraft.
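As a minimal sketch of what such a translation might look like, the hypothetical example below operationalizes “equitability” as a falsifiable requirement: measure a detector’s recall across operating conditions and bound the spread between the best- and worst-performing subgroup. The subgroup labels, data format, and 0.05 threshold are illustrative assumptions, not drawn from DoD guidance.

```python
from collections import defaultdict

def recall_by_subgroup(results):
    """results: iterable of (subgroup, detected, is_true_target) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subgroup, detected, is_true_target in results:
        if is_true_target:
            totals[subgroup] += 1
            hits[subgroup] += int(detected)
    return {g: hits[g] / totals[g] for g in totals}

def equitability_gap(results):
    """Spread between best and worst subgroup recall; smaller is more equitable."""
    recalls = recall_by_subgroup(results)
    return max(recalls.values()) - min(recalls.values())

# Hypothetical evaluation records: (operating condition, detected?, true target?)
eval_results = [
    ("daylight", True, True), ("daylight", True, True), ("daylight", False, True),
    ("night", True, True), ("night", False, True), ("night", False, True),
]

MAX_GAP = 0.05  # illustrative threshold a program office might set
gap = equitability_gap(eval_results)
print(f"Equitability gap: {gap:.2f} (requirement: <= {MAX_GAP})")
```

A requirement phrased this way gives testers a concrete pass/fail criterion rather than an abstract principle.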
The department, through the acquisition and research and engineering communities, should provide the T&E enterprise, as well as contractors developing AI-enabled systems, with common, validated modeling and simulation tools, avoiding ad hoc or redundant tool development efforts. This may include sharing models of the environment, friendly forces, or adversary threats.
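To illustrate one form “common, validated” tools could take, the sketch below defines a shared simulation interface that multiple programs’ test harnesses could code against, with a trivial placeholder implementation. All class and field names here are hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ThreatState:
    position: tuple  # (x, y, z) in meters, local frame
    velocity: tuple  # (vx, vy, vz) in meters per second

class ThreatModel(Protocol):
    """Interface a shared, government-validated threat model might expose."""
    def reset(self, seed: int) -> ThreatState: ...
    def step(self, dt: float) -> ThreatState: ...

class ConstantVelocityThreat:
    """Trivial placeholder implementation for demonstration only."""
    def __init__(self, start, velocity):
        self._state = ThreatState(start, velocity)

    def reset(self, seed: int) -> ThreatState:
        return self._state

    def step(self, dt: float) -> ThreatState:
        x, y, z = self._state.position
        vx, vy, vz = self._state.velocity
        self._state = ThreatState((x + vx * dt, y + vy * dt, z + vz * dt),
                                  (vx, vy, vz))
        return self._state

# Any test harness written against ThreatModel can later swap in a validated
# model without rework.
threat = ConstantVelocityThreat(start=(0.0, 0.0, 1000.0), velocity=(50.0, 0.0, 0.0))
threat.reset(seed=0)
for _ in range(3):
    state = threat.step(dt=1.0)
print(state.position)  # (150.0, 0.0, 1000.0)
```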
Program offices should develop and maintain an early, ongoing dialogue with testers and operators, iterating on key system performance metrics so that they remain aligned both with the engineering processes underlying system development and with evolving mission concepts. Iterating on metrics matches industrial best practice in AI and ML system development, which recognizes that technical measures of module and system performance may not correlate directly with operational performance.
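A lightweight way to act on that caveat, sketched below with invented numbers, is to record a module-level technical metric alongside an operational outcome for each test event and check whether the two actually correlate before treating the technical metric as a proxy. The metric names and values are hypothetical.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-test-event records.
technical_metric = [0.91, 0.94, 0.88, 0.97, 0.93]  # e.g., classifier accuracy
operational_metric = [1, 0, 1, 1, 0]               # e.g., mission objective met?

r = correlation(technical_metric, operational_metric)
print(f"Pearson r = {r:.2f}")
# A weak correlation would signal that the technical metric alone is a poor
# proxy for operational performance and that the metric set should be revisited.
```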
The DoD T&E enterprise should broaden test safety policies to account for the unique risk profiles of AI-enabled and autonomous systems, improving the efficiency of test efforts. While large AI-enabled and autonomous systems may still present financial and operational risks in the event of failure, especially when they power uncrewed and robotic platforms, the risk to human life differs from that of traditional crewed systems.