This latest US Army War College report finds that all four commercial AI systems the authors tested in early 2026 passed the rigorous USAWC oral comprehensive examination. The authors designed “MilBench,” a domain-specific benchmark that applied the War College’s standard capstone assessment to ChatGPT, Gemini, Claude, and Grok in conversational mode.
Three faculty panels administered the examination, scoring Claude at a mean GPA of 3.98 (an A) and placing ChatGPT, Grok, and Gemini in a statistically indistinguishable B+ cluster. The multi-turn dialogue format exposed performance patterns that static benchmarks fail to surface, including brevity, sycophancy under pressure, and degradation over time. The authors argue that these results challenge how the Department of War evaluates commercial AI for strategic applications, and they call for domain-specific, dialogue-based assessment standards.
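The report does not publish the statistical test behind the “statistically indistinguishable” claim. As a rough illustration only, the sketch below runs pairwise Welch t-tests on invented per-panel scores; both the scores and the choice of test are assumptions, not the report’s data or method.

```python
# Hypothetical illustration: panel scores and the Welch t-test are
# assumptions, not taken from the USAWC report.
from itertools import combinations
from scipy import stats

# Invented GPA scores from three faculty panels (not the report's data).
panel_scores = {
    "ChatGPT": [3.4, 3.3, 3.5],
    "Grok":    [3.3, 3.4, 3.3],
    "Gemini":  [3.5, 3.3, 3.4],
}

# Welch's t-test for each pair of models; a large p-value means the
# panel scores give no evidence that the two models differ.
for a, b in combinations(panel_scores, 2):
    t, p = stats.ttest_ind(panel_scores[a], panel_scores[b], equal_var=False)
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.2f}")
```

With small per-panel samples like these, pairwise p-values well above 0.05 are the kind of outcome that would support describing the B+ cluster as statistically indistinguishable.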