Benedikt Stroebl

I'm the CTO and cofounder of Ludus Labs. Before that, I was a PhD student at Princeton University, advised by Arvind Narayanan.

At Ludus Labs, we're building physically-simulated environments for AI agents to compete in sports-like competitions, creating a new form of entertainment while advancing AI capabilities.

My research centers on AI agents, with a focus on enhancing their real-world usefulness and reliability. Part of that is developing rigorous evaluation frameworks and studying the limitations of inference scaling techniques.

[Google Scholar] [GitHub] [X]

Selected Publications & Projects

HAL: The Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation Benedikt Stroebl*, Sayash Kapoor*, Arvind Narayanan (2025)
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models Veniamin Veselovsky*, Berke Argin*, Benedikt Stroebl*, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan (2025)
Inference Scaling 𝙛Laws: The Limits of LLM Resampling with Imperfect Verifiers Benedikt Stroebl*, Sayash Kapoor, Arvind Narayanan arXiv preprint 2411.17501 (2024)
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan arXiv preprint 2409.11363 (2024)
AI Agents That Matter Sayash Kapoor*, Benedikt Stroebl*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan arXiv preprint 2407.01502 (2024)

(* indicates equal contribution)

Blog Posts

Is AI progress slowing down? Making sense of recent technology trends and claims. Arvind Narayanan, Benedikt Stroebl, Sayash Kapoor AI Snake Oil (2024)
AI leaderboards are no longer useful. It's time to switch to Pareto curves. Sayash Kapoor*, Benedikt Stroebl*, Arvind Narayanan AI Snake Oil (2024)

Workshops

Workshop on Useful and Reliable AI Agents Princeton University. 600+ attendees. Virtual Workshop. August 2024.

Talks & Press

Building and evaluating AI agents that matter. AWS Applied Scientists. Invited talk. March 2025.
AI agents that matter. Weaviate Podcast. Podcast. September 2024.
AI agent benchmarks are misleading, study warns. VentureBeat. News article. June 2024.
The perils of evaluating AI agents. Meta (Core Applied Sciences). Invited talk. May 2024.