Accuracy-Cost Tradeoff on HumanEval

Full reproducibility materials for the paper titled "AI Agents That Matter"

Input ($/1M tokens)

Output ($/1M tokens)

GPT-4

GPT-3.5

Llama-3-8B

Llama-3-70B

Log Scale

Legend

Complex agent

✕

Baseline agent

Zero-shot model