τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

The chart shows Pass^1 results over time. Pass^1 through Pass^4 are reported for every entry in the tables below.


Airline Results

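| Date | Strategy | Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---|---|---|---|---|---|
| 10/22/2024 | TC | claude-3-5-sonnet-20241022 | 0.460 | 0.326 | 0.263 | 0.225 |
| 7/18/2024 | TC | gpt-4o-mini | 0.225 | 0.140 | 0.110 | 0.100 |
| 6/17/2024 | TC | gpt-4o | 0.420 | 0.273 | 0.220 | 0.200 |
| 6/17/2024 | Act | gpt-4o | 0.365 | 0.217 | 0.160 | 0.140 |
| 6/17/2024 | ReAct | gpt-4o | 0.325 | 0.233 | 0.185 | 0.160 |
| 4/10/2024 | TC | claude-3-5-sonnet-20240620 | 0.360 | 0.224 | 0.169 | 0.139 |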

Retail Results

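| Date | Strategy | Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---|---|---|---|---|---|
| 10/22/2024 | TC | claude-3-5-sonnet-20241022 | 0.692 | 0.576 | 0.509 | 0.462 |
| 6/17/2024 | TC | gpt-4o | 0.604 | 0.491 | 0.430 | 0.383 |
| 4/10/2024 | TC | claude-3-5-sonnet-20240620 | 0.626 | 0.506 | 0.435 | 0.387 |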

TC = `tool-calling` strategy (the function-calling strategy reported in the paper)
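
For readers unfamiliar with the Pass^k columns: as defined in the τ-bench paper, Pass^k is the probability that all k independent trials of the same task succeed, averaged over tasks. With n trials per task and c successes on a given task, the per-task estimate is C(c, k) / C(n, k). The sketch below is a minimal illustration of that estimator, not the benchmark's own evaluation code; the function name `pass_hat_k` and the success counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate Pass^k from per-task success counts.

    successes_per_task: number of successful trials for each task
    num_trials: trials run per task (assumed the same for every task)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # Per task: the fraction of k-trial subsets made up entirely of successes.
    # math.comb(c, k) is 0 when c < k, so those tasks contribute nothing.
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 4 tasks, 4 trials each (not taken from the leaderboard).
example_counts = [4, 2, 0, 3]
for k in range(1, 5):
    print(f"Pass^{k} = {pass_hat_k(example_counts, 4, k):.4f}")
```

Because a task counts toward Pass^k only when every one of the k trials succeeds, the value can only stay flat or fall as k grows, which matches the left-to-right decline in each row of the tables.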