τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

The chart shows Pass^1 results over time. Pass^1 through Pass^4 for every entry are listed in the tables below.

[Chart: Pass^1 over time, with one panel each for Airline and Retail]

Airline Results

Date	Strategy	Model	Pass^1	Pass^2	Pass^3	Pass^4
10/22/2024	TC	claude-3-5-sonnet-20241022	0.460	0.326	0.263	0.225
7/18/2024	TC	gpt-4o-mini	0.225	0.140	0.110	0.100
6/17/2024	TC	gpt-4o	0.420	0.273	0.220	0.200
6/17/2024	Act	gpt-4o	0.365	0.217	0.160	0.140
6/17/2024	ReAct	gpt-4o	0.325	0.233	0.185	0.160
4/10/2024	TC	claude-3-5-sonnet-20240620	0.360	0.224	0.169	0.139

Retail Results

Date	Strategy	Model	Pass^1	Pass^2	Pass^3	Pass^4
10/22/2024	TC	claude-3-5-sonnet-20241022	0.692	0.576	0.509	0.462
6/17/2024	TC	gpt-4o	0.604	0.491	0.430	0.383
4/10/2024	TC	claude-3-5-sonnet-20240620	0.626	0.506	0.435	0.387

TC = `tool-calling` strategy (the function-calling strategy reported in the paper)
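
The Pass^k columns above can be reproduced from raw trial outcomes. The sketch below assumes the definition used in the τ-bench paper: Pass^k is the chance that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials, then averaged over tasks. The function names and the sample data are hypothetical and are not taken from the benchmark's code.

```python
from math import comb


def task_pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate, for one task, the chance that k trials all succeed.

    Uses C(c, k) / C(n, k), where c is the number of successful trials
    and n is the total number of trials run for the task.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, which is the desired value.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    total = sum(task_pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: 4 tasks, 4 trials each, with 4, 2, 0 and 3 successes.
example_successes = [4, 2, 0, 3]
for k in range(1, 5):
    value = benchmark_pass_hat_k(example_successes, num_trials=4, k=k)
    print(f"Pass^{k} = {value:.4f}")
```

With this sample data, Pass^1 is 0.5625 (the plain success rate) and Pass^4 is 0.25, because only one of the four tasks succeeded on every trial. Under this definition Pass^k cannot increase as k grows, which matches the falling values across each row of the tables.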