τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
The chart shows Pass^1 results over time. All Pass^k values (k = 1 to 4) are listed in the tables below.
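For readers unfamiliar with the metric, Pass^k estimates the chance that an agent solves the same task in all k independent trials, averaged over tasks. The sketch below follows the estimator described in the τ-bench paper: with n trials per task and c successes, the per-task value is C(c, k) / C(n, k). The function names and the example success counts are hypothetical and serve only as illustration. They are not taken from the benchmark code or from the results on this page.

```python
from math import comb


def task_pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Per-task Pass^k: fraction of size-k subsets of trials that all succeeded."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # math.comb returns 0 when k > num_successes, so the result is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task Pass^k over all tasks."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(task_pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


if __name__ == "__main__":
    # Hypothetical data: three tasks, four trials each, with 4, 2, and 0 successes.
    example_successes = [4, 2, 0]
    for k in range(1, 5):
        value = benchmark_pass_hat_k(example_successes, num_trials=4, k=k)
        print(f"Pass^{k} = {value:.3f}")
```

Because a task only counts toward Pass^k when every one of the k sampled trials succeeds, the value cannot increase as k grows, which matches the left-to-right decline in each row of the tables.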
Airline Results
Date | Strategy | Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
---|---|---|---|---|---|---|
10/22/2024 | TC | claude-3-5-sonnet-20241022 | 0.460 | 0.326 | 0.263 | 0.225 |
7/18/2024 | TC | gpt-4o-mini | 0.225 | 0.140 | 0.110 | 0.100 |
6/17/2024 | TC | gpt-4o | 0.420 | 0.273 | 0.220 | 0.200 |
6/17/2024 | Act | gpt-4o | 0.365 | 0.217 | 0.160 | 0.140 |
6/17/2024 | ReAct | gpt-4o | 0.325 | 0.233 | 0.185 | 0.160 |
4/10/2024 | TC | claude-3-5-sonnet-20240620 | 0.360 | 0.224 | 0.169 | 0.139 |
Retail Results
Date | Strategy | Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
---|---|---|---|---|---|---|
10/22/2024 | TC | claude-3-5-sonnet-20241022 | 0.692 | 0.576 | 0.509 | 0.462 |
6/17/2024 | TC | gpt-4o | 0.604 | 0.491 | 0.430 | 0.383 |
4/10/2024 | TC | claude-3-5-sonnet-20240620 | 0.626 | 0.506 | 0.435 | 0.387 |
TC = `tool-calling` strategy (the function-calling strategy reported in the paper); Act = act-only strategy; ReAct = strategy that interleaves reasoning with actions