Quickstart#
The Benchmark class is a comprehensive framework for evaluating language model agents across various tasks and environments. It provides a flexible structure for managing multiple environments and tasks, offering single- and multi-environment execution modes.
The following image shows an overview of how Benchmark works.
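Before going through the individual steps, here is a condensed sketch of the same flow in code. It is not a complete program on its own; it only strings together the calls covered step by step below (create_benchmark, start_task, observe, step, and reset) with the OpenAIAgent used later in this guide.

from crab import create_benchmark
from crab.benchmarks import template_benchmark_config
from crab.client.openai_interface import OpenAIAgent

# Build the benchmark from a predefined configuration and start a task
benchmark = create_benchmark(template_benchmark_config)
task, action_space = benchmark.start_task("0")
agent = OpenAIAgent(task, action_space)

# Observe, decide, and act until the task terminates
for _ in range(20):
    observation = benchmark.observe()
    action_result = agent.determine_next_action(observation)
    step_result = benchmark.step(action_result.action, action_result.parameters)
    if step_result.terminated:
        break

# Clean up when finished
benchmark.reset()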
Basic Usage#
Step 1: Importing the Benchmark#
Begin by importing a predefined benchmark from the crab.benchmarks module. For example, here we import template_benchmark_config:
from crab.benchmarks import template_benchmark_config
Step 2: Creating the Benchmark#
Use the create_benchmark function to create an instance of the Benchmark class based on the imported benchmark configuration:
from crab import create_benchmark
benchmark = create_benchmark(template_benchmark_config)
Step 3: Starting a Task#
Select a task to start within the benchmark. The task ID should correspond to one of the predefined tasks in the benchmark configuration. Use the start_task method to initialize and begin the task:
# Starting the task with ID "0"
task, action_space = benchmark.start_task("0")
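If you want a quick look at what you just started, you can print the returned task and action space. The attribute names below (task.id, task.description, action.name) are assumptions about how the template benchmark models tasks and actions; adjust them to match your configuration.

# Optional sanity check on what start_task returned. The attribute names
# used here are assumptions and may differ in your benchmark configuration.
print(f"Task {task.id}: {task.description}")
for action in action_space:
    print(f"- available action: {action.name}")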
Step 4: Running the Benchmark Loop#
Execute actions and observe the results using the step and observe methods:
from crab.client.openai_interface import OpenAIAgent

# Initialize the agent with the benchmark task and action space
agent = OpenAIAgent(task, action_space)

# Define a function to run the benchmark
def run_benchmark(benchmark, agent):
    for step in range(20):  # Adjust the number of steps to your requirements
        print("=" * 40)
        print(f"Starting step {step}:")
        # Get the current observations and prompts
        observation = benchmark.observe()
        # Process the observations and determine the next action
        action_result = agent.determine_next_action(observation)
        # Execute the action and get the result
        step_result = benchmark.step(action_result.action, action_result.parameters)
        # Check the current evaluation result
        print(step_result.evaluation_results)
        # Check if the task is terminated and break the loop if so
        if step_result.terminated:
            print("Task completed successfully.")
            print(step_result.evaluation_results)
            break

run_benchmark(benchmark, agent)
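The same loop can be reused for other tasks defined in the benchmark configuration. The sketch below assumes the configuration also defines a task with ID "1"; it simply repeats the start_task / run_benchmark / reset cycle for each ID.

# Run several tasks back to back. Task ID "1" is an assumption; use the
# IDs defined in your own benchmark configuration.
for task_id in ["0", "1"]:
    task, action_space = benchmark.start_task(task_id)
    agent = OpenAIAgent(task, action_space)
    run_benchmark(benchmark, agent)
    benchmark.reset()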
Step 5: Completing the Benchmark#
Clean up and reset the benchmark after completion using the reset method:
benchmark.reset()