Build your own benchmark#
Overview#
Crab benchmark system mainly consists of five types of component:
Action
: The fundamental building block of Crab framework, which represents a unit operation that can be taken by an agent or as a fixed process that called multi times in a benchmark.Evaluator
: A specific type ofAction
that assess whether an agent has achieved its goal. Multiple evaluators can be combined together as a graph to enable complex evaluation.Environment
A abstraction of an environment that the agent can take action and obverse in a given action and observation space. An environment can be launched on the local machine, a physical remote machine, or a virtual machine.Task
: A task with a natural language description to instruct the agent to perform. It can include interaction with multiple environments. Notice that in the benchmark, a task should have an graph evaluator to judge if the task progress.Benchmark
: The main body of the crab system that contains all required component to build a benchmark, including environments, tasks, prompting method. It controls several
Actions#
Actions are the fundamental building blocks of the Crab system’s operations. Each action is encapsulated as an instance of the Action
class. An action can convert into a JSON schema for language model agents to use.
An action is characterized by the following attributes:
Name: A string identifier uniquely represents the action.
Entry: A callable entry point to the actual Python function that executes the action.
Parameters: A Pydantic model class that defines the input parameters the action accepts.
Returns: A Pydantic model class that defines the structure of the return type the action produces.
Description: An string providing a clear and concise description of what the action does and how it behaves.
Kept Parameters: A list of parameters retained for internal use by the Crab system, which do not appear in the action’s parameter list but are injected automatically at runtime. For exmaple we use
env
to represent the current environment object that action are taken in.Environment Name: An optional string that can specify the environment the action is associated with. Usually this attribute is only used by predifined actions like
setup
in an environment.
Here is an example of creating an action through python function:
@action
def click(x: float, y: float) -> None:
"""
click on the current desktop screen.
Args:
x (float): The X coordinate, as a floating-point number in the range [0.0, 1.0].
y (float): The Y coordinate, as a floating-point number in the range [0.0, 1.0].
"""
import pyautogui
pyautogui.click(x,y)
The @action
decorator transforms the click
function into an Action
with these mappings:
The function name
click
becomes the action name.The parameters
x: float, y: float
with their type hints become the action parameters.The return type hint
-> None
is used for the action’s returns field, indicating no value returned.The function’s docstring provides a description for the action and its parameters, utilized in the JSON schema for the agent.
The function body defines the action’s behavior, executed when the action is called.
The Action
class allows for different combination operations such as:
Pipe: Using the
>>
operator, actions can be piped together, where the output of one action becomes the input to another, provided their parameters and return types are compatible.Sequential Combination: The
+
operator allows for two actions to be combined sequentially, executing one after the other.
Evaluators#
Evaluators in the Crab system are a specific type of Action
that assess whether an agent has achieved its goal. They should return a boolean value, indicating whether the task’s objective has been met. Multiple evaluators can be connected into a graph using the networkx
package, enabling multi-stage evaluation, where different conditions can be checked in sequence or in parallel.
An example evaluator check_file_exist
confirms the presence of a file at a given path, using the os.path.isfile
method to return True
if the file exists or False
otherwise:
@evaluator
def check_file_exist(file_path: str) -> bool:
return os.path.isfile(file_path)
Extra attributes of evaluators:
Require Submit: Indicates if the evaluator awaits a specific submission to carry out its assessment.
Logical operators allow for evaluator combinations:
AND (&): Requires all evaluators to succeed for a task to pass.
OR (|): Passes if any of the evaluators succeed.
NOT (~): Reverses the evaluation outcome.
The combined evaluator is still considered as one evaluator rather than a graph evaluator.