Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls

¹Australian National University ²CAMEL-AI.org ³Eigent.AI ⁴CSIRO's Data61 ⁵Aipotheosis Labs, Inc.
TL;DR

Gecko is a virtual tool-use environment that lets LLM agents iteratively refine tool calls in a sandbox before real execution.

Gecko system overview

Overview of tool-call refinement with Gecko feedback, illustrated through a single dialogue. From left to right, we show three consecutive refinement attempts. [Left] The tool call fails the argument validation check due to an incorrect filename format. [Middle] The arguments pass validation, but simulated execution reveals that the folder is created in the wrong directory, so the solution does not solve the task. [Right] The planning LLM finally generates a correct solution that can be safely executed in the real environment.

GATS improves tool-calling performance across BFCLv3 and τ2-bench tasks

GATS uses Gecko's feedback from simulated tool executions to iteratively refine tool calls, consistently improving the performance of different LLMs on BFCLv3 and τ2-bench.

Gecko

Examples of Gecko feedback

Gecko is a simulation environment that grounds tool-calling agents before real execution. Given a task and tool calls from a planning model, Gecko runs a virtual execution pipeline: it validates tool calls, synthesizes schema-compliant responses, updates an evolving task state, and returns task-level feedback about completion status and remaining objectives. This design lets the planner receive actionable feedback while preserving consistency across multi-step calls in a single task context.

Argument Validator

Checks both syntax and semantics of tool arguments. It enforces schema rules (required fields, types, constraints) and uses an LLM to detect context-level issues such as wrong formats or incompatible values.
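The schema-rule side of this check can be sketched roughly as follows. This is a minimal illustration, not Gecko's actual implementation: the LLM-based semantic pass (detecting wrong formats or incompatible values) is omitted, and all names are our own.

```python
def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Check tool-call arguments against a JSON-Schema-like tool definition.

    Returns a list of human-readable error messages (empty if valid).
    The LLM-based semantic check is intentionally omitted in this sketch.
    """
    errors = []
    props = schema.get("properties", {})
    # Required fields must be present.
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "array": list, "object": dict}
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected argument '{name}'")
            continue
        # Type constraint from the schema.
        expected = type_map.get(spec.get("type", "object"), object)
        if not isinstance(value, expected):
            errors.append(f"argument '{name}' should be {spec['type']}")
        # Enum constraint, if declared.
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"argument '{name}' must be one of {spec['enum']}")
    return errors
```

A failed check would be surfaced to the planner as feedback (as in the leftmost panel of the overview figure) rather than silently dropped.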

Response Generator

Produces realistic, schema-compliant tool outputs for valid calls. Generation is conditioned on tool definitions and current task state so outputs remain coherent with prior steps.

Task State Estimator

Maintains a compact task state that records cumulative effects of tool calls. After each call, it updates the state using the previous state and the latest response.

Task Feedback Generator

Uses an LLM judge to assess progress against task objectives. It returns success when objectives are satisfied, or reports remaining goals to guide the next refinement step.
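The four components above can be wired into a single virtual-execution step roughly as below. This is a toy sketch under our own naming, with the LLM-backed pieces (response generation, semantic validation, the judge) replaced by trivial stubs.

```python
from dataclasses import dataclass, field


@dataclass
class GeckoSession:
    """Toy sketch of one Gecko virtual-execution step (illustrative only)."""
    tools: dict                                # tool name -> argument schema
    state: dict = field(default_factory=dict)  # evolving task state

    def step(self, call: dict) -> dict:
        schema = self.tools[call["name"]]
        # 1. Argument validation (schema rules; LLM semantic check omitted).
        missing = [r for r in schema.get("required", []) if r not in call["args"]]
        if missing:
            return {"ok": False, "feedback": f"missing arguments: {missing}"}
        # 2. Response generation, conditioned on the tool definition and
        #    current state (a real system would call an LLM here).
        response = {"tool": call["name"], "result": "simulated-ok"}
        # 3. Task-state update from the previous state + latest response.
        self.state.setdefault("calls", []).append(call["name"])
        # 4. Task-level feedback: report objectives not yet satisfied.
        remaining = [o for o in self.state.get("objectives", [])
                     if o not in self.state["calls"]]
        return {"ok": not remaining, "response": response,
                "feedback": {"remaining": remaining}}
```

The key property this sketch preserves is that every call reads and writes the same session state, so multi-step calls within one task context stay consistent.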

API Schema Converter

Converts non-OpenAPI tool definitions into OpenAPI 3.1 schemas and validates them, enabling rapid integration of new tools into Gecko's simulation pipeline.
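As an illustration of the kind of mapping involved, the sketch below wraps an OpenAI-style function definition in a minimal OpenAPI 3.1 document, with each tool becoming a POST operation whose request body is the argument schema. This is an assumption about the conversion shape, not Gecko's actual converter, which also validates the resulting schemas.

```python
def function_to_openapi(fn: dict) -> dict:
    """Wrap an OpenAI-style function definition in a minimal OpenAPI 3.1 doc.

    Illustrative sketch only: each tool becomes a POST operation whose
    request body carries the tool's argument schema.
    """
    return {
        "openapi": "3.1.0",
        "info": {"title": fn["name"], "version": "1.0.0"},
        "paths": {
            f"/{fn['name']}": {
                "post": {
                    "operationId": fn["name"],
                    "description": fn.get("description", ""),
                    "requestBody": {
                        "content": {
                            "application/json": {
                                "schema": fn.get("parameters", {"type": "object"})
                            }
                        }
                    },
                    "responses": {"200": {"description": "Tool response"}},
                }
            }
        },
    }
```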

GATS

GATS (Grounding Agent Test-time Scaling) uses Gecko as an iterative refinement runtime.

1. Session Init in Gecko

Start a new Gecko session, then load mock tools and the task state from the previous turn.

2. Planning on Mock Tools

The planning LLM solves the task with mock tools in Gecko and produces an attempt solution.

3. Judge and Feedback Loop

Gecko's judge checks whether the attempt solution solves the task based on the post-execution task state. If not solved, Gecko returns feedback and the process goes back to Step 1.

4. Real Tool Execution

The planning LLM executes the best explored attempt solution on real tools.

5. State Sync to Gecko

Real tool execution results are synchronized back to Gecko to calibrate this turn's task state for the next turn.
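The five steps above amount to a refine-until-solved loop per turn. The sketch below shows its control flow with the planner, judge, and real executor supplied as stubs; the function names and the attempt cap are ours, not the paper's.

```python
def gats_turn(plan, judge, execute_real, max_attempts=3):
    """One GATS turn: refine in simulation, then execute the best attempt.

    plan(feedback) -> candidate solution (Step 2, against mock tools);
    judge(solution) -> (solved, feedback) (Step 3, on post-execution state);
    execute_real(solution) -> real results (Step 4).
    All three callables are caller-supplied stubs in this sketch.
    """
    feedback = None
    best = None
    for _ in range(max_attempts):
        candidate = plan(feedback)           # plan against mock tools
        solved, feedback = judge(candidate)  # judge the simulated outcome
        best = candidate                     # keep the latest explored attempt
        if solved:
            break
    results = execute_real(best)             # real tool execution
    return best, results                     # caller syncs state back (Step 5)
```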

Experiments

BFCLv3 Main Evaluation

Method comparison on BFCLv3. We select the eight most important metric categories from the BFCL website. Overall accuracy averages three scores: the mean of the Non-live single-turn metrics, the mean of the Live single-turn metrics, and the Multi-turn score. GATS consistently improves various planning LLMs.

| Model | Overall Acc | Non-live simple | Non-live parallel | Non-live multiple | Non-live irrelevance | Live simple | Live multiple | Live irrelevance | Multi-turn base |
|---|---|---|---|---|---|---|---|---|---|
| *State-of-the-art reference models* | | | | | | | | | |
| ToolACE-2-8B | 73.12 | 88.00 | 92.50 | 92.50 | 95.41 | 70.93 | 79.01 | 84.80 | 49.00 |
| watt-tool-70B | 79.27 | 98.25 | 85.50 | 94.00 | 84.16 | 86.04 | 83.47 | 68.48 | 68.00 |
| xLAM-2-70b | 80.96 | 94.75 | 92.00 | 94.50 | 83.33 | 77.13 | 71.13 | 74.48 | 77.50 |
| *Baseline models and our proposed method* | | | | | | | | | |
| GPT-4.1-nano | 58.85 | 82.25 | 78.50 | 75.00 | 80.83 | 65.11 | 58.97 | 72.22 | 32.00 |
| +GATS | 67.59 | 93.25 | 88.50 | 95.00 | 81.25 | 77.13 | 69.80 | 80.38 | 37.50 |
| GPT-4.1-mini | 66.20 | 91.50 | 84.50 | 88.00 | 78.33 | 79.45 | 70.94 | 68.70 | 40.00 |
| +GATS | 73.84 | 96.25 | 88.00 | 95.50 | 84.58 | 84.49 | 74.54 | 80.83 | 50.50 |
| GPT-4o | 76.93 | 92.75 | 92.50 | 92.50 | 84.16 | 81.00 | 78.53 | 78.45 | 61.00 |
| +GATS | 84.62 | 96.50 | 95.00 | 95.50 | 95.83 | 84.10 | 81.01 | 93.42 | 72.00 |
| GPT-5-thinking | 61.94 | 78.00 | 84.00 | 76.00 | 92.91 | 61.62 | 57.45 | 89.70 | 33.50 |
| +GATS | 66.08 | 85.00 | 90.50 | 83.00 | 93.75 | 67.44 | 63.24 | 90.38 | 36.50 |
| Gemini-2.5-pro | 66.44 | 86.25 | 69.00 | 86.00 | 91.66 | 77.90 | 62.20 | 89.68 | 39.50 |
| +GATS | 70.44 | 92.25 | 75.00 | 89.00 | 92.50 | 80.62 | 67.99 | 91.83 | 44.00 |
| Gemini-3.0-pro-preview | 79.97 | 94.50 | 91.00 | 94.00 | 82.50 | 87.60 | 80.44 | 73.19 | 69.00 |
| +GATS | 85.19 | 97.00 | 93.00 | 95.50 | 94.17 | 85.93 | 82.34 | 89.59 | 73.50 |
| Deepseek-V3 | 70.40 | 97.00 | 92.00 | 94.00 | 80.41 | 86.04 | 79.48 | 72.56 | 41.00 |
| +GATS | 72.90 | 97.25 | 92.00 | 95.50 | 83.75 | 88.75 | 81.76 | 78.79 | 43.50 |
| Qwen-3-14B | 73.78 | 95.50 | 92.50 | 95.00 | 84.58 | 86.04 | 80.81 | 77.44 | 48.00 |
| +GATS | 78.60 | 96.75 | 93.50 | 95.00 | 92.50 | 87.59 | 83.00 | 91.50 | 54.00 |

τ2-bench Main Evaluation

Method comparison on τ2-bench. We report success rate under τ2-retail and τ2-airline subsets and average accuracy (Overall).

| Model | τ2-retail | τ2-airline | Overall |
|---|---|---|---|
| *State-of-the-art reference models* | | | |
| Claude Opus 4 | 81.8% | 60.0% | 70.9% |
| Claude Sonnet 4 | 75.0% | 55.5% | 65.3% |
| Kimi-K2-Instruct | 70.6% | 56.5% | 63.6% |
| *Baseline models and w/ GATS* | | | |
| GPT-4o | 62.9% | 45.5% | 54.2% |
| +GATS | 69.3% | 52.0% | 60.7% |
| GPT-5-mini | 73.5% | 57.0% | 65.3% |
| +GATS | 78.5% | 65.0% | 71.8% |
| GPT-5-thinking | 81.6% | 63.0% | 72.3% |
| +GATS | 84.6% | 68.0% | 76.3% |
| Gemini-3.0-pro-preview | 85.1% | 72.5% | 78.8% |
| +GATS | 88.2% | 76.5% | 82.3% |

Comparison with Other Test-time Scaling Methods (τ2-airline)

Comparing GATS with various test-time scaling methods on τ2-airline, with GPT-5-mini as the baseline planning LLM. We report success rate, token usage, and the average number of tool calls. GATS achieves the best performance at a reasonable cost.

| Method | Performance | Tokens (k) | Avg. Tool Calls |
|---|---|---|---|
| GPT-5-mini | 57.0% | 80.3 | 6.70 |
| w/ Reflexion | 60.0% | 102.9 | 6.62 |
| w/ Merge-to-one | 53.5% | 111.2 | 5.88 |
| w/ Self-refine | 59.0% | 259.6 | 23.88 |
| w/ Best-of-n | 58.5% | 365.8 | 13.78 |
| w/ GATS | 65.0% | 238.2 | 7.76 |

Discussion

Can Gecko be implemented only by prompting?

In principle, yes: one could try to craft a single prompt that makes the model produce all feedback and responses itself. In practice, such a prompt is nearly impossible to find, because Gecko combines rule-based checks with LLM reasoning, and collapsing everything into one prompt makes it too complex for the model to follow reliably. We also tested a merge-to-one variant that compresses Gecko into a single prompt, and it performs substantially worse than GATS.

Do simulation errors accumulate?

For single-turn tasks, simulation errors can accumulate. For multi-turn tasks, real tools are executed at the end of each turn, so the task state is recalibrated and error accumulation is reduced. Even so, under current evaluation protocols a single tool-call error can still cause task failure; in practical multi-turn settings, users can often correct such mistakes through follow-up interaction.

How does Gecko compare with StableToolBench?

Both can simulate API responses, but Gecko has several practical advantages. StableToolBench needs collected real responses for simulation, while Gecko can support new APIs from API descriptions. Gecko also includes argument validation and rejects invalid API calls with meaningful errors, which is closer to real API behavior. In addition, Gecko models multi-turn consistency with task state and conversation history rather than only isolated API calls.

What new research possibilities are enabled by Gecko?

Gecko can act as a verifier for tool-call data synthesis: given tasks, tool definitions, and tool-call sequences, it can simulate outcomes and return task-level feedback for filtering or correction. Gecko can also convert supervised tool-call datasets into reinforcement-learning environments: tool schemas define the action space, simulated execution gives observations, and judge-based checklist evaluation provides reward signals for offline or online training.
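One way the RL-environment idea above could look is a Gym-style wrapper where tool schemas define the action space, simulated execution yields observations, and a checklist over the task state yields reward. The sketch below is our own illustration under those assumptions, with the judge reduced to exact checklist matching.

```python
class ToolCallEnv:
    """Gym-style sketch of an RL environment built from tool schemas.

    Illustrative only: actions are tool calls, observations are simulated
    responses, and reward is the fraction of checklist objectives satisfied.
    """

    def __init__(self, tools: dict, checklist: list):
        self.tools = tools          # action space: tool name -> schema
        self.checklist = checklist  # objectives a judge would check off
        self.state = {"done_items": set()}

    def reset(self):
        self.state = {"done_items": set()}
        return {"tools": list(self.tools), "remaining": list(self.checklist)}

    def step(self, action: dict):
        # Simulated execution stands in for Gecko's response generator.
        obs = {"tool": action["name"], "result": "simulated"}
        if action["name"] in self.checklist:
            self.state["done_items"].add(action["name"])
        # Checklist-based reward signal.
        reward = len(self.state["done_items"]) / len(self.checklist)
        done = reward == 1.0
        return obs, reward, done
```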

What are the current limitations?

Gecko currently focuses on text-output tools and does not yet support non-text outputs such as videos. For tools that depend on external databases, simulated outputs may diverge from real-world states. A practical mitigation is hybrid execution: simulate state-changing tools inside Gecko while calling real read-only query tools directly, balancing safety with lower simulation-reality drift.
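The hybrid mitigation described above can be sketched as a dispatcher that routes read-only tools to real execution and state-changing tools to simulation. The registry flag and executor callables here are hypothetical, not part of Gecko's published interface.

```python
def dispatch(call: dict, registry: dict, simulate, run_real):
    """Route a tool call for hybrid execution (illustrative sketch).

    registry maps tool name -> {"read_only": bool}; simulate and run_real
    are caller-supplied executors (both hypothetical names).
    """
    if registry[call["name"]].get("read_only", False):
        # Safe to hit the real API: no side effects, and real data
        # avoids simulation-reality drift for external databases.
        return run_real(call)
    # State-changing call: keep it inside the sandbox.
    return simulate(call)
```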

Conclusion

Gecko is a simulation environment that takes tool calls from planning LLMs and returns layered feedback, from argument validity to simulated responses and task-state-aware task feedback. Based on this feedback loop, GATS iteratively refines tool calls at test time and consistently improves performance across tool-use benchmarks. Beyond test-time scaling, Gecko can serve as infrastructure for tool-call data verification and for constructing reinforcement-learning environments from tool schemas and task objectives.

Citation

@misc{zhang2026gecko,
  title={Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls},
  author={Zeyu Zhang and Guohao Li and Zhenchang Xing and Alexandros Apostolopoulos and Yu Lin Lee and Liang Zheng},
  year={2026},
  eprint={2602.19218},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.19218},
}