Format Pipeline
The format pipeline transforms tool responses into token-efficient output for LLMs using TOON (Token-Oriented Object Notation) with intelligent budget trimming and chunk-based lazy loading.
Overview
Output Formats
TOON Format
TOON (Token-Oriented Object Notation) is a compact, human-readable format designed to minimize token usage when passing structured data to LLMs.
We use the toon-format Rust crate (v0.4, spec v3.0) — a community-driven, MIT-licensed implementation.
- Website: toonformat.dev
- GitHub: toon-format/toon-rust
- Crate: crates.io/crates/toon-format
- Spec: TOON v3.0
Key features used:
- Key folding: `data.metadata.items` instead of nested blocks
- Tabular arrays: shared headers for arrays of objects
- Minimal indentation: 1-space indent
Example
JSON (16 tokens):
TOON (13 tokens):
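The original side-by-side snippets are not reproduced here; as a toy illustration of the two notations (not the benchmarked example, so its token counts differ from the figures above), the same two-record list might compare as:

```json
{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}
```

```
users[2]{id,name}:
 1,Alice
 2,Bob
```

The tabular form declares the shared `{id,name}` header once, so each additional row costs only its values.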
Trim Levels
The encoder supports three detail levels for progressive degradation:
Real-World Benchmarks
Benchmarks on popular open-source GitHub projects (budget: 8,000 tokens).
Run your own: `devboy benchmark --owner <owner> --repo <repo>`
TOON Full vs JSON (no trimming)
CPU Overhead
TOON encoding costs additional CPU time compared to JSON serialization, but the overhead is negligible relative to network latency (~100-500ms for API calls) and LLM inference cost:
The absolute overhead is < 4ms even for 30+ items — orders of magnitude less than the token cost savings at LLM inference time.
TOON with Trim Levels (budget trimming active)
When budget trimming is applied, the pipeline progressively reduces detail level:
The budget pipeline automatically selects the optimal combination: first tries to fit all items at Full level, then progressively drops to Standard and Minimal for items that don't fit, prioritizing high-value items (determined by the trimming strategy).
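The progressive-degradation loop can be sketched as follows. The `TrimLevel` names follow this document, but `encode_at` and its cost ratios are illustrative assumptions, not the crate's real API; items are assumed pre-sorted by value (highest first) by the active strategy.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum TrimLevel { Full, Standard, Minimal }

/// Pretend encoder: higher detail levels cost more tokens per item.
/// Ratios mirror the approximate savings cited in this document.
fn encode_at(item_weight: u32, level: TrimLevel) -> u32 {
    match level {
        TrimLevel::Full => item_weight,
        TrimLevel::Standard => item_weight * 56 / 100, // ~44% saving
        TrimLevel::Minimal => item_weight * 8 / 100,   // ~92% saving
    }
}

/// Assign each item the highest detail level that still fits the
/// remaining budget; items that fit at no level are excluded.
fn assign_levels(weights: &[u32], budget: u32) -> Vec<Option<TrimLevel>> {
    let levels = [TrimLevel::Full, TrimLevel::Standard, TrimLevel::Minimal];
    let mut remaining = budget;
    weights
        .iter()
        .map(|&w| {
            for &lvl in &levels {
                let cost = encode_at(w, lvl);
                if cost <= remaining {
                    remaining -= cost;
                    return Some(lvl); // keep at the highest level that fits
                }
            }
            None // excluded entirely; deferred to a later chunk
        })
        .collect()
}

fn main() {
    // Three equally sized items against an 8,000-token budget:
    // two fit at Full, the third only at Standard.
    println!("{:?}", assign_levels(&[3000, 3000, 3000], 8000));
}
```

With a real encoder, `encode_at` would re-serialize the item at the given level and estimate its tokens; the control flow stays the same.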
Memory Usage
All allocations are heap-based and freed after processing. No persistent memory overhead.
The pipeline processes and releases memory synchronously within a single tool call — no background allocations or caches.
Key Takeaway
- TOON Full alone saves 3-17% tokens vs JSON (more with repetitive data structures)
- Trim Levels provide the real power: Standard saves ~44%, Minimal saves ~92%
- Combined with smart trimming: the pipeline maximizes information within any token budget by keeping the most important items at higher detail and less important items at lower detail or excluded entirely
Budget Trimming
The `Pipeline::transform_*()` methods use the budget pipeline internally for ALL output size control. The flow is: format all items → if fits budget, return → else run budget pipeline with strategy → produce chunk 1 + chunk index.
The trimming problem is modeled as a Tree Knapsack Problem (Cho & Shaw, 1997):
maximize Σ_{v ∈ S} p(v)
subject to: Σ_{v ∈ S} w(v) ≤ B, where S is a connected subtree of T containing root(T)
Iterative Pipeline
Algorithm Selection
Chunk-Based Lazy Loading
When data exceeds the token budget, the pipeline splits output into sequential chunks. The first response returns chunk 1 (the highest-value items according to the active strategy) plus a chunk index describing all available chunks.
How It Works
- Budget pipeline determines which items fit in the budget (chunk 1)
- Remaining items are grouped into sequential chunks with content summaries
- The chunk index is appended to the response, describing each chunk
- The agent uses the `chunk: N` parameter in subsequent tool calls to fetch specific chunks
- The agent can stop early if it finds the needed information without reading all chunks
Chunk Index Format
Each chunk entry shows the offset/limit boundaries, a content summary (file paths, counts, line changes), and which chunk is already included in the current response. Use `chunk: N` to fetch a specific chunk.
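A hypothetical rendering of such an index (field names, layout, and values are illustrative, not the pipeline's exact output):

```
chunks[3]{chunk,offset,limit,summary,included}:
 1,0,20,"src: 20 files (+340/-95)",true
 2,20,20,"tests: 20 files (+120/-40)",false
 3,40,7,"docs: 7 files (+55/-10)",false
```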
Provider Metadata
List-type provider responses are wrapped in ProviderResult<T>, which captures upstream pagination and sort metadata alongside the data items.
Metadata Sources
- GitLab: Extracts `X-Total` and `X-Total-Pages` from response headers
- Jira: Extracts `total`, `startAt`, `maxResults` from the JQL response body
Data Flow
SortInfo
SortInfo describes the current ordering and available sort options:
- `sort_by` — the sort field applied to the current response (e.g., `updated_at`, `created_at`)
- `sort_order` — the sort direction (`asc` or `desc`)
- `available_sorts` — list of sort fields the provider supports (e.g., `created_at`, `updated_at`, `priority`)
This metadata is passed through to FormatMetadata so agents can make informed decisions about re-querying with different sort orders or fetching additional pages.
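As a sketch, the shape described above maps onto a struct like the following; the field types (plain strings rather than enums) are assumptions, not the crate's exact definitions:

```rust
/// Hedged sketch of SortInfo; field types are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct SortInfo {
    /// Sort field applied to the current response, e.g. "updated_at".
    pub sort_by: String,
    /// Sort direction: "asc" or "desc".
    pub sort_order: String,
    /// Sort fields the provider supports.
    pub available_sorts: Vec<String>,
}

fn main() {
    let sort = SortInfo {
        sort_by: "updated_at".into(),
        sort_order: "desc".into(),
        available_sorts: vec![
            "created_at".into(),
            "updated_at".into(),
            "priority".into(),
        ],
    };
    println!("{:?}", sort);
}
```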
Trimming Strategies
Each strategy assigns information value to tree nodes based on data type semantics.
1. Element Count (element_count)
For flat lists (issues, MRs). Value decreases by position: first = 1.0, last = 0.3.
Tools: get_issues, get_merge_requests
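A linear interpolation matches those endpoints; the exact decay curve the pipeline uses is an assumption here, but the boundary values (first = 1.0, last = 0.3) come from the description above:

```rust
/// Positional value for a flat list: first item 1.0, last item 0.3,
/// linearly interpolated in between (the curve shape is an assumption).
fn element_count_value(index: usize, total: usize) -> f64 {
    if total <= 1 {
        return 1.0;
    }
    let t = index as f64 / (total - 1) as f64;
    1.0 - t * (1.0 - 0.3)
}

fn main() {
    // First and last of 10 items.
    println!("{:.2} {:.2}", element_count_value(0, 10), element_count_value(9, 10));
}
```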
2. Cascading (cascading)
For comments with chronological decay: p(i) = β^(n-1-i), β = 0.95.
Newest comments are most valuable; the oldest of 50 comments retains only ~8% of the newest comment's value.
Tools: get_issue_comments
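The decay formula above can be checked directly:

```rust
/// p(i) = β^(n - 1 - i) with β = 0.95: the newest comment (i = n - 1)
/// has value 1.0, and value decays geometrically toward older comments.
fn cascading_value(i: usize, n: usize) -> f64 {
    0.95f64.powi((n - 1 - i) as i32)
}

fn main() {
    // Oldest of 50 comments: 0.95^49 ≈ 0.08, i.e. ~8% of the newest.
    println!("{:.3}", cascading_value(0, 50));
}
```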
3. Size-Proportional (size_proportional)
For diffs, weighted by file type importance:
Tools: get_merge_request_diffs
4. Thread-Level (thread_level)
For discussions: resolved = 0.3, unresolved = 1.0. First and last comment in each thread are always preserved.
Tools: get_merge_request_discussions
5. Head+Tail (head_tail)
For logs: 30% head (config/environment), 70% tail (errors/results).
Error patterns (ERROR|FATAL|Exception|panic) get boosted value.
Tools: get_job_logs
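The head/tail split can be sketched as a budget allocation over log lines; the error-pattern boost is omitted for brevity, and the 30/70 proportions come from the description above:

```rust
/// Keep 30% of the line budget from the head of the log (config,
/// environment) and 70% from the tail (errors, results). A sketch:
/// the real pipeline also boosts lines matching ERROR|FATAL|Exception|panic.
fn select_log_lines<'a>(lines: &[&'a str], budget: usize) -> Vec<&'a str> {
    if lines.len() <= budget {
        return lines.to_vec();
    }
    let head_n = budget * 30 / 100;
    let tail_n = budget - head_n;
    let mut out: Vec<&str> = lines[..head_n].to_vec();
    out.extend_from_slice(&lines[lines.len() - tail_n..]);
    out
}

fn main() {
    let lines = ["l0", "l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8", "l9"];
    // Budget 5: 1 head line + 4 tail lines.
    println!("{:?}", select_log_lines(&lines, 5));
}
```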
6. Default (default)
Uniform value 1.0 for all nodes. No semantic trimming.
Tools: get_pipeline, get_users, get_statuses
Strategy Resolution
The StrategyResolver maps tool names to strategies:
1. Exact match in TOML `[format_pipeline.strategies]` overrides
2. Hardcoded defaults by tool name
3. Strip proxy prefix (`cloud__get_issues` → `get_issues`) and retry steps 1-2
4. Fallback to the `default` strategy
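The resolution order can be sketched as a pair of functions; the real StrategyResolver API and its hardcoded table are assumptions here:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the hardcoded tool-to-strategy defaults.
fn hardcoded_default(tool: &str) -> Option<&'static str> {
    match tool {
        "get_issues" | "get_merge_requests" => Some("element_count"),
        "get_issue_comments" => Some("cascading"),
        "get_job_logs" => Some("head_tail"),
        _ => None,
    }
}

/// Resolve a strategy name: TOML override, hardcoded default,
/// proxy-prefix retry, then the "default" fallback.
fn resolve(tool: &str, toml_overrides: &HashMap<String, String>) -> String {
    let lookup = |name: &str| {
        toml_overrides
            .get(name)
            .cloned()
            .or_else(|| hardcoded_default(name).map(|s| s.to_string()))
    };
    lookup(tool)
        // Strip a proxy prefix ("cloud__get_issues" -> "get_issues") and retry.
        .or_else(|| tool.split_once("__").and_then(|(_, rest)| lookup(rest)))
        // No match at all: fall back to the default strategy.
        .unwrap_or_else(|| "default".to_string())
}

fn main() {
    println!("{}", resolve("cloud__get_issues", &HashMap::new()));
}
```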
Pagination via Offset/Limit
The primary pagination mechanism is offset/limit parameters on tool calls. When the pipeline produces a chunk index (see Chunk-Based Lazy Loading), agents use the offset and limit values from the chunk index to fetch specific chunks of data.
This replaces the earlier cursor-based approach with a simpler, stateless model:
- First request returns chunk 1 + chunk index
- Agent reads the chunk index to understand available data
- Agent calls the tool again with `chunk: N` for the desired chunk
- Agent can stop early — no need to consume all chunks sequentially
Token Estimation
Uses char-based approximation (~3.5 chars/token) instead of tiktoken-rs to avoid ~2MB binary size increase. The 20% margin in the budget pipeline compensates for estimation inaccuracy.
Crate Structure
Metadata & Compression Stats
Every format_output() call returns FormatResult with metadata:
NAPI Bridge Integration
When using format_output() from a NAPI bridge, serialize FormatResult as JSON to expose metadata:
Note: The NAPI `callToolWithMetadata()` function is implemented in the consuming project's NAPI bridge layer, not in this repository.
Token Estimation
Tokens are estimated as chars * 10 / 35 (~chars / 3.5), which approximates Claude's tokenizer for mixed English/code content.
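The estimate is a single integer expression:

```rust
/// chars * 10 / 35, i.e. roughly 3.5 characters per token.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() * 10 / 35
}

fn main() {
    // 11 chars -> 110 / 35 -> 3 tokens (integer division).
    println!("{}", estimate_tokens("hello world"));
}
```

Counting `chars()` rather than bytes keeps the estimate stable for non-ASCII content.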