Note: understanding tokens helps you design agents that are faster, more stable, and cheaper to operate at scale. See: Pricing
What is a token
A token is a unit of text that the model processes. It does not map 1:1 to “one word”: it can be a complete word or a fragment. In a typical execution, you consume tokens for:
- Input: the user’s message + the agent’s instructions + the context you inject.
- Output: the model’s response.
- Tools and nodes: for example, when a node fetches data (like a JSON) and that data is used as context.
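A common budgeting heuristic for English text is roughly 4 characters per token. The sketch below uses that heuristic; the real count depends on the specific model's tokenizer, so treat it as an estimate for planning, not billing:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text.

    The real count depends on the model's tokenizer; treat this as a
    budgeting heuristic, not an exact figure.
    """
    return max(1, len(text) // 4)

# Hypothetical prompt and tool payload, to show where input tokens come from.
prompt = "Summarize the customer's last order in one sentence."
tool_payload = '{"order_id": 123, "items": ["shoes", "socks"]}'

# Input tokens ≈ instructions + context + tool data; output is billed separately.
total_input = estimate_tokens(prompt) + estimate_tokens(tool_payload)
print(total_input)
```

Notice that the tool payload counts toward input just like the prompt itself does.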
The mental rule: tokens = fuel
1) More weight, more consumption
If your agent loads too many instructions, examples, redundant rules, or unnecessary data, the prompt “weighs” more. Common causes:
- You write long or repeated instructions.
- You include too much history or “copy-paste” information into the context.
- You receive tool responses with large payloads.
2) The further you travel, the more you spend
Conversations with many turns tend to accumulate useful context… and sometimes noise too. If your flow depends on long history, consider:
- Summarizing or normalizing key information.
- Saving only what you need for the next step (not the entire chat).
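One way to apply both ideas is to replace older turns with a compact summary and keep only the most recent ones. A minimal sketch (the naive truncation used for `summary` is a placeholder; in practice you might call a lightweight model to summarize instead):

```python
def compact_history(turns: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep a short summary of older turns plus the most recent ones."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    # Placeholder summarization: clip each old turn to its first 40 chars.
    summary = "Earlier: " + "; ".join(t["content"][:40] for t in older)
    return [{"role": "system", "content": summary}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history)  # 1 summary message + 4 recent turns
```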
3) The model also matters
“Larger” or more capable models tend to be more expensive to run. In general:
- Use a specialized or lightweight model for simple tasks (routing, validations, extraction).
- Reserve a more capable model for complex reasoning or richer generation.
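That split can be made explicit with a small routing function. The model names below are hypothetical; substitute whatever your platform actually offers:

```python
# Hypothetical model identifiers, not real product names.
LIGHT_MODEL = "small-fast-model"
HEAVY_MODEL = "large-capable-model"

SIMPLE_TASKS = {"routing", "validation", "extraction"}

def pick_model(task: str) -> str:
    """Send simple tasks to the cheap model; reserve the capable one."""
    return LIGHT_MODEL if task in SIMPLE_TASKS else HEAVY_MODEL
```

Even a static mapping like this avoids paying heavy-model prices on every step of the flow.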
What typically drives consumption in a flow
- Extensive prompts (especially if they include repeated text).
- Integrations/APIs that return too much content (huge catalogs, logs, unfiltered JSONs).
- Variables persisted without criteria that you include again in every step.
- Long responses when the user only needs a short/structured output.
Warning signs
If you notice any of these symptoms, you are probably consuming more tokens than necessary:

| Symptom | Likely cause | Solution |
|---|---|---|
| High latency (>5s) | Context too large | Reduce history, filter data |
| Truncated responses | Output limit reached | Request shorter or structured responses |
| “context length exceeded” error | Input exceeds the limit | Reduce instructions or context |
| Inconsistent responses | Too much noise in the prompt | Clean up and prioritize relevant information |
| Unexpected costs | Tool/API tokens | Filter integration responses |
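For the “context length exceeded” case in particular, it helps to trim before sending rather than react to the error. A sketch using the same ~4 chars/token heuristic (the budget value is model-specific and chosen here only for illustration):

```python
def fit_to_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the oldest messages until the estimated total fits the budget."""
    def est(text: str) -> int:
        return max(1, len(text) // 4)  # rough ~4 chars/token heuristic

    kept = list(messages)
    while len(kept) > 1 and sum(est(m) for m in kept) > max_tokens:
        kept.pop(0)  # sacrifice the oldest context first
    return kept

trimmed = fit_to_budget(["a" * 40] * 5, max_tokens=25)
```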
Example: before vs after
- Before (heavy)
- After (optimized)
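As a rough illustration of the contrast, here are two hypothetical prompts for the same task (neither is from a real flow; the bracketed placeholders stand in for bulk data):

```python
# "Before": repeated rules, full catalog, and entire history injected.
before = """You are a helpful assistant. You are an assistant that helps users.
Always be polite. Always be courteous and polite at all times.
Here is the full product catalog: [... thousands of lines ...]
Here is the entire chat history: [... every previous turn ...]
Answer the user's question about their order status."""

# "After": one rule, only the context this step needs, explicit length.
after = """Answer the user's question about their order status.
Context: order #123 is 'shipped', ETA Friday.
Reply in one short sentence."""

print(len(before), "chars vs", len(after), "chars")
```

The optimized version carries the same actionable information for this step at a fraction of the input size.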
Best practices for optimizing (without losing quality)
Make “lean” prompts
A good prompt tends to be:
- Brief
- With clear rules
- Without unnecessary examples
- With explicit output format (if applicable)
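Putting those four traits together, a lean prompt might look like this (a hypothetical extraction task, invented for illustration):

```python
# Brief, clear rules, no unnecessary examples, explicit output format.
lean_prompt = (
    "Extract the customer's name and city from the message below.\n"
    "Rules: if a field is missing, use null.\n"
    'Output JSON only: {"name": ..., "city": ...}\n\n'
    "Message: Hi, I'm Ana from Lisbon."
)
print(lean_prompt)
```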
Decide where each piece of data lives: Context vs Memory
Not everything needs to persist.
- Context: lives only during the current execution. Use it for temporary calculations and intermediate steps. See: Context
- Memory: persists across conversations/skills with a time-to-live (TTL) control. Use it for data you truly reuse (e.g., preferences or identifiers). See: Memory
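To make the distinction concrete, here is a toy model of the two lifetimes. This is not the platform's actual Context or Memory API, just a sketch of the behavior: context is an in-process dict that dies with the execution, while memory persists with a TTL:

```python
import time

class MemoryStore:
    """Toy persistent store with TTL; a real Memory API will differ."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() > expires_at:
            del self._data[key]  # expired: behave as if never stored
            return None
        return value

# Context: a plain dict, discarded when the execution ends.
context = {"intermediate_total": 42}

# Memory: survives across executions, bounded by its TTL.
memory = MemoryStore()
memory.set("user:preferred_language", "en", ttl_seconds=86400)
```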
Limit the information you bring from tools
If you use APIs, filter at the source:
- Request only the necessary fields.
- Paginate results.
- Avoid fetching blobs or complete catalogs “just in case.”
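When you cannot filter at the source, at least strip the payload before it enters the context. A minimal sketch with an invented product record:

```python
def pick_fields(record: dict, fields: tuple[str, ...]) -> dict:
    """Keep only the fields the next step actually needs."""
    return {k: record[k] for k in fields if k in record}

# Hypothetical API response: most of it is weight you don't need.
raw = {
    "id": 7,
    "name": "Blue sneakers",
    "price": 59.9,
    "description": "A very long marketing blurb...",
    "reviews": ["..."] * 200,  # large payload irrelevant to this step
}

slim = pick_fields(raw, ("id", "name", "price"))
```

Injecting `slim` instead of `raw` into the context keeps every subsequent step cheaper.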