The idea behind cascade routing is straightforward: try the cheapest provider first. If the answer is good enough, stop. If not, escalate to the next tier with the original question plus the first attempt as context.
Pura's cascade evaluates four signals to decide whether an answer meets the bar: response length relative to prompt complexity, hedging language, refusal patterns, and structural completeness. Each signal produces a 0-1 score. The weighted average determines whether to accept or escalate.
Three tiers: Groq (tier 1), Gemini (tier 2), OpenAI/Anthropic (tier 3). Default confidence threshold is 0.7.
Based on gateway traffic patterns before cascade was available:
If those ratios hold, blended cost drops roughly 73% compared to sending everything to GPT-4o.
curl https://api.pura.xyz/v1/chat/completions \
-H "Authorization: Bearer $PURA_KEY" \
-d '{
"messages": [{"role": "user", "content": "What is backpressure?"}],
"routing": {"cascade": true}
}'
The response includes X-Pura-Cascade-Depth, X-Pura-Cascade-Savings, and X-Pura-Confidence headers so you can see exactly what happened.
Check back for real numbers.