Inference
The inference system provides intelligent request routing to AI model providers. It implements multiple routing strategies (price, latency, reputation, weighted), handles failover, calculates costs, and integrates with TEE attestation for verifiable inference results.
Inference Request Lifecycle
An inference request flows through the Tenzro Network from user to model provider and back. The router handles provider selection, request forwarding, result verification, and settlement.
// Example inference request (conceptual)
{
"model_id": "gemma4-9b",
"input": "Explain quantum computing",
"parameters": {
"max_tokens": 500,
"temperature": 0.7
},
"strategy": "HighestReputation"
}
// Response includes:
// - output: generated text
// - input_tokens: 3
// - output_tokens: 487
// - cost: 0.0525 TNZO
// - provider_id: provider address
// - attestation: optional TEE proof
The request includes the model ID, input text, optional parameters (temperature, max_tokens, etc.), requester address, and timestamp. The response contains the model output, token counts, cost, and optional attestation proof.
Routing Strategies
The inference router supports four routing strategies, each optimizing for different objectives. Users can specify their preferred strategy per request.
Routing Strategies:
- Cheapest: Minimize cost
- LowestLatency: Minimize response time
- HighestReputation: Maximize quality
- Weighted: Balance all factors
Examples:
strategy: "Cheapest" → selects lowest-cost provider
strategy: "LowestLatency" → selects fastest provider
strategy: "HighestReputation" → selects most reliable provider
strategy: "Weighted" → balances price, latency, and reputationCheapest Strategy
Selects the provider with the lowest total cost (input tokens + output tokens). Ideal for batch processing and non-latency-sensitive workloads. Filters active providers, calculates estimated cost for each, and selects the minimum.
Lowest Latency Strategy
Selects the provider with the lowest average latency based on historical metrics. Ideal for interactive applications like chatbots. Tracks average response time per provider and routes to the fastest.
Highest Reputation Strategy
Selects the provider with the highest reputation score. Reputation is calculated from success rate, uptime, and user feedback. Ideal for critical applications requiring maximum reliability.
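A minimal sketch of the three selection rules above (Cheapest, LowestLatency, HighestReputation). The Provider record, its field names, and the estimated_cost helper are illustrative assumptions, not the router's actual API:
# Illustrative provider-selection sketch; Provider fields are assumptions, not the router's real schema.
from dataclasses import dataclass

@dataclass
class Provider:
    provider_id: str
    input_price: float      # TNZO per 1M input tokens
    output_price: float     # TNZO per 1M output tokens
    avg_latency_ms: float   # rolling average response time
    reputation: float       # 0.0 - 1.0
    active: bool = True

def estimated_cost(p: Provider, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * p.input_price + (output_tokens / 1_000_000) * p.output_price

def select(providers, strategy, input_tokens=1500, output_tokens=500):
    active = [p for p in providers if p.active]           # only consider active providers
    if strategy == "Cheapest":
        return min(active, key=lambda p: estimated_cost(p, input_tokens, output_tokens))
    if strategy == "LowestLatency":
        return min(active, key=lambda p: p.avg_latency_ms)
    if strategy == "HighestReputation":
        return max(active, key=lambda p: p.reputation)
    raise ValueError(f"unknown strategy: {strategy}")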
Weighted Strategy
Balances price, latency, and reputation using configurable weights. This is the most flexible strategy for production workloads. Each factor is normalized to 0-1 range and combined using weighted sum.
// Weighted strategy example
{
"weights": {
"price": 0.2, // 20% weight on cost
"latency": 0.6, // 60% weight on speed
"reputation": 0.2 // 20% weight on reliability
}
}
// Each provider gets a score:
// score = (price_weight × price_score) +
// (latency_weight × latency_score) +
// (reputation_weight × reputation_score)
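A sketch of the weighted scoring step using the weights from the JSON above. The min-max normalization and the example numbers are illustrative assumptions; the document only specifies that each factor is normalized to the 0-1 range and combined by weighted sum:
# Weighted-strategy scoring sketch (normalization choice and example numbers are assumptions).
def weighted_scores(costs, latencies, reputations, weights):
    """costs/latencies/reputations: parallel lists, one entry per active provider."""
    def normalize_low(values):                       # lower raw value -> score closer to 1
        lo, hi = min(values), max(values)
        return [1.0 if hi == lo else 1.0 - (v - lo) / (hi - lo) for v in values]
    price_scores = normalize_low(costs)              # cheaper  -> closer to 1
    latency_scores = normalize_low(latencies)        # faster   -> closer to 1
    return [weights["price"] * p + weights["latency"] * l + weights["reputation"] * r
            for p, l, r in zip(price_scores, latency_scores, reputations)]

# Example: three providers scored with the weights shown above (made-up inputs).
scores = weighted_scores(
    costs=[0.0525, 0.0700, 0.0480],
    latencies=[420.0, 180.0, 650.0],
    reputations=[0.92, 0.88, 0.95],
    weights={"price": 0.2, "latency": 0.6, "reputation": 0.2},
)
best = scores.index(max(scores))   # index of the selected provider (here the fastest one wins)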
Cost Calculation
Inference costs are calculated based on input and output token counts. Providers set per-token pricing in TNZO, and the router calculates the total cost before forwarding the request.
// Provider pricing example
Input price: 15 TNZO per 1M tokens
Output price: 60 TNZO per 1M tokens
// Cost calculation for 1500 input, 500 output tokens:
Input cost: (1500 / 1,000,000) × 15 TNZO = 0.0225 TNZO
Output cost: (500 / 1,000,000) × 60 TNZO = 0.03 TNZO
Total cost: 0.0525 TNZO
Cost estimation uses historical token counts for the model. The actual cost is calculated after inference based on the real token counts. Any overpayment is refunded via micropayment channels.
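The arithmetic above can be expressed directly; a minimal sketch, assuming prices are quoted in TNZO per one million tokens as in the example:
# Token-cost arithmetic from the example above (prices in TNZO per 1M tokens).
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost

# 1500 input and 500 output tokens at 15 / 60 TNZO per 1M tokens -> 0.0525 TNZO
assert abs(inference_cost(1500, 500, 15, 60) - 0.0525) < 1e-9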
Payment and Settlement
Inference payments use micropayment channels for per-token billing. Users open a channel with prepaid TNZO, and the provider deducts costs for each inference request.
// Micropayment channel example
1. Open channel with 100 TNZO prepayment
2. Each inference deducts from channel balance (e.g., 0.0525 TNZO)
3. Check remaining balance: 99.9475 TNZO
4. Close channel and settle final balance on-chain
Benefits:
- No on-chain transaction per inference
- Low-latency payments
- Batch settlement when closing channel
Micropayment channels enable high-frequency inference requests without on-chain transaction overhead. Channels are settled on-chain when closed or when balance runs low.
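A conceptual sketch of the channel lifecycle from the steps above; it models only the off-chain balance bookkeeping, not the actual on-chain open and close transactions (class and method names are illustrative):
# Conceptual micropayment-channel bookkeeping (off-chain balance only; names are illustrative).
class MicropaymentChannel:
    def __init__(self, prepaid_tnzo: float):
        self.balance = prepaid_tnzo              # 1. open channel with prepaid TNZO
        self.is_open = True

    def deduct(self, cost_tnzo: float) -> float:
        if not self.is_open or cost_tnzo > self.balance:
            raise RuntimeError("channel closed or insufficient balance")
        self.balance -= cost_tnzo                # 2. each inference deducts from the channel
        return self.balance                      # 3. remaining balance

    def close(self) -> float:
        self.is_open = False                     # 4. final balance is settled on-chain
        return self.balance

channel = MicropaymentChannel(100.0)
channel.deduct(0.0525)
print(channel.balance)   # 99.9475
print(channel.close())   # remaining balance, settled on-chain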
TEE-Attested Inference
Providers running in Trusted Execution Environments (TEE) can attest their inference results. This provides cryptographic proof that the output was computed securely without tampering.
// TEE-attested inference flow
1. User requests inference with require_tee: true
2. Provider runs model in trusted execution environment
3. TEE generates cryptographic attestation binding input→output
4. Response includes attestation proof
5. User verifies attestation to confirm secure execution
Attestation proves:
- Inference ran in genuine TEE hardware
- Output was computed from stated input
- No tampering with model or execution
TEE attestation binds the inference output to the specific model and provider. Users can verify that the output came from the claimed model running in a secure enclave, preventing model substitution attacks.
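A conceptual sketch of the client-side check, assuming the attestation carries hashes of the model, input, and output plus a hardware quote; the field names and hashing scheme are assumptions, and the vendor-specific quote verification is passed in as a caller-supplied callable rather than shown here:
# Conceptual attestation-binding check (field names and scheme are illustrative assumptions).
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def check_attestation(attestation: dict, model_id: str, input_text: str,
                      output_text: str, verify_quote) -> bool:
    """verify_quote: caller-supplied callable for the vendor-specific TEE quote check."""
    bound = (attestation.get("model_hash") == sha256_hex(model_id)        # bound to the claimed model
             and attestation.get("input_hash") == sha256_hex(input_text)  # bound to the stated input
             and attestation.get("output_hash") == sha256_hex(output_text))
    return bound and verify_quote(attestation.get("quote"))               # quote from genuine TEE hardware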
Zero-Knowledge Proof Verification
In addition to TEE attestation, providers can generate zero-knowledge proofs of correct inference execution. This enables verification without revealing the model weights.
// Zero-knowledge proof flow
1. Provider generates ZK proof of inference: proof(model_hash, input_hash, output_hash)
2. Proof confirms output was computed using registered model
3. Model weights remain private (zero-knowledge property)
4. User verifies proof without trusting provider hardware
Benefits over TEE:
- No hardware trust required
- Cryptographic verification
- Model weights never revealed
ZK proofs enable verification without trusting the provider hardware. The proof confirms that the output was computed using the registered model weights without revealing those weights.
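A conceptual sketch of what the requester checks, assuming the proof exposes the three hashes named above as public inputs; the proof-system verifier itself is a caller-supplied callable because the document does not specify it:
# Conceptual ZK verification (public-input matching only; the verifier callable is a placeholder).
import hashlib

def verify_inference_proof(proof: dict, registered_model_hash: str,
                           input_text: str, output_text: str, zk_verify) -> bool:
    """zk_verify: caller-supplied verifier for the underlying proof system."""
    expected = {
        "model_hash": registered_model_hash,                              # from the model registry
        "input_hash": hashlib.sha256(input_text.encode()).hexdigest(),
        "output_hash": hashlib.sha256(output_text.encode()).hexdigest(),
    }
    if any(proof.get(k) != v for k, v in expected.items()):
        return False                                                      # proof is about different data
    return zk_verify(proof["proof_bytes"], public_inputs=expected)        # no model weights needed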
Failover and Retry Logic
The inference router implements automatic failover when a provider is unavailable. If the primary provider fails, the router selects a backup provider using the same strategy.
// Failover configuration
max_retries: 3
timeout: 30 seconds
failover_enabled: true
// Automatic failover sequence:
1. Try primary provider (selected by routing strategy)
2. On failure, mark provider temporarily unavailable (circuit breaker)
3. Select backup provider using same strategy
4. Retry up to max_retries
5. Return error if all providers fail
Circuit breaker prevents cascading failures
Failover integrates with the circuit breaker pattern. Providers that fail consecutively are temporarily removed from the routing pool, preventing cascading failures.
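A sketch of the retry loop with a simple circuit breaker along these lines. The provider-selection and request-sending steps are passed in as callables, and the thresholds and cooldown are illustrative, not the router's actual configuration:
# Illustrative failover loop with a simple circuit breaker (thresholds and helpers are assumptions).
import time

MAX_RETRIES = 3
TIMEOUT_S = 30
FAILURE_THRESHOLD = 3            # consecutive failures before a provider is paused
COOLDOWN_S = 60                  # how long a tripped provider stays out of the pool

failures: dict[str, int] = {}
tripped_until: dict[str, float] = {}

def available(provider_id: str) -> bool:
    return time.time() >= tripped_until.get(provider_id, 0.0)

def record_failure(provider_id: str) -> None:
    failures[provider_id] = failures.get(provider_id, 0) + 1
    if failures[provider_id] >= FAILURE_THRESHOLD:
        tripped_until[provider_id] = time.time() + COOLDOWN_S    # open the circuit

def route_with_failover(request, providers, select_provider, send_request):
    """select_provider/send_request: caller-supplied callables (strategy selection, provider RPC)."""
    last_error = None
    for _ in range(MAX_RETRIES):
        candidates = [p for p in providers if available(p.provider_id)]
        if not candidates:
            break
        provider = select_provider(candidates)                   # 1. pick by routing strategy
        try:
            result = send_request(provider, request, timeout=TIMEOUT_S)
            failures[provider.provider_id] = 0                   # success resets the breaker
            return result
        except Exception as err:                                 # 2-4. failure trips breaker, retry
            record_failure(provider.provider_id)
            last_error = err
    raise RuntimeError(f"all providers failed: {last_error}")    # 5. error if everything fails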
Gossipsub Inference Protocol
Inference requests can be broadcast via the tenzro/inference/1.0.0 gossipsub topic. Multiple providers can compete to fulfill the request, with the fastest response winning.
// Broadcast inference request to network
{
"type": "inference_request",
"request_id": "req_123",
"model_id": "0x1234...",
"input": "Summarize this document: ...",
"max_tokens": 200,
"max_cost": 100000000000000000, // 0.1 TNZO
"requester": "0x5678...",
"timestamp": 1711234567
}
// Provider responds via gossipsub
{
"type": "inference_response",
"request_id": "req_123",
"output": "This document describes...",
"cost": 75000000000000000, // 0.075 TNZO
"provider": "0xabcd...",
"attestation": "0x...",
"timestamp": 1711234570
}
Gossipsub-based inference enables decentralized request routing without a central router. Providers compete on speed and price, with the requester accepting the first valid response.
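A sketch of the requester side of this competition, assuming decoded response messages arrive in order on some subscribed stream and that attestation checking is delegated to a caller-supplied function; field names follow the messages above:
# Requester-side sketch: accept the first valid gossipsub response (fields as in the messages above).
def accept_first_valid(responses, request_id: str, max_cost: int, verify_attestation):
    """responses: iterable of decoded inference_response messages, in arrival order."""
    for msg in responses:
        if msg.get("type") != "inference_response":
            continue
        if msg.get("request_id") != request_id:
            continue                                   # response to a different request
        if msg.get("cost", 0) > max_cost:
            continue                                   # provider exceeded the requester's budget
        if msg.get("attestation") and not verify_attestation(msg["attestation"]):
            continue                                   # attestation present but invalid
        return msg                                     # first valid response wins
    return None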
RPC Methods: tenzro_chat vs tenzro_inferenceRequest
The node exposes two inference methods. For chat applications and coding assistants, always prefer tenzro_chat as it applies the model's chat template for proper formatting.
| Method | Chat Template | Use Case |
|---|---|---|
| tenzro_chat | Yes (auto-applied) | Chat apps, coding assistants, conversational AI |
| tenzro_inferenceRequest | No (raw prompt) | Batch processing, custom prompt formats, embeddings |
// tenzro_chat — recommended for chat applications
// Uses the model's chat template (proper formatting per architecture)
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_chat",
"params": [{
"model_id": "qwen3.5-0.8b",
"message": "What is the capital of France?",
"max_tokens": 200
}],
"id": 1
}'
// tenzro_inferenceRequest — raw prompt, no chat template
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_inferenceRequest",
"params": [{
"model_id": "gemma4-9b",
"input": "What is the capital of France?",
"strategy": "Cheapest",
"max_tokens": 100
}],
"id": 1
}'
Both methods take their parameters as a single flat JSON object (wrapped in the params array), rather than a list of positional arguments. The desktop app uses tenzro_chat to provide a ChatGPT-like interface with proper chat formatting.
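The same two calls expressed with a plain HTTP client; a minimal sketch using only the Python standard library and the public RPC endpoint shown above:
# Minimal JSON-RPC helper for the two inference methods (standard library only).
import json
import urllib.request

RPC_URL = "https://rpc.tenzro.network"

def rpc_call(method: str, params: dict) -> dict:
    payload = json.dumps({"jsonrpc": "2.0", "method": method, "params": [params], "id": 1}).encode()
    req = urllib.request.Request(RPC_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

# Chat-style call: the chat template is applied automatically.
chat = rpc_call("tenzro_chat", {
    "model_id": "qwen3.5-0.8b",
    "message": "What is the capital of France?",
    "max_tokens": 200,
})

# Raw prompt call: no chat template, explicit routing strategy.
raw = rpc_call("tenzro_inferenceRequest", {
    "model_id": "gemma4-9b",
    "input": "What is the capital of France?",
    "strategy": "Cheapest",
    "max_tokens": 100,
})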
Thinking Models and /no_think
Qwen 3, Qwen 3.5, and DeepSeek models have thinking/reasoning mode enabled by default. When thinking mode is active, the model produces <think>...</think> blocks containing chain-of-thought reasoning before the final answer.
// Thinking mode output (default for Qwen 3/3.5, DeepSeek):
// <think>
// The user is asking about the capital of France.
// The capital of France is Paris.
// </think>
// The capital of France is Paris.
// To disable thinking mode, append /no_think to your message:
{
"jsonrpc": "2.0",
"method": "tenzro_chat",
"params": [{
"model_id": "qwen3.5-0.8b",
"message": "Write a hello world function /no_think",
"max_tokens": 200
}],
"id": 1
}
// With /no_think, the model skips chain-of-thought and responds directly.
// This reduces token usage and latency for simple queries.
The /no_think suffix is handled automatically by the chat template. It only works with tenzro_chat, not tenzro_inferenceRequest (which does not apply chat templates).
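When thinking mode is left on, clients often want to separate the reasoning from the final answer; a small sketch that splits the <think>...</think> blocks out of the returned text:
# Split <think>...</think> reasoning from the final answer in a model response.
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str):
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>\nThe user is asking about the capital of France.\n</think>\nThe capital of France is Paris."
)
print(answer)   # The capital of France is Paris.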
Chat Interface
The desktop app and web interface provide a chat-style UI for multi-turn conversations. The chat history is maintained client-side and included in subsequent inference requests.
// Multi-turn conversation example
Turn 1:
Input: "What is quantum computing?"
Output: "Quantum computing uses quantum bits (qubits)..."
Turn 2 (includes history):
Input: "User: What is quantum computing?
Assistant: Quantum computing uses quantum bits...
User: Give an example"
Output: "For example, Shor's algorithm can factor large numbers..."
Chat history management is client-side to minimize costs. Only the necessary context is included in each request, and older turns are pruned when the context window fills.
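A sketch of client-side history management along these lines: the oldest turns are dropped when the concatenated prompt would exceed an assumed context budget. The character-based token estimate is a crude stand-in for a real tokenizer:
# Client-side chat history with naive pruning (token estimate is a rough stand-in for a tokenizer).
class ChatHistory:
    def __init__(self, max_context_tokens: int = 4096):
        self.turns = []                                  # list of (role, text) tuples
        self.max_context_tokens = max_context_tokens

    @staticmethod
    def _estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)                    # ~4 characters per token, very rough

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        while sum(self._estimate_tokens(t) for _, t in self.turns) > self.max_context_tokens:
            self.turns.pop(0)                            # prune the oldest turn first

    def as_prompt(self, new_user_message: str) -> str:
        lines = [f"{role.capitalize()}: {text}" for role, text in self.turns]
        lines.append(f"User: {new_user_message}")
        return "\n".join(lines)

history = ChatHistory()
history.add("user", "What is quantum computing?")
history.add("assistant", "Quantum computing uses quantum bits (qubits)...")
print(history.as_prompt("Give an example"))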
Inference History and Analytics
The wallet and desktop app track inference history for cost analysis and usage monitoring. Users can view total spend, token counts, and provider distribution.
// Query inference history
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_getInferenceHistory",
"params": ["0x5678..."], // user address
"id": 1
}'
// Response
{
"jsonrpc": "2.0",
"result": [
{
"timestamp": 1711234567,
"model_id": "0x1234...",
"provider_id": "0xabcd...",
"input_tokens": 1500,
"output_tokens": 500,
"cost": "0.0525",
"strategy": "Cheapest"
}
],
"id": 1
}
Analytics help users optimize their inference costs by identifying expensive models, high-volume periods, and routing strategy effectiveness.
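A small sketch of the kind of aggregation the apps perform over this history: total spend, token totals, and per-provider cost distribution. Field names follow the result entries shown above; the aggregation itself is illustrative:
# Aggregate an inference-history result into simple spend/usage analytics (fields as in the response above).
from collections import defaultdict

def summarize_history(entries):
    total_cost = sum(float(e["cost"]) for e in entries)
    total_input = sum(e["input_tokens"] for e in entries)
    total_output = sum(e["output_tokens"] for e in entries)
    by_provider = defaultdict(float)
    for e in entries:
        by_provider[e["provider_id"]] += float(e["cost"])
    return {
        "total_cost_tnzo": total_cost,
        "input_tokens": total_input,
        "output_tokens": total_output,
        "cost_by_provider": dict(by_provider),
    }

# Example with the single entry from the response above:
print(summarize_history([{
    "provider_id": "0xabcd...",
    "input_tokens": 1500,
    "output_tokens": 500,
    "cost": "0.0525",
}]))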