Inference

The inference system provides intelligent request routing to AI model providers. It implements multiple routing strategies (price, latency, reputation, weighted), handles failover, calculates costs, and integrates with TEE attestation for verifiable inference results.

Inference Request Lifecycle

An inference request flows through the Tenzro Network from user to model provider and back. The router handles provider selection, request forwarding, result verification, and settlement.

// Example inference request (conceptual)
{
  "model_id": "gemma4-9b",
  "input": "Explain quantum computing",
  "parameters": {
    "max_tokens": 500,
    "temperature": 0.7
  },
  "strategy": "HighestReputation"
}

// Response includes:
// - output: generated text
// - input_tokens: 3
// - output_tokens: 487
// - cost: 0.0293 TNZO
// - provider_id: provider address
// - attestation: optional TEE proof

The request includes the model ID, input text, optional parameters (temperature, max_tokens, etc.), requester address, and timestamp. The response contains the model output, token counts, cost, and optional attestation proof.

Routing Strategies

The inference router supports four routing strategies, each optimizing for different objectives. Users can specify their preferred strategy per request.

Routing Strategies:
  - Cheapest: Minimize cost
  - LowestLatency: Minimize response time
  - HighestReputation: Maximize quality
  - Weighted: Balance all factors

Examples:
  strategy: "Cheapest"           selects lowest-cost provider
  strategy: "LowestLatency"      selects fastest provider
  strategy: "HighestReputation"  selects most reliable provider
  strategy: "Weighted"           balances price, latency, and reputation

Cheapest Strategy

Selects the provider with the lowest total estimated cost (input cost + output cost). Ideal for batch processing and non-latency-sensitive workloads. Filters active providers, calculates estimated cost for each, and selects the minimum.
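
The selection logic above can be sketched in Python. The `Provider` fields and helper names here are illustrative, not the actual router API:

```python
from dataclasses import dataclass

# Sketch of the Cheapest strategy: filter active providers, estimate the
# total cost for the expected token counts, take the minimum.

@dataclass
class Provider:
    id: str
    active: bool
    input_price: float   # TNZO per 1M input tokens
    output_price: float  # TNZO per 1M output tokens

def estimate_cost(p: Provider, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * p.input_price \
         + (output_tokens / 1_000_000) * p.output_price

def select_cheapest(providers, input_tokens, output_tokens):
    active = [p for p in providers if p.active]
    if not active:
        raise RuntimeError("no active providers for this model")
    return min(active, key=lambda p: estimate_cost(p, input_tokens, output_tokens))
```

Note that the selection uses *estimated* cost; the actual charge is computed after inference from the real token counts.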

Lowest Latency Strategy

Selects the provider with the lowest average latency based on historical metrics. Ideal for interactive applications like chatbots. Tracks average response time per provider and routes to the fastest.

Highest Reputation Strategy

Selects the provider with the highest reputation score. Reputation is calculated from success rate, uptime, and user feedback. Ideal for critical applications requiring maximum reliability.

Weighted Strategy

Balances price, latency, and reputation using configurable weights. This is the most flexible strategy for production workloads. Each factor is normalized to 0-1 range and combined using weighted sum.

// Weighted strategy example
{
  "weights": {
    "price": 0.2,       // 20% weight on cost
    "latency": 0.6,     // 60% weight on speed
    "reputation": 0.2   // 20% weight on reliability
  }
}

// Each provider gets a score:
// score = (price_weight × price_score) +
//         (latency_weight × latency_score) +
//         (reputation_weight × reputation_score)
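
A minimal Python sketch of the scoring above: each factor is normalized to 0-1 across the candidate pool, price and latency are inverted (lower is better), and the weighted sum is taken. Field names (`price`, `latency_ms`, `reputation`) are illustrative:

```python
def normalize(value: float, lo: float, hi: float) -> float:
    # Map value into [0, 1]; a degenerate range means all candidates tie.
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

def weighted_score(provider: dict, pool: list, weights: dict) -> float:
    prices = [p["price"] for p in pool]
    latencies = [p["latency_ms"] for p in pool]
    # Lower price/latency is better, so invert; reputation is already 0-1.
    price_score = 1.0 - normalize(provider["price"], min(prices), max(prices))
    latency_score = 1.0 - normalize(provider["latency_ms"], min(latencies), max(latencies))
    return (weights["price"] * price_score
            + weights["latency"] * latency_score
            + weights["reputation"] * provider["reputation"])
```

The router would then pick the candidate with the highest score.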

Cost Calculation

Inference costs are calculated based on input and output token counts. Providers set per-token pricing in TNZO, and the router calculates the total cost before forwarding the request.

// Provider pricing example
Input price:  15 TNZO per 1M tokens
Output price: 60 TNZO per 1M tokens

// Cost calculation for 1500 input, 500 output tokens:
Input cost:  (1500 / 1,000,000) × 15 TNZO = 0.0225 TNZO
Output cost: (500 / 1,000,000) × 60 TNZO = 0.03 TNZO
Total cost:  0.0525 TNZO

Cost estimation uses historical token counts for the model. The actual cost is calculated after inference based on the real token counts. Any overpayment is refunded via micropayment channels.
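
The calculation above as a small Python helper (the function name is ours, not a network API):

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    # Pricing is quoted per 1M tokens, so scale the counts down.
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Matches the worked example: 1500 input + 500 output tokens
cost = inference_cost(1500, 500, 15, 60)  # 0.0225 + 0.03 = 0.0525 TNZO
```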

Payment and Settlement

Inference payments use micropayment channels for per-token billing. Users open a channel with prepaid TNZO, and the provider deducts costs for each inference request.

// Micropayment channel example
1. Open channel with 100 TNZO prepayment
2. Each inference deducts from channel balance (e.g., 0.0525 TNZO)
3. Check remaining balance: 99.9475 TNZO
4. Close channel and settle final balance on-chain

Benefits:
- No on-chain transaction per inference
- Low-latency payments
- Batch settlement when closing channel

Micropayment channels enable high-frequency inference requests without on-chain transaction overhead. Channels are settled on-chain when closed or when balance runs low.
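
A minimal sketch of the channel lifecycle above, assuming a simple in-memory balance; the real open/close steps are on-chain and the settlement logic is more involved:

```python
class MicropaymentChannel:
    """In-memory sketch: only open (deposit) and close (settle) would touch
    the chain; per-inference deductions are off-chain balance updates."""
    def __init__(self, deposit: float):
        self.balance = deposit
        self.is_open = True

    def deduct(self, cost: float) -> float:
        if not self.is_open:
            raise RuntimeError("channel closed")
        if cost > self.balance:
            raise RuntimeError("insufficient channel balance")
        self.balance -= cost
        return self.balance

    def close(self) -> float:
        # On-chain settlement would happen here; the remainder is refunded.
        self.is_open = False
        return self.balance
```

Opening with 100 TNZO and deducting 0.0525 TNZO leaves 99.9475 TNZO, matching the sequence above.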

TEE-Attested Inference

Providers running in Trusted Execution Environments (TEE) can attest their inference results. This provides cryptographic proof that the output was computed securely without tampering.

// TEE-attested inference flow
1. User requests inference with require_tee: true
2. Provider runs model in trusted execution environment
3. TEE generates cryptographic attestation binding input→output
4. Response includes attestation proof
5. User verifies attestation to confirm secure execution

Attestation proves:
- Inference ran in genuine TEE hardware
- Output was computed from stated input
- No tampering with model or execution

TEE attestation binds the inference output to the specific model and provider. Users can verify that the output came from the claimed model running in a secure enclave, preventing model substitution attacks.
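
A hedged sketch of the hash-binding check a client might perform, assuming the attestation carries `input_hash`, `output_hash`, and `model_id` fields (these field names are illustrative; real verification must also validate the TEE quote's signature chain against vendor root keys):

```python
import hashlib

def check_attestation_binding(attestation: dict, input_text: str,
                              output_text: str, model_id: str) -> bool:
    # Recompute the hashes locally and compare with the attestation's claims.
    # NOTE: this only checks the input->output->model binding, not the TEE
    # quote signature itself.
    return (attestation.get("input_hash") == hashlib.sha256(input_text.encode()).hexdigest()
            and attestation.get("output_hash") == hashlib.sha256(output_text.encode()).hexdigest()
            and attestation.get("model_id") == model_id)
```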

Zero-Knowledge Proof Verification

In addition to TEE attestation, providers can generate zero-knowledge proofs of correct inference execution. This enables verification without revealing the model weights.

// Zero-knowledge proof flow
1. Provider generates ZK proof of inference: proof(model_hash, input_hash, output_hash)
2. Proof confirms output was computed using registered model
3. Model weights remain private (zero-knowledge property)
4. User verifies proof without trusting provider hardware

Benefits over TEE:
- No hardware trust required
- Cryptographic verification
- Model weights never revealed

ZK proofs enable verification without trusting the provider hardware. The proof confirms that the output was computed using the registered model weights without revealing those weights.

Failover and Retry Logic

The inference router implements automatic failover when a provider is unavailable. If the primary provider fails, the router selects a backup provider using the same strategy.

// Failover configuration
max_retries: 3
timeout: 30 seconds
failover_enabled: true

// Automatic failover sequence:
1. Try primary provider (selected by routing strategy)
2. On failure, mark provider temporarily unavailable (circuit breaker)
3. Select backup provider using same strategy
4. Retry up to max_retries
5. Return error if all providers fail

Failover integrates with the circuit breaker pattern. Providers that fail consecutively are temporarily removed from the routing pool, preventing cascading failures.
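
The retry sequence above can be sketched as follows; the class and function names are ours, and the production circuit breaker is more sophisticated:

```python
import time

class CircuitBreaker:
    """Failed providers are excluded from routing for a cooldown period."""
    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self.tripped = {}  # provider_id -> time of last failure

    def record_failure(self, provider_id):
        self.tripped[provider_id] = time.monotonic()

    def available(self, provider_id) -> bool:
        t = self.tripped.get(provider_id)
        return t is None or time.monotonic() - t >= self.cooldown_s

def route_with_failover(providers, send, breaker, max_retries=3):
    """Try providers in strategy order, tripping the breaker on each failure."""
    attempts = 0
    for provider in providers:  # already ordered by the routing strategy
        if attempts >= max_retries or not breaker.available(provider):
            continue
        attempts += 1
        try:
            return send(provider)             # forward the inference request
        except Exception:
            breaker.record_failure(provider)  # mark temporarily unavailable
    raise RuntimeError("all providers failed")
```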

Gossipsub Inference Protocol

Inference requests can be broadcast via the tenzro/inference/1.0.0 gossipsub topic. Multiple providers can compete to fulfill the request, with the fastest response winning.

// Broadcast inference request to network
{
  "type": "inference_request",
  "request_id": "req_123",
  "model_id": "0x1234...",
  "input": "Summarize this document: ...",
  "max_tokens": 200,
  "max_cost": 100000000000000000,  // 0.1 TNZO
  "requester": "0x5678...",
  "timestamp": 1711234567
}

// Provider responds via gossipsub
{
  "type": "inference_response",
  "request_id": "req_123",
  "output": "This document describes...",
  "cost": 75000000000000000,  // 0.075 TNZO
  "provider": "0xabcd...",
  "attestation": "0x...",
  "timestamp": 1711234570
}

Gossipsub-based inference enables decentralized request routing without a central router. Providers compete on speed and price, with the requester accepting the first valid response.
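
The accept-first-valid rule can be sketched as below; field names mirror the gossipsub messages above, and `accept_first_valid` is an illustrative client-side helper:

```python
def accept_first_valid(responses, request):
    # Responses are processed in arrival order; the first one that matches
    # the request_id and respects max_cost wins.
    for resp in responses:
        if resp.get("type") != "inference_response":
            continue
        if resp.get("request_id") != request["request_id"]:
            continue
        if resp.get("cost", 0) > request["max_cost"]:
            continue
        return resp
    return None
```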

RPC Methods: tenzro_chat vs tenzro_inferenceRequest

The node exposes two inference methods. For chat applications and coding assistants, always prefer tenzro_chat as it applies the model's chat template for proper formatting.

Method                   Chat Template        Use Case
tenzro_chat              Yes (auto-applied)   Chat apps, coding assistants, conversational AI
tenzro_inferenceRequest  No (raw prompt)      Batch processing, custom prompt formats, embeddings

// tenzro_chat — recommended for chat applications
// Uses the model's chat template (proper formatting per architecture)
curl -X POST https://rpc.tenzro.network \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tenzro_chat",
    "params": [{
      "model_id": "qwen3.5-0.8b",
      "message": "What is the capital of France?",
      "max_tokens": 200
    }],
    "id": 1
  }'

// tenzro_inferenceRequest — raw prompt, no chat template
curl -X POST https://rpc.tenzro.network \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tenzro_inferenceRequest",
    "params": [{
      "model_id": "gemma4-9b",
      "input": "What is the capital of France?",
      "strategy": "Cheapest",
      "max_tokens": 100
    }],
    "id": 1
  }'

Both methods accept a flat JSON object as params (not an array of arguments). The desktop app uses tenzro_chat to provide a ChatGPT-like interface with proper chat formatting.

Thinking Models and /no_think

Qwen 3, Qwen 3.5, and DeepSeek models have thinking/reasoning mode enabled by default. When thinking mode is active, the model produces <think>...</think> blocks containing chain-of-thought reasoning before the final answer.

// Thinking mode output (default for Qwen 3/3.5, DeepSeek):
// <think>
// The user is asking about the capital of France.
// The capital of France is Paris.
// </think>
// The capital of France is Paris.

// To disable thinking mode, append /no_think to your message:
{
  "jsonrpc": "2.0",
  "method": "tenzro_chat",
  "params": [{
    "model_id": "qwen3.5-0.8b",
    "message": "Write a hello world function /no_think",
    "max_tokens": 200
  }],
  "id": 1
}

// With /no_think, the model skips chain-of-thought and responds directly.
// This reduces token usage and latency for simple queries.

The /no_think suffix is handled automatically by the chat template. It only works with tenzro_chat, not tenzro_inferenceRequest (which does not apply chat templates).

Chat Interface

The desktop app and web interface provide a chat-style UI for multi-turn conversations. The chat history is maintained client-side and included in subsequent inference requests.

// Multi-turn conversation example

Turn 1:
Input: "What is quantum computing?"
Output: "Quantum computing uses quantum bits (qubits)..."

Turn 2 (includes history):
Input: "User: What is quantum computing?
        Assistant: Quantum computing uses quantum bits...
        User: Give an example"
Output: "For example, Shor's algorithm can factor large numbers..."

Chat history management is client-side to minimize costs. Only the necessary context is included in each request, and older turns are pruned when the context window fills.
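
Client-side pruning can be sketched with a character budget as a rough stand-in for token counting (illustrative only; a real client would count tokens with the model's tokenizer):

```python
def prune_history(history, max_chars):
    # Walk from the newest turn backwards, keeping turns until the budget
    # is exhausted; the oldest turns are dropped first.
    kept, used = [], 0
    for turn in reversed(history):
        if used + len(turn) > max_chars:
            break
        kept.append(turn)
        used += len(turn)
    return list(reversed(kept))
```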

Inference History and Analytics

The wallet and desktop app track inference history for cost analysis and usage monitoring. Users can view total spend, token counts, and provider distribution.

// Query inference history
curl -X POST https://rpc.tenzro.network \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tenzro_getInferenceHistory",
    "params": ["0x5678..."],  // user address
    "id": 1
  }'

// Response
{
  "jsonrpc": "2.0",
  "result": [
    {
      "timestamp": 1711234567,
      "model_id": "0x1234...",
      "provider_id": "0xabcd...",
      "input_tokens": 1500,
      "output_tokens": 500,
      "cost": "0.0525",
      "strategy": "Cheapest"
    }
  ],
  "id": 1
}

Analytics help users optimize their inference costs by identifying expensive models, high-volume periods, and routing strategy effectiveness.
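
A sketch of client-side aggregation over `tenzro_getInferenceHistory` results (the `summarize_history` helper is ours, not a network API):

```python
from collections import defaultdict

def summarize_history(entries):
    # Aggregate history entries: total spend, spend per provider, and
    # overall token counts. Costs are returned as decimal strings.
    total = 0.0
    by_provider = defaultdict(float)
    tokens = {"input": 0, "output": 0}
    for e in entries:
        cost = float(e["cost"])
        total += cost
        by_provider[e["provider_id"]] += cost
        tokens["input"] += e["input_tokens"]
        tokens["output"] += e["output_tokens"]
    return {"total_cost": total, "by_provider": dict(by_provider), "tokens": tokens}
```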