Inference
The inference system provides intelligent request routing to AI model providers. It implements multiple routing strategies (price, latency, reputation, weighted), handles failover, calculates costs, and integrates with TEE attestation for verifiable inference results.
Inference Request Lifecycle
An inference request flows through the Tenzro Network from user to model provider and back. The router handles provider selection, request forwarding, result verification, and settlement.
// Example inference request (conceptual)
{
"model_id": "gemma4-9b",
"input": "Explain quantum computing",
"parameters": {
"max_tokens": 500,
"temperature": 0.7
},
"strategy": "HighestReputation"
}
// Response includes:
// - output: generated text
// - input_tokens: 3
// - output_tokens: 487
// - cost: 0.0525 TNZO
// - provider_id: provider address
// - attestation: optional TEE proof
The request includes the model ID, input text, optional parameters (temperature, max_tokens, etc.), requester address, and timestamp. The response contains the model output, token counts, cost, and optional attestation proof.
Routing Strategies
The inference router supports four routing strategies, each optimizing for different objectives. Users can specify their preferred strategy per request.
Routing Strategies:
- Cheapest: Minimize cost
- LowestLatency: Minimize response time
- HighestReputation: Maximize quality
- Weighted: Balance all factors
Examples:
strategy: "Cheapest" → selects lowest-cost provider
strategy: "LowestLatency" → selects fastest provider
strategy: "HighestReputation" → selects most reliable provider
strategy: "Weighted" → balances price, latency, and reputationCheapest Strategy
Selects the provider with the lowest total cost (input tokens + output tokens). Ideal for batch processing and non-latency-sensitive workloads. Filters active providers, calculates estimated cost for each, and selects the minimum.
Lowest Latency Strategy
Selects the provider with the lowest average latency based on historical metrics. Ideal for interactive applications like chatbots. Tracks average response time per provider and routes to the fastest.
Highest Reputation Strategy
Selects the provider with the highest reputation score. Reputation is calculated from success rate, uptime, and user feedback. Ideal for critical applications requiring maximum reliability.
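A minimal sketch of the three selection rules above (Cheapest, LowestLatency, HighestReputation). The Provider record, its field names, and the estimated_cost helper are illustrative assumptions, not the router's actual API:
# Illustrative provider-selection sketch; Provider fields are assumptions, not the router's real schema.
from dataclasses import dataclass

@dataclass
class Provider:
    provider_id: str
    input_price: float      # TNZO per 1M input tokens
    output_price: float     # TNZO per 1M output tokens
    avg_latency_ms: float   # rolling average response time
    reputation: float       # 0.0 - 1.0
    active: bool = True

def estimated_cost(p: Provider, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * p.input_price + (output_tokens / 1_000_000) * p.output_price

def select(providers, strategy, input_tokens=1500, output_tokens=500):
    active = [p for p in providers if p.active]           # only consider active providers
    if strategy == "Cheapest":
        return min(active, key=lambda p: estimated_cost(p, input_tokens, output_tokens))
    if strategy == "LowestLatency":
        return min(active, key=lambda p: p.avg_latency_ms)
    if strategy == "HighestReputation":
        return max(active, key=lambda p: p.reputation)
    raise ValueError(f"unknown strategy: {strategy}")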
Weighted Strategy
Balances price, latency, and reputation using configurable weights. This is the most flexible strategy for production workloads. Each factor is normalized to 0-1 range and combined using weighted sum.
// Weighted strategy example
{
"weights": {
"price": 0.2, // 20% weight on cost
"latency": 0.6, // 60% weight on speed
"reputation": 0.2 // 20% weight on reliability
}
}
// Each provider gets a score:
// score = (price_weight × price_score) +
// (latency_weight × latency_score) +
// (reputation_weight × reputation_score)
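A sketch of the weighted scoring step using the weights from the JSON above. The min-max normalization and the example numbers are illustrative assumptions; the document only specifies that each factor is normalized to the 0-1 range and combined by weighted sum:
# Weighted-strategy scoring sketch (normalization choice and example numbers are assumptions).
def weighted_scores(costs, latencies, reputations, weights):
    """costs/latencies/reputations: parallel lists, one entry per active provider."""
    def normalize_low(values):                       # lower raw value -> score closer to 1
        lo, hi = min(values), max(values)
        return [1.0 if hi == lo else 1.0 - (v - lo) / (hi - lo) for v in values]
    price_scores = normalize_low(costs)              # cheaper  -> closer to 1
    latency_scores = normalize_low(latencies)        # faster   -> closer to 1
    return [weights["price"] * p + weights["latency"] * l + weights["reputation"] * r
            for p, l, r in zip(price_scores, latency_scores, reputations)]

# Example: three providers scored with the weights shown above (made-up inputs).
scores = weighted_scores(
    costs=[0.0525, 0.0700, 0.0480],
    latencies=[420.0, 180.0, 650.0],
    reputations=[0.92, 0.88, 0.95],
    weights={"price": 0.2, "latency": 0.6, "reputation": 0.2},
)
best = scores.index(max(scores))   # index of the selected provider (here the fastest one wins)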
Cost Calculation
Inference costs are calculated based on input and output token counts. Providers set per-token pricing in TNZO, and the router calculates the total cost before forwarding the request.
// Provider pricing example
Input price: 15 TNZO per 1M tokens
Output price: 60 TNZO per 1M tokens
// Cost calculation for 1500 input, 500 output tokens:
Input cost: (1500 / 1,000,000) × 15 TNZO = 0.0225 TNZO
Output cost: (500 / 1,000,000) × 60 TNZO = 0.03 TNZO
Total cost: 0.0525 TNZO
Cost estimation uses historical token counts for the model. The actual cost is calculated after inference based on the real token counts. Any overpayment is refunded via micropayment channels.
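The arithmetic above can be expressed directly; a minimal sketch, assuming prices are quoted in TNZO per one million tokens as in the example:
# Token-cost arithmetic from the example above (prices in TNZO per 1M tokens).
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost

# 1500 input and 500 output tokens at 15 / 60 TNZO per 1M tokens -> 0.0525 TNZO
assert abs(inference_cost(1500, 500, 15, 60) - 0.0525) < 1e-9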
Payment and Settlement
Inference payments use micropayment channels for per-token billing. Users open a channel with prepaid TNZO, and the provider deducts costs for each inference request.
// Micropayment channel example
1. Open channel with 100 TNZO prepayment
2. Each inference deducts from channel balance (e.g., 0.0525 TNZO)
3. Check remaining balance: 99.9475 TNZO
4. Close channel and settle final balance on-chain
Benefits:
- No on-chain transaction per inference
- Low-latency payments
- Batch settlement when closing channel
Micropayment channels enable high-frequency inference requests without on-chain transaction overhead. Channels are settled on-chain when closed or when balance runs low.
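A conceptual sketch of the channel lifecycle from the steps above; it models only the off-chain balance bookkeeping, not the actual on-chain open and close transactions (class and method names are illustrative):
# Conceptual micropayment-channel bookkeeping (off-chain balance only; names are illustrative).
class MicropaymentChannel:
    def __init__(self, prepaid_tnzo: float):
        self.balance = prepaid_tnzo              # 1. open channel with prepaid TNZO
        self.is_open = True

    def deduct(self, cost_tnzo: float) -> float:
        if not self.is_open or cost_tnzo > self.balance:
            raise RuntimeError("channel closed or insufficient balance")
        self.balance -= cost_tnzo                # 2. each inference deducts from the channel
        return self.balance                      # 3. remaining balance

    def close(self) -> float:
        self.is_open = False                     # 4. final balance is settled on-chain
        return self.balance

channel = MicropaymentChannel(100.0)
channel.deduct(0.0525)
print(channel.balance)   # 99.9475
print(channel.close())   # remaining balance, settled on-chain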
TEE-Attested Inference
Providers running in Trusted Execution Environments (TEE) can attest their inference results. This provides cryptographic proof that the output was computed securely without tampering.
// TEE-attested inference flow
1. User requests inference with require_tee: true
2. Provider runs model in trusted execution environment
3. TEE generates cryptographic attestation binding input→output
4. Response includes attestation proof
5. User verifies attestation to confirm secure execution
Attestation proves:
- Inference ran in genuine TEE hardware
- Output was computed from stated input
- No tampering with model or execution
TEE attestation binds the inference output to the specific model and provider. Users can verify that the output came from the claimed model running in a secure enclave, preventing model substitution attacks.
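A conceptual sketch of the client-side check, assuming the attestation carries hashes of the model, input, and output plus a hardware quote; the field names and hashing scheme are assumptions, and the vendor-specific quote verification is passed in as a caller-supplied callable rather than shown here:
# Conceptual attestation-binding check (field names and scheme are illustrative assumptions).
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def check_attestation(attestation: dict, model_id: str, input_text: str,
                      output_text: str, verify_quote) -> bool:
    """verify_quote: caller-supplied callable for the vendor-specific TEE quote check."""
    bound = (attestation.get("model_hash") == sha256_hex(model_id)        # bound to the claimed model
             and attestation.get("input_hash") == sha256_hex(input_text)  # bound to the stated input
             and attestation.get("output_hash") == sha256_hex(output_text))
    return bound and verify_quote(attestation.get("quote"))               # quote from genuine TEE hardware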
Zero-Knowledge Proof Verification
In addition to TEE attestation, providers can generate zero-knowledge proofs of correct inference execution. This enables verification without revealing the model weights.
// Zero-knowledge proof flow
1. Provider generates ZK proof of inference: proof(model_hash, input_hash, output_hash)
2. Proof confirms output was computed using registered model
3. Model weights remain private (zero-knowledge property)
4. User verifies proof without trusting provider hardware
Benefits over TEE:
- No hardware trust required
- Cryptographic verification
- Model weights never revealed
ZK proofs enable verification without trusting the provider hardware. The proof confirms that the output was computed using the registered model weights without revealing those weights.
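A conceptual sketch of what the requester checks, assuming the proof exposes the three hashes named above as public inputs; the proof-system verifier itself is a caller-supplied callable because the document does not specify it:
# Conceptual ZK verification (public-input matching only; the verifier callable is a placeholder).
import hashlib

def verify_inference_proof(proof: dict, registered_model_hash: str,
                           input_text: str, output_text: str, zk_verify) -> bool:
    """zk_verify: caller-supplied verifier for the underlying proof system."""
    expected = {
        "model_hash": registered_model_hash,                              # from the model registry
        "input_hash": hashlib.sha256(input_text.encode()).hexdigest(),
        "output_hash": hashlib.sha256(output_text.encode()).hexdigest(),
    }
    if any(proof.get(k) != v for k, v in expected.items()):
        return False                                                      # proof is about different data
    return zk_verify(proof["proof_bytes"], public_inputs=expected)        # no model weights needed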
Failover and Retry Logic
The inference router implements automatic failover when a provider is unavailable. If the primary provider fails, the router selects a backup provider using the same strategy.
// Failover configuration
max_retries: 3
timeout: 30 seconds
failover_enabled: true
// Automatic failover sequence:
1. Try primary provider (selected by routing strategy)
2. On failure, mark provider temporarily unavailable (circuit breaker)
3. Select backup provider using same strategy
4. Retry up to max_retries
5. Return error if all providers fail
Circuit breaker prevents cascading failures
Failover integrates with the circuit breaker pattern. Providers that fail consecutively are temporarily removed from the routing pool, preventing cascading failures.
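A sketch of the retry loop with a simple circuit breaker along these lines. The provider-selection and request-sending steps are passed in as callables, and the thresholds and cooldown are illustrative, not the router's actual configuration:
# Illustrative failover loop with a simple circuit breaker (thresholds and helpers are assumptions).
import time

MAX_RETRIES = 3
TIMEOUT_S = 30
FAILURE_THRESHOLD = 3            # consecutive failures before a provider is paused
COOLDOWN_S = 60                  # how long a tripped provider stays out of the pool

failures: dict[str, int] = {}
tripped_until: dict[str, float] = {}

def available(provider_id: str) -> bool:
    return time.time() >= tripped_until.get(provider_id, 0.0)

def record_failure(provider_id: str) -> None:
    failures[provider_id] = failures.get(provider_id, 0) + 1
    if failures[provider_id] >= FAILURE_THRESHOLD:
        tripped_until[provider_id] = time.time() + COOLDOWN_S    # open the circuit

def route_with_failover(request, providers, select_provider, send_request):
    """select_provider/send_request: caller-supplied callables (strategy selection, provider RPC)."""
    last_error = None
    for _ in range(MAX_RETRIES):
        candidates = [p for p in providers if available(p.provider_id)]
        if not candidates:
            break
        provider = select_provider(candidates)                   # 1. pick by routing strategy
        try:
            result = send_request(provider, request, timeout=TIMEOUT_S)
            failures[provider.provider_id] = 0                   # success resets the breaker
            return result
        except Exception as err:                                 # 2-4. failure trips breaker, retry
            record_failure(provider.provider_id)
            last_error = err
    raise RuntimeError(f"all providers failed: {last_error}")    # 5. error if everything fails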
Gossipsub Inference Protocol
Inference requests can be broadcast via the tenzro/inference/1.0.0 gossipsub topic. Multiple providers can compete to fulfill the request, with the fastest response winning.
// Broadcast inference request to network
{
"type": "inference_request",
"request_id": "req_123",
"model_id": "0x1234...",
"input": "Summarize this document: ...",
"max_tokens": 200,
"max_cost": 100000000000000000, // 0.1 TNZO
"requester": "0x5678...",
"timestamp": 1711234567
}
// Provider responds via gossipsub
{
"type": "inference_response",
"request_id": "req_123",
"output": "This document describes...",
"cost": 75000000000000000, // 0.075 TNZO
"provider": "0xabcd...",
"attestation": "0x...",
"timestamp": 1711234570
}
Gossipsub-based inference enables decentralized request routing without a central router. Providers compete on speed and price, with the requester accepting the first valid response.
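A sketch of the requester side of this competition, assuming decoded response messages arrive in order on some subscribed stream and that attestation checking is delegated to a caller-supplied function; field names follow the messages above:
# Requester-side sketch: accept the first valid gossipsub response (fields as in the messages above).
def accept_first_valid(responses, request_id: str, max_cost: int, verify_attestation):
    """responses: iterable of decoded inference_response messages, in arrival order."""
    for msg in responses:
        if msg.get("type") != "inference_response":
            continue
        if msg.get("request_id") != request_id:
            continue                                   # response to a different request
        if msg.get("cost", 0) > max_cost:
            continue                                   # provider exceeded the requester's budget
        if msg.get("attestation") and not verify_attestation(msg["attestation"]):
            continue                                   # attestation present but invalid
        return msg                                     # first valid response wins
    return None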
RPC Methods: tenzro_chat vs tenzro_inferenceRequest
The node exposes two inference methods. For chat applications and coding assistants, always prefer tenzro_chat as it applies the model's chat template for proper formatting.
| Method | Chat Template | Use Case |
|---|---|---|
| tenzro_chat | Yes (auto-applied) | Chat apps, coding assistants, conversational AI |
| tenzro_inferenceRequest | No (raw prompt) | Batch processing, custom prompt formats, embeddings |
// tenzro_chat — recommended for chat applications
// Uses the model's chat template (proper formatting per architecture)
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_chat",
"params": [{
"model_id": "qwen3.5-0.8b",
"message": "What is the capital of France?",
"max_tokens": 200
}],
"id": 1
}'
// tenzro_inferenceRequest — raw prompt, no chat template
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_inferenceRequest",
"params": [{
"model_id": "gemma4-9b",
"input": "What is the capital of France?",
"strategy": "Cheapest",
"max_tokens": 100
}],
"id": 1
}'
Both methods take their parameters as a single flat JSON object (wrapped in the params array), rather than a list of positional arguments. The desktop app uses tenzro_chat to provide a ChatGPT-like interface with proper chat formatting.
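The same two calls expressed with a plain HTTP client; a minimal sketch using only the Python standard library and the public RPC endpoint shown above:
# Minimal JSON-RPC helper for the two inference methods (standard library only).
import json
import urllib.request

RPC_URL = "https://rpc.tenzro.network"

def rpc_call(method: str, params: dict) -> dict:
    payload = json.dumps({"jsonrpc": "2.0", "method": method, "params": [params], "id": 1}).encode()
    req = urllib.request.Request(RPC_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

# Chat-style call: the chat template is applied automatically.
chat = rpc_call("tenzro_chat", {
    "model_id": "qwen3.5-0.8b",
    "message": "What is the capital of France?",
    "max_tokens": 200,
})

# Raw prompt call: no chat template, explicit routing strategy.
raw = rpc_call("tenzro_inferenceRequest", {
    "model_id": "gemma4-9b",
    "input": "What is the capital of France?",
    "strategy": "Cheapest",
    "max_tokens": 100,
})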
Thinking Models and /no_think
Qwen 3, Qwen 3.5, and DeepSeek models have thinking/reasoning mode enabled by default. When thinking mode is active, the model produces <think>...</think> blocks containing chain-of-thought reasoning before the final answer.
// Thinking mode output (default for Qwen 3/3.5, DeepSeek):
// <think>
// The user is asking about the capital of France.
// The capital of France is Paris.
// </think>
// The capital of France is Paris.
// To disable thinking mode, append /no_think to your message:
{
"jsonrpc": "2.0",
"method": "tenzro_chat",
"params": [{
"model_id": "qwen3.5-0.8b",
"message": "Write a hello world function /no_think",
"max_tokens": 200
}],
"id": 1
}
// With /no_think, the model skips chain-of-thought and responds directly.
// This reduces token usage and latency for simple queries.
The /no_think suffix is handled automatically by the chat template. It only works with tenzro_chat, not tenzro_inferenceRequest (which does not apply chat templates).
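When thinking mode is left on, clients often want to separate the reasoning from the final answer; a small sketch that splits the <think>...</think> blocks out of the returned text:
# Split <think>...</think> reasoning from the final answer in a model response.
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str):
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>\nThe user is asking about the capital of France.\n</think>\nThe capital of France is Paris."
)
print(answer)   # The capital of France is Paris.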
Chat Interface
The desktop app and web interface provide a chat-style UI for multi-turn conversations. The chat history is maintained client-side and included in subsequent inference requests.
// Multi-turn conversation example
Turn 1:
Input: "What is quantum computing?"
Output: "Quantum computing uses quantum bits (qubits)..."
Turn 2 (includes history):
Input: "User: What is quantum computing?
Assistant: Quantum computing uses quantum bits...
User: Give an example"
Output: "For example, Shor's algorithm can factor large numbers..."
Chat history management is client-side to minimize costs. Only the necessary context is included in each request, and older turns are pruned when the context window fills.
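A sketch of client-side history management along these lines: the oldest turns are dropped when the concatenated prompt would exceed an assumed context budget. The character-based token estimate is a crude stand-in for a real tokenizer:
# Client-side chat history with naive pruning (token estimate is a rough stand-in for a tokenizer).
class ChatHistory:
    def __init__(self, max_context_tokens: int = 4096):
        self.turns = []                                  # list of (role, text) tuples
        self.max_context_tokens = max_context_tokens

    @staticmethod
    def _estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)                    # ~4 characters per token, very rough

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        while sum(self._estimate_tokens(t) for _, t in self.turns) > self.max_context_tokens:
            self.turns.pop(0)                            # prune the oldest turn first

    def as_prompt(self, new_user_message: str) -> str:
        lines = [f"{role.capitalize()}: {text}" for role, text in self.turns]
        lines.append(f"User: {new_user_message}")
        return "\n".join(lines)

history = ChatHistory()
history.add("user", "What is quantum computing?")
history.add("assistant", "Quantum computing uses quantum bits (qubits)...")
print(history.as_prompt("Give an example"))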
Inference History and Analytics
The wallet and desktop app track inference history for cost analysis and usage monitoring. Users can view total spend, token counts, and provider distribution.
// Query inference history
curl -X POST https://rpc.tenzro.network \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "tenzro_getInferenceHistory",
"params": ["0x5678..."], // user address
"id": 1
}'
// Response
{
"jsonrpc": "2.0",
"result": [
{
"timestamp": 1711234567,
"model_id": "0x1234...",
"provider_id": "0xabcd...",
"input_tokens": 1500,
"output_tokens": 500,
"cost": "0.0525",
"strategy": "Cheapest"
}
],
"id": 1
}
Analytics help users optimize their inference costs by identifying expensive models, high-volume periods, and routing strategy effectiveness.
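A small sketch of the kind of aggregation the apps perform over this history: total spend, token totals, and per-provider cost distribution. Field names follow the result entries shown above; the aggregation itself is illustrative:
# Aggregate an inference-history result into simple spend/usage analytics (fields as in the response above).
from collections import defaultdict

def summarize_history(entries):
    total_cost = sum(float(e["cost"]) for e in entries)
    total_input = sum(e["input_tokens"] for e in entries)
    total_output = sum(e["output_tokens"] for e in entries)
    by_provider = defaultdict(float)
    for e in entries:
        by_provider[e["provider_id"]] += float(e["cost"])
    return {
        "total_cost_tnzo": total_cost,
        "input_tokens": total_input,
        "output_tokens": total_output,
        "cost_by_provider": dict(by_provider),
    }

# Example with the single entry from the response above:
print(summarize_history([{
    "provider_id": "0xabcd...",
    "input_tokens": 1500,
    "output_tokens": 500,
    "cost": "0.0525",
}]))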