
Tenzro Execution Layer

The runtime that knows how to run a model, not just which one to run

Tenzro's execution runtime serves AI models across a distributed provider network. Providers advertise what they can actually execute. The scheduler matches requests to the right provider, the right execution mode, and the right KV policy — based on the model's topology, the provider's hardware, the available artifacts, and the request's latency and trust constraints.

How it works

Every model in the Tenzro network has a manifest — a machine-readable description of its topology, the execution modes it supports, the artifacts that exist for it, and the constraints that apply when scheduling it. Every provider declares what its hardware can run and which execution modes it supports.

When a request arrives, the Capability Resolver checks what is actually available: which modes the model supports, which of those modes have healthy providers with the right artifacts, and which modes the request policy allows. The Execution Planner then produces a concrete plan — execution mode, coordinator, workers, KV profile, and fallback path.

The session is then executed by the appropriate runtime engine. Session ownership — the decode loop, KV state, sampling, output assembly — stays anchored to the coordinator. Execution work can be delegated: full remote execution, streamed weight loading, expert dispatch for MoE models.
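The ownership split above can be sketched as a coordinator-anchored session object. This is a minimal illustration, not the Tenzro API: the class, field, and method names are invented, and the delegated dispatch is a stand-in for remote execution, streamed loading, or expert calls.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Coordinator-owned session state: the decode loop, KV state,
    sampling, and output assembly never leave the coordinator."""
    session_id: str
    coordinator_id: str
    kv_state: dict = field(default_factory=dict)       # anchored to the coordinator
    output_tokens: list = field(default_factory=list)

    def decode_step(self, dispatch):
        """One decode iteration. The compute may be delegated (e.g. an
        expert call on a remote worker), but sampling and output
        assembly happen here, on the coordinator."""
        logits = dispatch()                      # delegated work, local or remote
        token = max(logits, key=logits.get)      # greedy sampling, for illustration
        self.output_tokens.append(token)
        return token

# A lambda stands in for a delegated expert-dispatch call.
session = Session("sess-123", "provider-a")
token = session.decode_step(lambda: {"hello": 0.9, "world": 0.1})
```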

Execution modes

The runtime supports seven execution modes. The scheduler selects the best available mode based on the model class, available providers, and request policy. Providers only advertise modes they can actually satisfy.

LOCAL_FULL

Local full execution

The model runs in full on the local device. Lowest latency, no network dependency in the execution path. The scheduler prefers this mode when the local provider has sufficient capacity.

REMOTE_FULL

Remote full execution

The full model runs on a remote provider. The provider must hold the complete model weights and have the hardware capacity to serve it. Standard path for most network inference requests.

LOCAL_STREAMED

Local streamed execution

Weights are loaded in stages from disk rather than kept fully resident in memory. Allows a local device to serve a model larger than its available VRAM. Carries degraded latency compared to full-resident execution and is surfaced as such in the UI and receipts.

REMOTE_STREAMED

Remote streamed execution

A remote provider serves the model via staged weight loading. Used when no full-capacity provider is available. The runtime marks this as a degraded-latency path and will not schedule it for interactive traffic unless no better option exists.

HYBRID_MOE_LOCAL

Hybrid MoE — local

The coordinator owns the session. Expert execution is dispatched to workers on the same machine or in the same tightly coupled environment. Used for MoE models where expert shards are available locally and latency constraints are tight.

HYBRID_MOE_REGIONAL

Hybrid MoE — regional

The coordinator owns the session. Expert workers are on separate providers within a low-latency regional boundary. The preferred mode for frontier MoE models on interactive requests. The scheduler enforces a hard regional latency bound — wide-area expert routing is not allowed on interactive paths.

AUTO

Auto

The runtime selects the best available mode based on the model class, available providers, artifact completeness, and request policy. Requesters can specify latency and trust constraints; the planner resolves the rest. If the model manifest requires at least one adaptive mode and none is available, the request fails with an explicit reason rather than silently falling back to an incompatible path.

Model classes

The scheduler treats different model classes differently. A small dense model and a frontier MoE model have different realistic serving paths — the runtime encodes that distinction rather than pretending every model can be served the same way.

Class | Scheduling priority | Default KV policy
SMALL_DENSE / MID_DENSE | LOCAL_FULL → REMOTE_FULL → LOCAL_STREAMED → REMOTE_STREAMED | kv_raw
LARGE_DENSE | LOCAL_FULL → REMOTE_FULL → LOCAL_STREAMED → REMOTE_STREAMED | kv_raw or kv_int8
SMALL_MOE / LARGE_MOE | LOCAL_FULL → HYBRID_MOE_LOCAL → HYBRID_MOE_REGIONAL → REMOTE_FULL → REMOTE_STREAMED | kv_int8
FRONTIER_MOE | HYBRID_MOE_REGIONAL → REMOTE_FULL (high-capacity only) → REMOTE_STREAMED | kv_int8 (enforced)

For frontier MoE models, the runtime enforces compressed KV by default and rejects providers that cannot satisfy at least one adaptive execution mode. Advertising realistic serving capacity is a hard constraint — incapable providers cannot claim support they do not have.
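The priority walk and the frontier-MoE provider check can be sketched as follows. The orderings come from the table above; the function names and the use of plain string identifiers for modes are assumptions for illustration.

```python
# Per-class scheduling priority, mirroring the table above.
PRIORITY = {
    "SMALL_DENSE":  ["LOCAL_FULL", "REMOTE_FULL", "LOCAL_STREAMED", "REMOTE_STREAMED"],
    "MID_DENSE":    ["LOCAL_FULL", "REMOTE_FULL", "LOCAL_STREAMED", "REMOTE_STREAMED"],
    "LARGE_DENSE":  ["LOCAL_FULL", "REMOTE_FULL", "LOCAL_STREAMED", "REMOTE_STREAMED"],
    "SMALL_MOE":    ["LOCAL_FULL", "HYBRID_MOE_LOCAL", "HYBRID_MOE_REGIONAL",
                     "REMOTE_FULL", "REMOTE_STREAMED"],
    "LARGE_MOE":    ["LOCAL_FULL", "HYBRID_MOE_LOCAL", "HYBRID_MOE_REGIONAL",
                     "REMOTE_FULL", "REMOTE_STREAMED"],
    "FRONTIER_MOE": ["HYBRID_MOE_REGIONAL", "REMOTE_FULL", "REMOTE_STREAMED"],
}
ADAPTIVE_MODES = {"LOCAL_STREAMED", "REMOTE_STREAMED",
                  "HYBRID_MOE_LOCAL", "HYBRID_MOE_REGIONAL"}

def select_mode(model_class, available_modes):
    """Take the first mode in the class's priority list that is actually
    available; fail loudly rather than falling back silently."""
    for mode in PRIORITY[model_class]:
        if mode in available_modes:
            return mode
    raise RuntimeError(f"no available execution mode for {model_class}")

def validate_frontier_provider(advertised_modes):
    """Frontier MoE providers must satisfy at least one adaptive mode."""
    if not set(advertised_modes) & ADAPTIVE_MODES:
        raise ValueError("provider advertises no adaptive execution mode")
```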

Artifact model

Execution mode availability is tied to artifact availability. A model is not considered network-ready for a given mode unless the required artifacts exist on healthy providers. The model manifest declares its artifact completeness level.

Artifact types

  • full_weights: Complete model weights for full-residency execution
  • streaming_shard: Weight layout optimized for staged loading
  • expert_shard: Individual expert weights for MoE dispatch
  • kv_profile: KV cache configuration for compressed state management
  • tokenizer: Tokenizer bundle
  • backend_bundle: Runtime-specific execution backend
  • quant_variant: Quantized weight variant

Artifact completeness

  • FULL_ONLY
    Only full weights exist. Supports full execution only.
  • STREAMING_READY
    Full weights plus streaming-compatible layout. Enables streamed execution paths.
  • MOE_READY
    Full weights plus expert-aware artifacts. Enables MoE execution paths.
  • FULLY_ADAPTIVE
    Full, streaming, and MoE-aware artifacts. All execution modes available.
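The mapping from completeness tier to unlocked execution modes can be written down directly. This is a restatement of the list above as a lookup table; the constant and function names are illustrative.

```python
# Execution modes unlocked by each artifact-completeness tier.
FULL_MODES   = {"LOCAL_FULL", "REMOTE_FULL"}
STREAM_MODES = {"LOCAL_STREAMED", "REMOTE_STREAMED"}
MOE_MODES    = {"HYBRID_MOE_LOCAL", "HYBRID_MOE_REGIONAL"}

COMPLETENESS_MODES = {
    "FULL_ONLY":       FULL_MODES,
    "STREAMING_READY": FULL_MODES | STREAM_MODES,
    "MOE_READY":       FULL_MODES | MOE_MODES,
    "FULLY_ADAPTIVE":  FULL_MODES | STREAM_MODES | MOE_MODES,
}

def network_ready_modes(completeness_level):
    """A model is only network-ready for the modes its artifacts unlock."""
    return COMPLETENESS_MODES[completeness_level]
```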

KV state management

Every session has an explicit KV profile. The KV State Manager handles allocation, compression, paging, and snapshotting — giving the runtime control over memory pressure and long-context behavior.

kv_raw

Uncompressed KV. Lowest overhead, highest memory footprint.

kv_int8

Quantized int8 KV cache. Default for MoE and frontier models. Supports snapshot and restore.

kv_mixed

Mixed precision KV. Balances accuracy and memory pressure.

kv_paged

Paged KV allocation. Preferred for long-context workloads. Hot/cold paging of KV blocks.
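One plausible default-selection rule, combining the profiles above with the per-class defaults from the model-class table: frontier MoE is pinned to compressed KV, long contexts prefer paged allocation, and small dense models keep raw KV. The long-context threshold below is purely an assumption; the document does not specify one.

```python
def choose_kv_profile(model_class, context_tokens, long_context_threshold=32_000):
    """Illustrative KV-profile defaults; the threshold is an assumption."""
    if model_class == "FRONTIER_MOE":
        return "kv_int8"          # enforced per the model-class table
    if context_tokens >= long_context_threshold:
        return "kv_paged"         # hot/cold paging for long-context workloads
    if "MOE" in model_class:
        return "kv_int8"          # default for MoE models
    return "kv_raw"               # lowest overhead for dense models
```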

Capability resolution and execution planning

Capability Resolver

Before a plan is created, the Capability Resolver determines what is actually available. It evaluates three layers:

  • Static support — does the model manifest declare support for this mode?
  • Dynamic availability — are the required artifacts on healthy providers right now?
  • Policy eligibility — does the request allow this mode given its latency and trust constraints?

The resolver returns supported modes, available modes, and disqualified modes with explicit reasons. Nothing is silently dropped.
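The three-layer filter can be sketched as a single pass over the manifest's modes, attaching an explicit reason to anything disqualified. Field names and reason strings are illustrative, not the Tenzro wire format.

```python
def resolve_capabilities(manifest_modes, artifact_ready_modes, policy_allowed_modes):
    """Filter manifest-supported modes through dynamic availability and
    request policy; disqualified modes keep an explicit reason so that
    nothing is silently dropped."""
    available, disqualified = [], {}
    for mode in manifest_modes:
        if mode not in artifact_ready_modes:
            disqualified[mode] = "no healthy provider holds the required artifacts"
        elif mode not in policy_allowed_modes:
            disqualified[mode] = "excluded by request latency/trust policy"
        else:
            available.append(mode)
    return {
        "supported": list(manifest_modes),
        "available": available,
        "disqualified": disqualified,
    }
```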

Execution Planner

The Execution Planner takes the capability resolution and produces a concrete execution plan:

  • Which execution mode to use
  • Which provider is the coordinator (session owner)
  • Which workers are assigned, with roles and artifact bindings
  • Which KV profile to use
  • Fallback path if a provider drops mid-session

Example execution plan — frontier MoE model

{
  "session_id": "sess-123",
  "model_id": "minimax-m2.5",
  "mode": "HYBRID_MOE_REGIONAL",
  "coordinator_provider_id": "provider-a",
  "workers": [
    { "provider_id": "provider-a", "role": "EXPERT_WORKER", "artifact_ids": ["artifact-expert-07"] },
    { "provider_id": "provider-b", "role": "EXPERT_WORKER", "artifact_ids": ["artifact-expert-19"] }
  ],
  "kv_profile_id": "kv_int8",
  "fallback": {
    "on_provider_loss": "REMOTE_STREAMED",
    "on_expert_timeout": "REMOTE_FULL"
  }
}

Execution receipts

Every session emits a receipt. Receipts carry the execution mode used, token counts, provider, expert call counts, streamed layer loads, KV profile, KV memory time, and attestation reference if the provider was TEE-attested. This makes metering mode-aware — a streamed execution session and a full-residency session are billed and verified differently.

{
  "receipt_id": "rcpt-1",
  "session_id": "sess-123",
  "model_id": "minimax-m2.5",
  "provider_id": "provider-a",
  "mode": "HYBRID_MOE_REGIONAL",
  "prompt_tokens": 812,
  "completion_tokens": 284,
  "expert_calls": 2272,
  "streamed_layer_loads": 0,
  "kv_profile_id": "kv_int8",
  "kv_memory_ms": 48842,
  "attested": true,
  "attestation_ref": "tee:amd-sev-snp:xyz",
  "output_hash": "sha256:def456..."
}

Provider roles

Providers declare which worker roles they support. A single provider can serve multiple roles. The planner assigns roles per-session based on capability declarations and artifact availability.

FULL_WORKER

Executes the complete model in full residency. Required for LOCAL_FULL and REMOTE_FULL modes.

STREAMING_WORKER

Executes via staged weight loading. Required for LOCAL_STREAMED and REMOTE_STREAMED modes.

EXPERT_WORKER

Executes specific expert shards for MoE models. Required for HYBRID_MOE_LOCAL and HYBRID_MOE_REGIONAL modes.

PREFILL_WORKER

Handles prefill-heavy jobs that can be offloaded from the coordinator to reduce session setup latency.
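The mode-to-role requirements above reduce to a small lookup the planner can consult when assigning a provider. PREFILL_WORKER is omitted because it is an offload optimization rather than a mode requirement; the names below restate the role descriptions, while the function itself is an illustrative sketch.

```python
# Role required to serve each execution mode, per the descriptions above.
MODE_REQUIRED_ROLE = {
    "LOCAL_FULL":          "FULL_WORKER",
    "REMOTE_FULL":         "FULL_WORKER",
    "LOCAL_STREAMED":      "STREAMING_WORKER",
    "REMOTE_STREAMED":     "STREAMING_WORKER",
    "HYBRID_MOE_LOCAL":    "EXPERT_WORKER",
    "HYBRID_MOE_REGIONAL": "EXPERT_WORKER",
}

def can_serve(provider_roles, mode):
    """A provider qualifies for a mode only if it declared the role
    that mode requires."""
    return MODE_REQUIRED_ROLE[mode] in provider_roles
```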

API

The runtime exposes endpoints for model capability inspection, availability resolution, execution planning, and receipt submission.

Method | Endpoint | Description
GET | /v1/models/:id/capabilities | Supported modes, artifact completeness, default KV profile
GET | /v1/models/:id/artifacts | Artifact list with health and locality
POST | /v1/models/:id/availability:resolve | Available modes given latency and trust policy
POST | /v1/providers/:id/capabilities | Register provider runtime capabilities and hardware profile
POST | /v1/execution-plans | Create an execution plan for a session
POST | /v1/receipts | Submit an execution receipt for metering and settlement
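As a sketch of calling the availability endpoint, the helper below builds the POST request with Python's standard library. The body fields `latency` and `trust` are assumptions inferred from the CLI flags; the actual request schema may differ.

```python
import json
from urllib import request

def build_resolve_request(base_url, model_id, latency, trust):
    """Build a POST to /v1/models/:id/availability:resolve. Body fields
    are assumed from the CLI's --latency and --trust flags."""
    body = json.dumps({"latency": latency, "trust": trust}).encode()
    return request.Request(
        f"{base_url}/v1/models/{model_id}/availability:resolve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending is then a one-liner against a live node:
# with request.urlopen(build_resolve_request(...)) as resp:
#     modes = json.load(resp)
```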

CLI

The Tenzro CLI surfaces execution mode throughout the model lifecycle — from inspection to download to serving to inference.

# Inspect a model's execution capabilities
tenzro-cli model inspect --model minimax-m2.5

# Check live network availability by mode
tenzro-cli model availability --model minimax-m2.5 --latency interactive --trust attested-only

# Download with specific artifact types
tenzro-cli model download --model minimax-m2.5 --artifacts full,streaming,experts

# Serve with explicit execution modes
tenzro-cli model serve --model minimax-m2.5 --modes remote-full,remote-streamed,hybrid-moe

# Request inference with mode and policy
tenzro-cli inference request \
  --model minimax-m2.5 \
  --mode auto \
  --latency interactive \
  --trust attested-only

Build on the Tenzro runtime

Serve models, register provider capabilities, and route inference with full visibility into execution mode, artifact availability, and session state.