
Embed Images with DINOv3

Vision · Intermediate · 15 min

DINOv3 is Meta's self-supervised vision encoder — one of the strongest open image-embedding models available. Tenzro's vision runtime serves DINOv3 (vits16 / vitb16 / vitl16), DINOv2, CLIP, and SigLIP2 (base / large / so400m). This tutorial downloads dinov3-vitb16, embeds an image, and compares two images by cosine similarity.

License gate. DINOv3, SAM family models, and Gemma models carry the CommercialCustom license tier. The model registry refuses to load these without an explicit --accept-license <id> flag, recorded per-family in on-disk state. Permissive (Apache-2.0 / MIT) and Attribution (CC-BY-4.0) models load by default.
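Conceptually, the per-family acknowledgment is a small piece of on-disk state keyed by license family. A minimal sketch of that mechanism is below; the file path, schema, and function names are illustrative assumptions, not Tenzro's actual format:

```python
import json
import time
from pathlib import Path

# Hypothetical on-disk state file; the real path/schema is internal to Tenzro.
STATE_FILE = Path("cf_models_state.json")

def accept_license(family: str) -> None:
    """Record that the user accepted a license family (e.g. 'dinov3')."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state.setdefault("accepted_licenses", {})[family] = {
        "accepted_at": int(time.time()),
    }
    STATE_FILE.write_text(json.dumps(state, indent=2))

def is_accepted(family: str) -> bool:
    """The registry would refuse to load a gated model when this is False."""
    if not STATE_FILE.exists():
        return False
    return family in json.loads(STATE_FILE.read_text()).get("accepted_licenses", {})
```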

1. Download with license acceptance

# DINOv3 ships under Meta's CommercialCustom license — explicit accept required
tenzro model download dinov3-vitb16 --accept-license dinov3

# Output:
# License gate: dinov3 (CommercialCustom)
#   Terms: https://ai.meta.com/resources/models-and-libraries/dinov3-license
#   Acknowledgment: recorded in CF_MODELS
# Resolving artifact bundle from HuggingFace Hub...
#   Source: facebook/dinov3-vitb16 (ONNX export)
#   Files: model.onnx
#   SHA-256 verified
#   Saved to: ~/.tenzro/models/dinov3-vitb16/
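The "SHA-256 verified" step amounts to hashing the downloaded artifact and comparing it against an expected digest. A minimal sketch, assuming chunked hashing of the file (the function name and how the expected digest is obtained are illustrative):

```python
import hashlib
from pathlib import Path

def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Hash a downloaded artifact in 1 MiB chunks and compare to the expected digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```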

2. Load into the vision runtime

# Load into the vision runtime
tenzro vision load dinov3-vitb16

# Output:
# Vision runtime loaded:
#   Model: dinov3-vitb16
#   Modality: vision
#   Output dim: 768
#   Normalization: ImageNet (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
#   L2-normalized: true

3. Embed an image

The runtime decodes PNG/JPEG/WebP via the image crate, resizes with Lanczos3 to the encoder's expected resolution, applies ImageNet normalization, and runs the ONNX session. The output is L2-normalized, so cosine similarity reduces to a dot product.

# Embed a single image (PNG, JPEG, or WebP, auto-detected)
tenzro embed-image \
  --model dinov3-vitb16 \
  --image ./photo.jpg

# Output:
# Embedding (dim=768, L2-normalized):
#   [0.0142, -0.0731, 0.0418, ..., 0.0294]
#   latency_ms: 38
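The arithmetic in the preprocessing pipeline can be sketched in plain Python. Decoding, Lanczos3 resizing, and the ONNX session are stubbed out here since they live inside the runtime; only the ImageNet normalization constants and the L2 normalization come from the tutorial:

```python
# ImageNet normalization constants, as reported by the vision runtime.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Map one RGB pixel in [0, 255] to ImageNet-normalized floats."""
    return [
        ((c / 255.0) - m) / s
        for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, as the runtime does for embeddings."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]
```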

4. Compare two images

# Compute cosine similarity between two images directly
tenzro vision similarity \
  --model dinov3-vitb16 \
  --image-a ./cat1.jpg \
  --image-b ./cat2.jpg

# Output:
# cosine_similarity: 0.847
# latency_ms: 71
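Because the embeddings are L2-normalized, the comparison behind this command reduces to a dot product. A sketch of both forms, for intuition (not the runtime's actual code):

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """General cosine similarity, for vectors that are not pre-normalized."""
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot(a, b) / (norm(a) * norm(b))
```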

5. Embed via JSON-RPC

# Equivalent JSON-RPC call. Image is base64-encoded raw bytes.
IMAGE_B64=$(base64 -i ./photo.jpg)

curl https://rpc.tenzro.network \
  -X POST \
  -H "Content-Type: application/json" \
  -d "{
    \"jsonrpc\": \"2.0\",
    \"id\": 1,
    \"method\": \"tenzro_visionEmbed\",
    \"params\": {
      \"model_id\": \"dinov3-vitb16\",
      \"image_base64\": \"$IMAGE_B64\"
    }
  }" | jq

A typical response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "model_id": "dinov3-vitb16",
    "embedding": [0.0142, -0.0731, 0.0418, "..."],
    "dim": 768,
    "l2_normalized": true,
    "latency_ms": 38
  }
}
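The same request can be built from Python. The sketch below only constructs the payload; the method name and parameter keys are taken from the curl example above, and no request is actually sent:

```python
import base64
import json

def build_vision_embed_request(image_bytes: bytes, model_id: str = "dinov3-vitb16") -> str:
    """Build the tenzro_visionEmbed JSON-RPC payload from raw image bytes."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tenzro_visionEmbed",
        "params": {
            "model_id": model_id,
            # Same encoding as `base64 -i photo.jpg` in the shell example.
            "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        },
    })
```

POST the resulting string to https://rpc.tenzro.network with a Content-Type of application/json using any HTTP client.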

See also