Embed Images with DINOv3
DINOv3 is Meta's self-supervised vision encoder — one of the strongest open image-embedding models available. Tenzro's vision runtime serves DINOv3 (vits16 / vitb16 / vitl16), DINOv2, CLIP, and SigLIP2 (base / large / so400m). This tutorial downloads dinov3-vitb16, embeds an image, and compares two images by cosine similarity.
License gate. DINOv3, SAM family models, and Gemma models carry the CommercialCustom license tier. The model registry refuses to load these without an explicit --accept-license <id> flag, recorded per-family in on-disk state. Permissive (Apache-2.0 / MIT) and Attribution (CC-BY-4.0) models load by default.
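The gate rule reduces to a simple check. The sketch below is illustrative only: the tier names come from this page, while the function and the shape of the on-disk acceptance state are hypothetical.

```python
# Illustrative sketch of the license-gate rule described above.
# Tier names are from this page; everything else is hypothetical.

PERMISSIVE_TIERS = {"Permissive", "Attribution"}  # load by default

def may_load(tier: str, license_id: str, accepted_ids: set) -> bool:
    """Return True if the registry may load a model of this license tier.

    accepted_ids stands in for the per-family acceptances recorded on disk
    when the user passes --accept-license <id>.
    """
    if tier in PERMISSIVE_TIERS:
        return True
    # CommercialCustom families (DINOv3, SAM, Gemma) need explicit acceptance
    return license_id in accepted_ids

# Apache-2.0 / MIT and CC-BY-4.0 models need no acceptance on record
assert may_load("Permissive", "qwen3", set())
# DINOv3 is refused until its family id has been accepted
assert not may_load("CommercialCustom", "dinov3", set())
assert may_load("CommercialCustom", "dinov3", {"dinov3"})
```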
1. Download with license acceptance
# DINOv3 ships under Meta's CommercialCustom license — explicit accept required
tenzro model download dinov3-vitb16 --accept-license dinov3
# Output:
# License gate: dinov3 (CommercialCustom)
# Terms: https://ai.meta.com/resources/models-and-libraries/dinov3-license
# Acknowledgment: recorded in CF_MODELS
# Resolving artifact bundle from HuggingFace Hub...
# Source: facebook/dinov3-vitb16 (ONNX export)
# Files: model.onnx
# SHA-256 verified
# Saved to: ~/.tenzro/models/dinov3-vitb16/

2. Load into the vision runtime
# Load into the vision runtime
tenzro vision load dinov3-vitb16
# Output:
# Vision runtime loaded:
# Model: dinov3-vitb16
# Modality: vision
# Output dim: 768
# Normalization: ImageNet (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# L2-normalized: true

3. Embed an image
The runtime decodes PNG/JPEG/WebP via the image crate, resizes with a Lanczos3 filter to the encoder's expected resolution, applies ImageNet normalization, and runs the ONNX session. The output is L2-normalized, so cosine similarity reduces to a dot product.
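That pipeline can be approximated client-side. A sketch with Pillow and NumPy, using the normalization constants the runtime reports above — the 224×224 input size is an assumption (common for ViT-B/16 encoders), not something this page states:

```python
import numpy as np
from PIL import Image

# ImageNet statistics, as reported by the vision runtime above
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(src, size: int = 224) -> np.ndarray:
    """Decode, Lanczos-resize, and ImageNet-normalize to a 1x3xHxW tensor."""
    img = Image.open(src).convert("RGB").resize((size, size), Image.LANCZOS)
    x = np.asarray(img, dtype=np.float32) / 255.0  # HWC in [0, 1]
    x = (x - MEAN) / STD                           # per-channel normalization
    return x.transpose(2, 0, 1)[None, ...]         # NCHW for the ONNX session

# Because the returned embeddings are L2-normalized, cosine similarity
# is just a dot product:
a = np.random.randn(768).astype(np.float32)
b = np.random.randn(768).astype(np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
cosine = float(a @ b)
full = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
assert abs(cosine - full) < 1e-5  # identical for unit vectors
```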
# Embed a single image (PNG, JPEG, or WebP — auto-detected)
tenzro embed-image \
--model dinov3-vitb16 \
--image ./photo.jpg
# Output:
# Embedding (dim=768, L2-normalized):
# [0.0142, -0.0731, 0.0418, ..., 0.0294]
# latency_ms: 38

4. Compare two images
# Compute cosine similarity between two images directly
tenzro vision similarity \
--model dinov3-vitb16 \
--image-a ./cat1.jpg \
--image-b ./cat2.jpg
# Output:
# cosine_similarity: 0.847
# latency_ms: 71

5. Embed via JSON-RPC
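The same call can be issued from Python. A minimal sketch with the standard library, mirroring the request shape in the curl example below — the build_payload helper is ours, not part of any SDK:

```python
import base64
import json
import urllib.request

RPC_URL = "https://rpc.tenzro.network"

def build_payload(image_bytes: bytes, model_id: str = "dinov3-vitb16") -> dict:
    """Assemble a tenzro_visionEmbed JSON-RPC 2.0 request body."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tenzro_visionEmbed",
        "params": {
            "model_id": model_id,
            "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        },
    }

# Placeholder bytes for illustration; in practice: open("photo.jpg", "rb").read()
payload = build_payload(b"\x89PNG-placeholder")
req = urllib.request.Request(
    RPC_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # resp["result"]["embedding"]
```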
# Equivalent JSON-RPC call. Image is base64-encoded raw bytes.
IMAGE_B64=$(base64 -i ./photo.jpg)
curl https://rpc.tenzro.network \
-X POST \
-H "Content-Type: application/json" \
-d "{
\"jsonrpc\": \"2.0\",
\"id\": 1,
\"method\": \"tenzro_visionEmbed\",
\"params\": {
\"model_id\": \"dinov3-vitb16\",
\"image_base64\": \"$IMAGE_B64\"
}
}" | jqA typical response:
{
"jsonrpc": "2.0",
"id": 1,
"result": {
"model_id": "dinov3-vitb16",
"embedding": [0.0142, -0.0731, 0.0418, "..."],
"dim": 768,
"l2_normalized": true,
"latency_ms": 38
}
}

See also
- Model serving documentation — license tiers, modality routing, the vision catalog
- Inference RPC reference — tenzro_visionEmbed, tenzro_visionSimilarity, tenzro_visionClassify
- Embed text with Qwen3-Embedding — pair text embeddings with image embeddings for retrieval