In-cluster models

Nebula routes completions through llm.completion.mode (external | inCluster). Embeddings are fixed to Qwen at 1024 dimensions and always run through the in-cluster vLLM/TEI-compatible embedding path. For AWS EKS enterprise deployments, the recommended in-cluster shape is:

Completions: vLLM serves Qwen/Qwen3.5-9B behind vllm-instruction.
Embeddings: TEI embedding serves Qwen/Qwen3-Embedding-0.6B behind tei-embedding, with 1024-dimensional vectors and fixed replicas by default.

External completions plus in-cluster embeddings is a common production topology: completions can stay on a frontier managed endpoint while the high-volume memory-search path stays inside the VPC.

When to enable in-cluster inference

In-cluster inference makes sense when:

Low-latency retrieval: embedding calls sit on the memory-search hot path and should avoid public API round trips.
Data residency: user data must stay inside the customer’s AWS account or Kubernetes network.
Air-gapped deployment: no internet egress is permitted; model images and weights are mirrored into private registries.
Embedding-volume cost: embedding traffic dominates model spend and the model fits on a small GPU pool.

Architecture

The chart uses separate serving stacks for the two model roles:

vLLM instruction Service — vllm-instruction.<namespace>.svc:8000 serves chat/completions when llm.completion.mode: inCluster.
TEI embedding Service — tei-embedding.<namespace>.svc:8000 serves embeddings when llm.embedding.mode: inCluster and teiEmbedding.enabled: true.
Endpoint env vars on API + worker pods — NEBULA_LLM_VLLM_API_BASE points at vLLM for completions; NEBULA_EMBEDDING_VLLM_API_BASE points at TEI embeddings. The runtime uses the vLLM embedding provider for TEI.

Set llm.inCluster.enabled: true for the in-cluster embedding path. The chart maps llm.completion.mode to the matching in-image TOML profile and emits model endpoint overrides directly through env vars.
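
As an illustration of the endpoint wiring described above, the environment on an API or worker pod might look roughly like the fragment below. This is a sketch, not rendered chart output: the variable names and Service hostnames come from this page, while the nebula namespace, the http scheme, and the absence of any path suffix on the base URL are assumptions.

```yaml
# Illustrative env fragment for an API or worker pod (not actual chart output).
# Assumptions: release namespace "nebula", plain http, no path suffix on the base URL.
env:
  - name: NEBULA_LLM_VLLM_API_BASE        # completions -> vLLM instruction Service
    value: "http://vllm-instruction.nebula.svc:8000"
  - name: NEBULA_EMBEDDING_VLLM_API_BASE  # embeddings -> TEI embedding Service
    value: "http://tei-embedding.nebula.svc:8000"
```

Comparing these values against the environment shown by kubectl describe pod for a running API or worker pod is a quick way to confirm that the chart picked up the intended modes.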

Sizing Reference

Role	Default model	Service	Node pool	CPU	Memory	GPU
Completions	`Qwen/Qwen3.5-9B`	`vllm-instruction`	customer GPU pool, G5/G6 class	4	16-32 GB	1 GPU with 24 GB VRAM
Embeddings	`Qwen/Qwen3-Embedding-0.6B`	`tei-embedding`	`worker-pool: embedding-gpu`, G6/L4 class	4	8-16 GB	1 GPU

Qwen3.5 enables thinking mode by default. Add extraArgs: ["--reasoning-parser", "qwen3"] on the instruction profile so vLLM parses the <think>...</think> blocks into structured response fields rather than streaming them as raw text. The default EKS in-cluster overlay sets this already.

TEI embeddings should run on the dedicated embedding-gpu Karpenter NodePool. The TEI StatefulSet uses per-pod model cache PVCs so each replica keeps a warm model cache; Karpenter adds G6 nodes as pending embedding pods request GPUs.
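
The reasoning-parser setting from this section, written out as a YAML list. Only the extraArgs key and its two values come from this page; where the instruction profile lives in your values file depends on the chart, so the parent path is deliberately left out rather than guessed.

```yaml
# Goes under the vLLM instruction profile in your values file.
# The parent key path is chart-specific and intentionally not shown.
extraArgs:
  - "--reasoning-parser"
  - "qwen3"
```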

HuggingFace Token Provisioning

The default EKS models do not require a HuggingFace access token:

Qwen/Qwen3.5-9B — publicly available, no token required
Qwen/Qwen3-Embedding-0.6B — publicly available, no token required

If you choose a gated model, create a Kubernetes Secret in the release namespace:

kubectl -n nebula create secret generic nebula-hf-token \
  --from-literal=HF_TOKEN=hf_...

Then enable token injection for vLLM profiles:

vllm:
  global:
    hfToken:
      enabled: true
      secretName: nebula-hf-token
      secretKey: HF_TOKEN

TEI embedding follows the same model registry access pattern. Air-gapped clusters should mirror the TEI image and model artifacts to private infrastructure and override teiEmbedding.image.

Enabling On EKS

The bundle ships an EKS overlay at helm/examples/eks/values-vllm-inCluster.yaml. Stack it on top of the base values file:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/eks/values.yaml \
  -f helm/examples/eks/values-vllm-inCluster.yaml

The overlay sets both llm.completion.mode and llm.embedding.mode to inCluster, enables the vLLM instruction profile, and enables one TEI embedding replica:

teiEmbedding:
  enabled: true
  modelId: Qwen/Qwen3-Embedding-0.6B
  dimension: 1024
  replicas: 1
  nodeSelector:
    worker-pool: embedding-gpu

Override resources, image mirrors, and node placement in your own -f my-values.yaml after the overlay; pass the same -f flags to helm upgrade. Embedding identity is fixed to Qwen/Qwen3-Embedding-0.6B at 1024 dimensions.

teiEmbedding:
  resources:
    requests:
      cpu: "4"
      memory: 12Gi
      nvidia.com/gpu: "1"

Changing the embedding model or dimension after collections exist is a data migration: existing vectors and catalog rows were written with the previous identity and dimension.

Topologies

Embeddings always run in-cluster, so the supported topologies differ only in completion mode:

Completion mode	Embedding mode	Typical use
`external`	`inCluster`	managed frontier completions with in-VPC memory search
`inCluster`	`inCluster`	fully in-cluster inference

The chart maps completion mode to NEBULA_CONFIG_NAME automatically. When teiEmbedding.enabled: true, embedding topologies select the Qwen3/TEI config profiles by default. Completion model overrides flow through NEBULA_LLM_<provider>_MODEL. The embedding provider, model, and dimension are not configurable: they are fixed to Qwen3-Embedding at 1024 dimensions. External completions still use their own LLM env-var family; embeddings use the vLLM-compatible env-var family for the fixed Qwen path.
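
A minimal values sketch for the first row of the table (external completions, in-cluster embeddings). The keys are the ones named on this page, nested according to their dotted paths; that nesting is an assumption. Settings for the external provider are omitted because this page does not describe them.

```yaml
# Sketch: external completions + in-cluster embeddings.
# Key nesting is inferred from the dotted paths used on this page.
llm:
  completion:
    mode: external      # completions go to a managed endpoint
  embedding:
    mode: inCluster     # embeddings stay inside the cluster
  inCluster:
    enabled: true       # enables the in-cluster embedding path
teiEmbedding:
  enabled: true
```

Switching llm.completion.mode to inCluster gives the second row, provided the vLLM instruction profile is also enabled, as the EKS overlay does.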

Troubleshooting

TEI embedding pods stay Pending

Check kubectl describe pod <tei-embedding-pod> -n nebula. Common causes are a missing embedding-gpu NodePool, no NVIDIA device plugin, a node selector that does not match worker-pool: embedding-gpu, or a missing toleration for the pool taint.
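
If the cause is node placement, an override along these lines may help. The nodeSelector matches the label used by the EKS overlay. The tolerations key under teiEmbedding and the nvidia.com/gpu taint are assumptions for illustration: substitute the taint your NodePool actually applies, and check that the chart exposes tolerations at this path.

```yaml
teiEmbedding:
  nodeSelector:
    worker-pool: embedding-gpu   # must match the NodePool's node label
  # Hypothetical: assumes the chart accepts tolerations here and that the
  # pool is tainted nvidia.com/gpu:NoSchedule. Adjust to your NodePool.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```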

Embedding endpoint timing out

Verify the Service has ready endpoints: kubectl -n nebula get endpoints tei-embedding. If the endpoint list is empty, inspect the StatefulSet and pod logs with kubectl -n nebula describe sts tei-embedding and kubectl -n nebula logs sts/tei-embedding.

TEI embedding needs more throughput

Increase teiEmbedding.replicas and confirm Karpenter can provision one G6/L4 node per pending GPU replica. Enable the optional CPU HPA only after validating it tracks your workload; GPU saturation is usually better handled by queue-depth, latency, or DCGM/KEDA metrics.
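
For example, a values override that moves from one replica to three could look like the following. The replica count is an arbitrary illustration, not a sizing recommendation.

```yaml
# my-values.yaml (sketch): scale TEI embeddings to three replicas.
teiEmbedding:
  replicas: 3   # illustrative count; each replica needs its own GPU
```

Apply it with helm upgrade using the same -f flags as the original install, with this file listed last so it takes precedence over the overlay.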

vLLM instruction pod stays Pending

Check that the instruction profile’s GPU node selector and tolerations match the customer GPU NodePool, and verify the NVIDIA device plugin is installed so nvidia.com/gpu is allocatable.

Model download stuck

First boot downloads model artifacts from Hugging Face. If pods log connection errors, allow outbound HTTPS to huggingface.co or mirror the image and model artifacts into private infrastructure. If logs show 401 Unauthorized, provision an HF_TOKEN secret.
