Nebula routes model calls through two independent settings, llm.completion.mode and llm.embedding.mode, each of which accepts external or inCluster. Both default to external (a cloud OpenAI-compatible endpoint). When a workload has strict latency, data-residency, cost, or air-gap requirements, run either role inside Kubernetes. For AWS EKS enterprise deployments, the recommended in-cluster shape is:
  • Completions: vLLM serves Qwen/Qwen3.5-9B behind vllm-instruction.
  • Embeddings: Text Embeddings Inference (TEI) serves Qwen/Qwen3-Embedding-0.6B behind tei-embedding, producing 1024-dimensional vectors with a fixed replica count by default.
External completions plus in-cluster embeddings is a common production topology: completions can stay on a frontier managed endpoint while the high-volume memory-search path stays inside the VPC.

When to enable in-cluster inference

In-cluster inference makes sense when:
  • Low-latency retrieval: embedding calls sit on the memory-search hot path and should avoid public API round trips.
  • Data residency: user data must stay inside the customer’s AWS account or Kubernetes network.
  • Air-gapped deployment: no internet egress is permitted; model images and weights are mirrored into private registries.
  • Embedding-volume cost: embedding traffic dominates model spend and the model fits on a small GPU pool.

Architecture

The chart uses separate serving stacks for the two model roles:
  1. vLLM instruction Service: vllm-instruction.<namespace>.svc:8000 serves chat/completions when llm.completion.mode: inCluster.
  2. TEI embedding Service: tei-embedding.<namespace>.svc:8000 serves embeddings when llm.embedding.mode: inCluster and teiEmbedding.enabled: true.
  3. Endpoint env vars on API + worker pods: NEBULA_LLM_VLLM_API_BASE points at vLLM for completions; NEBULA_EMBEDDING_VLLM_API_BASE points at TEI for embeddings. The runtime uses the same OpenAI-compatible embedding provider family for TEI.
Set llm.inCluster.enabled: true whenever either role is in-cluster. The chart maps (completion.mode, embedding.mode) to the matching in-image TOML profile and emits model endpoint overrides directly through env vars.
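As a rough sketch, a values fragment for external completions with in-cluster embeddings might look like the following. The nesting is inferred from the dotted key paths used on this page (llm.completion.mode, llm.embedding.mode, llm.inCluster.enabled, teiEmbedding.enabled) and is an assumption; confirm it against the chart's own values.yaml before relying on it.

```yaml
# Sketch only: key nesting is inferred from the dotted paths on this page.
llm:
  completion:
    mode: external      # completions stay on the managed OpenAI-compatible endpoint
  embedding:
    mode: inCluster     # embeddings are served by TEI inside the cluster
  inCluster:
    enabled: true       # required whenever either role is in-cluster
teiEmbedding:
  enabled: true         # deploys the tei-embedding Service
```

Whether the vLLM instruction profile must be disabled explicitly in this topology is not stated here, so check the rendered manifests with helm template before installing.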

Sizing Reference

Role | Default model | Service | Node pool | CPU | Memory | GPU
Completions | Qwen/Qwen3.5-9B | vllm-instruction | customer GPU pool, g5/g6 class | 4 | 16-32 GB | 1 GPU with 24 GB VRAM
Embeddings | Qwen/Qwen3-Embedding-0.6B | tei-embedding | worker-pool: embedding-gpu, G6/L4 class | 4 | 8-16 GB | 1 GPU
Qwen3.5 enables thinking mode by default. Add extraArgs: ["--reasoning-parser", "qwen3"] on the instruction profile so vLLM parses the <think>...</think> blocks into structured response fields rather than streaming them as raw text. The default EKS in-cluster overlay sets this already.
TEI embeddings should run on the dedicated embedding-gpu Karpenter NodePool. The TEI StatefulSet uses per-pod model cache PVCs so each replica keeps a warm model cache; Karpenter adds G6 nodes as pending embedding pods request GPUs.

Hugging Face Token Provisioning

The default EKS models do not require a Hugging Face access token:
  • Qwen/Qwen3.5-9B: publicly available, no token required
  • Qwen/Qwen3-Embedding-0.6B: publicly available, no token required
If you choose a gated model, create a Kubernetes Secret in the release namespace:
kubectl -n nebula create secret generic nebula-hf-token \
  --from-literal=HF_TOKEN=hf_...
Then enable token injection for vLLM profiles:
vllm:
  global:
    hfToken:
      enabled: true
      secretName: nebula-hf-token
      secretKey: HF_TOKEN
TEI embedding uses the same model registry access pattern. In air-gapped clusters, mirror the TEI image and model artifacts to private infrastructure and override teiEmbedding.image.
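For illustration, an air-gapped override might look like the fragment below. Only teiEmbedding.image and teiEmbedding.modelId appear on this page; the repository and tag sub-keys, the registry hostname, and the local model path are all hypothetical and follow common Helm conventions rather than anything confirmed for this chart.

```yaml
# Sketch only: the sub-keys under image and every value here are assumptions.
teiEmbedding:
  image:
    repository: registry.example.internal/mirrors/text-embeddings-inference  # hypothetical private mirror
    tag: "<mirrored-tag>"                   # placeholder: pin to the tag you mirrored
  modelId: /models/Qwen3-Embedding-0.6B     # hypothetical local path to pre-staged weights
```

How pre-staged weights reach the pod (for example through the per-pod model cache PVC) depends on the chart and is not covered by this sketch.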

Enabling On EKS

The bundle ships an EKS overlay at helm/examples/eks/values-vllm-inCluster.yaml. Stack it on top of the base values file:
helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/eks/values.yaml \
  -f helm/examples/eks/values-vllm-inCluster.yaml
The overlay sets both llm.completion.mode and llm.embedding.mode to inCluster, enables the vLLM instruction profile, and enables one TEI embedding replica:
teiEmbedding:
  enabled: true
  modelId: Qwen/Qwen3-Embedding-0.6B
  dimension: 1024
  replicas: 1
  nodeSelector:
    worker-pool: embedding-gpu
Override resources, image mirrors, and node placement in your own -f my-values.yaml after the overlay; pass the same -f flags to helm upgrade.
To choose a different TEI-supported embedding model at install time, override the model identity and resource shape together. With TEI enabled, teiEmbedding.modelId, teiEmbedding.servedModelName, and teiEmbedding.dimension are the only supported embedding model knobs; do not set llm.embedding.model or llm.embedding.dimension.
teiEmbedding:
  modelId: <hugging-face-model-or-local-path>
  servedModelName: <model-name-nebula-requests>
  dimension: <embedding-dimension>
  resources:
    requests:
      cpu: "4"
      memory: 12Gi
      nvidia.com/gpu: "1"
Changing the embedding model or dimension after collections exist is a data migration: existing vectors and catalog rows were written with the previous identity and dimension.

Topologies

All four mode combinations are first-class:
Completion mode | Embedding mode | Typical use
external | external | simplest managed deployment
external | inCluster | managed frontier completions with in-VPC memory search
inCluster | external | local completions with managed embeddings
inCluster | inCluster | fully in-cluster inference
The chart maps these modes to NEBULA_CONFIG_NAME automatically. When teiEmbedding.enabled: true, in-cluster embedding topologies select Qwen3/TEI config profiles by default. Model and dimension overrides flow through env vars (NEBULA_LLM_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_DIMENSION).
The default TEI model is Qwen3-Embedding at 1024 dimensions; choose a different identity only before ingestion starts, or migrate the catalog and vectors together.
Non-OpenAI external providers are not supported via configName alone. The chart emits OpenAI env-var families for mode: external and VLLM-compatible env-var families for mode: inCluster; using a different external provider requires matching env-var wiring.

Troubleshooting

If TEI embedding pods stay Pending, check kubectl describe pod <tei-embedding-pod> -n nebula. Common causes are a missing embedding-gpu NodePool, no NVIDIA device plugin, a node selector that does not match worker-pool: embedding-gpu, or a missing toleration for the pool taint.
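A placement override that addresses the selector and toleration causes could be sketched as follows. teiEmbedding.nodeSelector appears in the EKS overlay on this page; teiEmbedding.tolerations and the taint key and effect shown are assumptions, so substitute whatever taint your embedding-gpu NodePool actually applies.

```yaml
# Sketch only: teiEmbedding.tolerations and the taint shown are assumptions.
teiEmbedding:
  nodeSelector:
    worker-pool: embedding-gpu    # must match the labels on the NodePool's nodes
  tolerations:
    - key: nvidia.com/gpu         # hypothetical taint key; use the one on your NodePool
      operator: Exists
      effect: NoSchedule
```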
If embedding requests cannot reach TEI, verify that the Service has ready endpoints: kubectl -n nebula get endpoints tei-embedding. If the endpoint list is empty, inspect the StatefulSet and pod logs with kubectl -n nebula describe sts tei-embedding and kubectl -n nebula logs sts/tei-embedding.
If embedding throughput is the bottleneck, increase teiEmbedding.replicas and confirm Karpenter can provision one G6/L4 node per pending GPU replica. Enable the optional CPU HPA only after validating it tracks your workload; GPU saturation is usually better handled by queue-depth, latency, or DCGM/KEDA metrics.
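A minimal values fragment for scaling out, assuming a target of three replicas (the number is an example, not a recommendation):

```yaml
# Sketch only: the replica count is an illustrative value.
teiEmbedding:
  replicas: 3   # each replica requests one GPU, so expect one G6/L4 node per replica
```

Place it in your own values file after the overlay and apply it with helm upgrade using the same -f flags as the install.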
If the vLLM instruction pod stays Pending, check that the instruction profile’s GPU node selector and tolerations match the customer GPU NodePool, and verify the NVIDIA device plugin is installed so nvidia.com/gpu is allocatable.
First boot downloads model artifacts from Hugging Face. If pods log connection errors, allow outbound HTTPS to huggingface.co or mirror the image and model artifacts into private infrastructure. If logs show 401 Unauthorized, provision an HF_TOKEN secret.