Prereqs
Before helm install, the following must be in place on the cluster side. If you’ve never set these up on an EKS cluster, budget half a day; each is well-documented upstream.
Cluster
- EKS 1.30+ (matches what we run internally)
- OIDC provider associated with the cluster (eksctl utils associate-iam-oidc-provider --cluster <name> --approve); required for IRSA
Addons + controllers
| Component | Purpose | Install reference |
|---|---|---|
| Karpenter | Node autoscaling | karpenter.sh/docs |
| AWS Load Balancer Controller | ALB ingress | aws-load-balancer-controller |
| EBS CSI Driver | gp3 volumes for graph-engine / compactor / RabbitMQ | EKS addon: aws-ebs-csi-driver |
| Metrics Server | HPA metrics for API, workers, and optional CPU-backed workload HPAs | metrics-server |
| NVIDIA device plugin | Exposes GPU resources for vLLM and TEI embedding pods | nvidia/k8s-device-plugin |
| External Secrets Operator (recommended) | Sync from AWS Secrets Manager | external-secrets.io |
You also need a Karpenter NodePool covering the instance families Nebula will run on. Our staging clusters use m6i, m7i, and c7i families; production also includes r7i for the graph-engine memory profile. The chart’s resource requests in the example values file fit comfortably in m7i.large and up.
AWS-managed resources (recommended)
- RDS Postgres 16 in the same VPC as the cluster, with the cluster’s node security group allowed inbound on :5432. The Enterprise bundle includes infra/aws-rds-postgres if you want Terraform/OpenTofu to create the RDS instance and parameter group. If you bring your own RDS instance, make vector, pg_partman, and pg_cron available to the master user, set shared_preload_libraries=pg_cron, and set cron.database_name to the Nebula database name.
- S3 bucket in the same region as the cluster. Versioning + SSE-S3 (or SSE-KMS) recommended.
IAM role for IRSA
Create one IAM role with the cluster’s OIDC provider in its trust policy, scoped to the chart’s ServiceAccount (nebula-sa in the install namespace when you helm install nebula …; if you pick a different release name, the SA is <release>-nebula-sa — confirm with kubectl -n <ns> get sa after install). Attach an inline policy granting:
Then set the role ARN under serviceAccount.annotations.eks.amazonaws.com/role-arn in your values file.
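The trust relationship described above can be sketched as a short shell script that renders the policy document. This is a minimal sketch, not the bundle’s own tooling: the account ID, OIDC issuer path, namespace, and release name are placeholders, the output file name trust-policy.json and the helper nebula_sa_name are hypothetical, and the ServiceAccount name is derived from the naming rule stated above.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder (assumption), not a bundle default.
ACCOUNT_ID="111122223333"
OIDC_ISSUER="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"  # issuer URL without https://
NAMESPACE="nebula"
RELEASE="nebula"

# Hypothetical helper: derive the ServiceAccount name from the Helm release name
# (nebula-sa for release "nebula", otherwise <release>-nebula-sa).
nebula_sa_name() {
  if [ "$1" = "nebula" ]; then
    printf '%s\n' "nebula-sa"
  else
    printf '%s\n' "$1-nebula-sa"
  fi
}

SA_NAME="$(nebula_sa_name "$RELEASE")"

# Render a trust policy that lets only this ServiceAccount assume the role via IRSA.
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_ISSUER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_ISSUER}:sub": "system:serviceaccount:${NAMESPACE}:${SA_NAME}",
          "${OIDC_ISSUER}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF
```

The rendered file can then be passed to aws iam create-role with --assume-role-policy-document file://trust-policy.json. Confirm the real ServiceAccount name with kubectl -n <ns> get sa before relying on the derived value.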
Install
The fastest path is the guided AWS workspace generator. It writes the RDS Terraform inputs, installer env, and phase scripts into one operator-owned directory.
The phase scripts seed the app secret at --eso-secret-path, create or update RDS, append the generated Postgres connection settings to nebula-install.env, fetch the RDS-managed admin password into PGPASSWORD, bootstrap and verify the Nebula and Hatchet databases, run Helm, and verify the rollout. Export provider keys such as OPENAI_API_KEY before running 00-seed-app-secret.sh; to update an existing secret, edit nebula-prod-install/app-secret.json and rerun with UPDATE_APP_SECRET=1.
For custom infrastructure or a bring-your-own database, use the lower-level wrapper directly:
1. Import images to ECR
The bundle’s images.tar contains every pinned image and bundle-manifest.json maps those images to Helm values. Use the bundled helper to load, retag, push, and generate digest-pinned image values:
By default the helper mirrors only the Nebula images (nebulaRuntime, graphEngine, and bundled Nebula postgres). For air-gapped EKS with no public-registry egress, add --scope all to mirror Hatchet, RabbitMQ, busybox, and Hatchet Postgres too. If you mirror images manually instead of using the helper, use the source references from bundle-manifest.json; BusyBox is shipped as public.ecr.aws/docker/library/busybox:1.37.0.
2. Seed secrets in AWS Secrets Manager [#secrets-bootstrap]
If you’re using ESO (recommended), put one JSON blob at the path you’ll reference under secrets.esoAws.awsSecretPath:
NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON can stay [] on a fresh install. Populate it only during JWT signing-key rotation; see Service authentication.
When you run the guided PROVISION_POSTGRES=1 path, nebula-enterprise postgres provision creates the Postgres credential Kubernetes Secrets directly and the installer immediately runs nebula-enterprise postgres verify against the same contract. If you instead sync database credentials from AWS Secrets Manager, keep them separate from the app-secret JSON, preserve the chart’s two Secret shapes, and run nebula-enterprise postgres verify before Helm install:
- postgres.credentialsSecret: Kubernetes Secret with username + password keys (those exact lowercase key names; the chart reads them via secretKeyRef.key: username / .key: password).
- hatchetPostgres.credentialsSecret: Kubernetes Secret with a single database_url key (override the key name via hatchetPostgres.databaseUrlKey) holding the full pre-encoded DSN, including ?sslmode=... and &sslrootcert=/etc/ssl/hatchet-postgres-ca.crt when TLS verification is required. Hatchet v0.79 reads DATABASE_URL directly and does not URL-encode discrete fields itself, so the chart relies on the operator to provide a properly encoded URL. If your secret backend stores discrete fields, derive database_url at sync time; the External Secrets Operator target.template.data directive is the standard pattern.
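Where the URL has to be assembled outside the secret backend, the encoding step can be sketched in POSIX shell. This is a minimal sketch under stated assumptions, not part of the bundle: the helper names urlencode and hatchet_dsn are hypothetical, the credentials, host, and database are placeholders, sslmode=verify-full is only an example value, and the encoder handles ASCII input only.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every character outside the RFC 3986
# unreserved set. ASCII input only (assumption).
urlencode() {
  _s=$1
  _out=
  while [ -n "$_s" ]; do
    _rest=${_s#?}          # everything after the first character
    _c=${_s%"$_rest"}      # the first character itself
    case $_c in
      [A-Za-z0-9._~-]) _out=$_out$_c ;;
      *) _out=$_out$(printf '%%%02X' "'$_c") ;;
    esac
    _s=$_rest
  done
  printf '%s\n' "$_out"
}

# Hypothetical helper: assemble the single database_url value from discrete fields.
# Arguments: user password host port database sslmode
hatchet_dsn() {
  printf 'postgresql://%s:%s@%s:%s/%s?sslmode=%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" "$3" "$4" "$(urlencode "$5")" "$6"
}

# Placeholder values for illustration only.
DATABASE_URL="$(hatchet_dsn hatchet 'p@ss/w:rd' db.example.internal 5432 hatchet verify-full)"
printf '%s\n' "$DATABASE_URL"
```

Encoding the user, password, and database name separately keeps reserved characters such as @, /, and : from being read as URL delimiters. When TLS verification is required, append &sslrootcert=/etc/ssl/hatchet-postgres-ca.crt to the result before storing it under the database_url key.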
3. Copy + fill the reference values file
The bundle ships a reference values file at helm/examples/eks/values.yaml with every AWS-specific knob pre-wired (Karpenter, gp3, IRSA, ALB, RDS+S3, ESO). Copy it, fill in the <placeholder> markers (account ID, RDS endpoint, IRSA role ARN, ACM cert ARN, S3 bucket name), and save as your-values.yaml.
4. Install
_common/production-sizing.yaml is the shared production-shape sizing block (replicas, CPU/memory requests + limits, persistence) used by all three cloud-managed K8s examples (EKS/AKS/GKE). Omit it to keep the chart’s minimal-dev defaults; override per-workload in your-values.yaml to fit your cluster.
The chart runs schema migrations and catalog-apply automatically via a per-revision Job (<release>-nebula-migrations-<revision>); API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row. The Job requires releaseContract.releaseId and releaseContract.gitSha to be set — both are stamped by bundle.sh into the bundled values file and consumed automatically; no manual intervention is needed on a standard install. If you supply your own values file without those keys, the chart fails at template time rather than rendering a malformed migration.
5. Verify
Upgrade
Pull the new bundle, push new images to your ECR, then:
Sizing reference
The example values file ships with production-shape defaults for a starter deployment. Scale from there based on measured throughput:
| Workload | Starter | When to scale |
|---|---|---|
| API | 2 replicas, 1 CPU / 2-4 GB | HPA on CPU >70% sustained |
| Worker | 2 replicas, 2 CPU / 4-8 GB | HPA on queue depth (Hatchet metric) |
| Graph engine | 2 replicas, 2 CPU / 4-8 GB | Manual; restart-sensitive (WAL replay) |
| Compactor | 1 replica, 1 CPU / 2-4 GB | Single-writer; do not scale horizontally |
| RabbitMQ | 1 replica, 8 GB PVC | Single-broker is fine up to ~10k workflows/min |
Karpenter + long-running pods
The example values file sets karpenter.enabled=true, which adds karpenter.sh/do-not-disrupt: "true" to the API, worker, graph-engine, compactor, Hatchet engine, and TEI embedding pods. This prevents Karpenter consolidation or drift from killing pods mid-ingest, mid-graph-build, mid-snapshot, or during a warm embedding-cache window. Pods still drain on actual node lifecycle events (rolling update, manual kubectl drain).
If you’re on Cluster Autoscaler instead of Karpenter, leave karpenter.enabled=false. CA respects PDBs by default and doesn’t need the annotation.
Karpenter NodePools
The bundle ships ready-to-apply Karpenter resources under helm/examples/eks/karpenter/. That directory contains EC2NodeClass and NodePool manifests for the following pool shapes:
| Pool | Purpose |
|---|---|
| general | API and worker pods (m6i/m7i/c7i families) |
| argocd | ArgoCD controller (if cluster-level GitOps) |
| hatchet | Hatchet engine (burst-tolerant) |
| workers | Hatchet task workers (high-CPU) |
| graph-engine | Graph-engine memory profile (r7i family) |
| llm | Legacy SaaS pool, arm64 CPU only (c7g.large). Predates customer in-cluster inference; kept for backward compatibility. Do not target this pool for GPU vLLM. |
| embedding-gpu | TEI embedding pods (G6/L4, taint embedding-gpu=true:NoSchedule, label worker-pool: embedding-gpu) |
The directory also includes embedding.yaml for the old CPU vLLM embedding pool. Do not apply it for TEI embeddings.
These are pool-shape definitions only. Bringing up Karpenter itself (controller install, IAM role for Karpenter, SQS queue, EventBridge rules) is a separate prerequisite covered upstream at karpenter.sh/docs/getting-started/. Apply only the pool manifests your deployment uses after the controller is running:
The embedding-gpu pool is needed when llm.embedding.mode: inCluster and teiEmbedding.enabled: true. If completions also run in-cluster, provision an x86_64 GPU pool for the vLLM instruction profile and set vllm.profiles[*].nodeSelector to match it. For fully external model deployments, skip the inference pools.
In-cluster inference (option B)
If you are air-gapped, have a strict latency budget, or cannot route embedding traffic to an external endpoint, Nebula can run model inference inside the cluster. On EKS, the recommended shape is vLLM for completions and TEI GPU embeddings for memory search. See In-cluster models for full architecture, sizing, and HuggingFace token provisioning.
To enable full in-cluster inference on EKS, stack the in-cluster overlay on top of your base values. The overlay sets llm.completion.mode: inCluster and llm.embedding.mode: inCluster, enables the vLLM instruction profile, and enables teiEmbedding behind the tei-embedding Service. Hybrid topologies are first-class: use external completions with in-cluster TEI embeddings when you only need the memory-search hot path inside the VPC.
The embedding-gpu Karpenter NodePool and the NVIDIA device plugin must exist before TEI embedding pods can schedule. If completions also run in-cluster, the vLLM instruction profile needs a separate GPU NodePool that matches its node selector and tolerations.
Troubleshooting
API pods fail to connect to Postgres
Your postgres.credentialsSecret may be missing or may not have the expected keys. The Secret must contain username and password (those exact lowercase key names; the chart reads them via secretKeyRef.key: username / .key: password). If you’re using ESO, check that the ExternalSecret resource synced before the API and worker pods started.
Hatchet pods crashloop with 'SecretKeyNotFound' on database_url
Your hatchetPostgres.credentialsSecret is missing the database_url key (or whatever key name you set in hatchetPostgres.databaseUrlKey). Unlike the Nebula DB Secret, the Hatchet Secret must hold a single pre-encoded DSN: Hatchet’s binaries read DATABASE_URL directly and don’t URL-encode discrete fields. Verify with kubectl -n nebula get secret hatchet-postgres-credentials -o jsonpath='{.data.database_url}' | base64 -d and confirm the value parses as postgresql://user:pass@host:5432/db?sslmode=....
Hatchet setup Job crashloops in wait-for-hatchet-dependencies
The wait container probes hatchetPostgres.host:port directly via nc -z. If you’re in mode=external you must set hatchetPostgres.host at values level; the chart cannot extract it from the DSN inside your credentialsSecret. A mismatch between hatchetPostgres.host and the host embedded in your database_url is a common misconfiguration; align them.
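The host comparison described above can be sketched with plain shell parameter expansion. This is an illustrative sketch only: the helper names dsn_host and hosts_match are hypothetical, the host and DSN are placeholders, and the parsing assumes a properly encoded DSN with a hostname or IPv4 host (no bracketed IPv6 literals).

```shell
#!/bin/sh
# Hypothetical helper: print the host portion of a postgresql:// DSN.
dsn_host() {
  _rest=${1#*://}               # drop the scheme
  _rest=${_rest%%[/?]*}         # drop the path and query string
  _rest=${_rest##*@}            # drop user:password@ if present
  printf '%s\n' "${_rest%%:*}"  # drop the port
}

# Hypothetical helper: succeed only when the values-level host equals the DSN host.
hosts_match() {
  [ "$1" = "$(dsn_host "$2")" ]
}

# Placeholder inputs: the first mirrors hatchetPostgres.host, the second the database_url value.
VALUES_HOST="db.example.internal"
DSN="postgresql://hatchet:p%40ss@db.example.internal:5432/hatchet?sslmode=require"

if hosts_match "$VALUES_HOST" "$DSN"; then
  echo "hatchetPostgres.host matches the DSN host"
else
  echo "mismatch: values=$VALUES_HOST dsn=$(dsn_host "$DSN")"
fi
```

In practice the DSN would come from the decoded Secret rather than a literal, for example from the kubectl ... | base64 -d command shown in the previous entry.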
Graph-engine pod crashlooping with 'AccessDenied' on S3
Either (a) the IRSA role isn’t attached to the ServiceAccount, or (b) the role’s policy doesn’t include the bucket ARN. Check kubectl -n nebula describe sa nebula-sa for the eks.amazonaws.com/role-arn annotation, and trace the IAM policy attached to that role.
ALB ingress shows 'no targets' after install
The AWS Load Balancer Controller takes 30-60s to provision the ALB on first install. Check kubectl -n kube-system logs deploy/aws-load-balancer-controller for any IAM permission errors on the controller’s IRSA role.
Postgres extension missing on first start
RDS doesn’t create extensions just because they are available. vector, pg_partman, and pg_cron must be available to the master user; if your parameter group restricts extension installs with rds.allowed_extensions, include all three there. shared_preload_libraries must include pg_cron, and cron.database_name must match the Nebula database before bootstrap runs migrations.
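The three conditions above can be checked with a few read-only queries before bootstrap. The sketch below only prints the SQL; the helper name rds_preflight_sql is hypothetical and is not one of the bundle’s scripts, and the connection details in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print read-only SQL that reports the extension prerequisites.
rds_preflight_sql() {
  cat <<'EOF'
-- Extensions the instance offers, and whether each is already installed.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('vector', 'pg_partman', 'pg_cron');

-- Should list pg_cron.
SHOW shared_preload_libraries;

-- Should equal the Nebula database name; NULL suggests the setting is not defined.
SELECT current_setting('cron.database_name', true) AS cron_database_name;
EOF
}

rds_preflight_sql
```

Pipe the output into psql as the master user, for example rds_preflight_sql | psql "host=<rds-endpoint> dbname=<nebula-db> user=<master-user>". A missing row in the first result means the extension is not available on the instance at all.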