This is the recommended production deploy for any customer with a real AWS footprint. The Helm chart is the same artifact we run our own staging and production on, and EKS + Karpenter is the deploy shape our release pipeline is tuned for.

Prereqs

Before helm install, the following must be in place on the cluster side. If you’ve never set these up on an EKS cluster, budget half a day; each is well-documented upstream.

Cluster

  • EKS 1.30+ (matches what we run internally)
  • OIDC provider associated with the cluster (eksctl utils associate-iam-oidc-provider --cluster <name> --approve) — required for IRSA

Addons + controllers

| Component | Purpose | Install reference |
| --- | --- | --- |
| Karpenter | Node autoscaling | karpenter.sh/docs |
| AWS Load Balancer Controller | ALB ingress | aws-load-balancer-controller |
| EBS CSI Driver | gp3 volumes for graph-engine / compactor / RabbitMQ | EKS addon: aws-ebs-csi-driver |
| Metrics Server | HPA metrics for API, workers, and optional CPU-backed workload HPAs | metrics-server |
| NVIDIA device plugin | Exposes GPU resources for vLLM and TEI embedding pods | nvidia/k8s-device-plugin |
| External Secrets Operator (recommended) | Sync from AWS Secrets Manager | external-secrets.io |

Karpenter needs a NodePool covering the instance families Nebula will run on. Our staging clusters use m6i, m7i, and c7i families; production also includes r7i for the graph-engine memory profile. The chart’s resource requests on the example values file fit comfortably in m7i.large and up.
  • RDS Postgres 16 in the same VPC as the cluster, with the cluster’s node security group allowed inbound on :5432. The Enterprise bundle includes infra/aws-rds-postgres if you want Terraform/OpenTofu to create the RDS instance and parameter group. If you bring your own RDS instance, make vector, pg_partman, and pg_cron available to the master user, set shared_preload_libraries=pg_cron, and set cron.database_name to the Nebula database name.
  • S3 bucket in the same region as the cluster. Versioning + SSE-S3 (or SSE-KMS) recommended.
See Managed AWS resources for the IAM policy + parameter-group settings.

IAM role for IRSA

Create one IAM role with the cluster’s OIDC provider in its trust policy, scoped to the chart’s ServiceAccount (nebula-sa in the install namespace when you helm install nebula …; if you pick a different release name, the SA is <release>-nebula-sa — confirm with kubectl -n <ns> get sa after install). Attach an inline policy granting:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
Reference the role ARN under serviceAccount.annotations.eks.amazonaws.com/role-arn in your values file.
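The trust policy itself is not shown above. As a minimal sketch, the helper below prints a standard IRSA trust policy for one ServiceAccount. The function name irsa_trust_policy and the sample account ID are illustrative and not part of the bundle, and the sketch assumes the standard aws partition; the sub and aud condition keys follow the usual EKS OIDC pattern.

```shell
# Hypothetical helper (not shipped in the bundle): print an IRSA trust policy.
# Args: <account-id> <oidc-issuer-url> <namespace> <service-account>
irsa_trust_policy() {
  _account_id=$1
  _issuer=${2#https://}   # IAM condition keys use the issuer without the scheme
  _namespace=$3
  _sa=$4
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${_account_id}:oidc-provider/${_issuer}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${_issuer}:sub": "system:serviceaccount:${_namespace}:${_sa}",
          "${_issuer}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF
}

# Example usage (requires AWS credentials, so it is left commented out):
#   issuer=$(aws eks describe-cluster --name <cluster-name> \
#     --query cluster.identity.oidc.issuer --output text)
#   irsa_trust_policy <account-id> "$issuer" nebula nebula-sa > trust.json
#   aws iam create-role --role-name <role-name> \
#     --assume-role-policy-document file://trust.json
```

Scoping the sub condition to a single namespace and ServiceAccount keeps other pods in the cluster from assuming the role. If your release name changes the ServiceAccount name, pass that name instead of nebula-sa.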

Install

The fastest path is the guided AWS workspace generator. It writes the RDS Terraform inputs, installer env, and phase scripts into one operator-owned directory:
tar -xzf nebula-enterprise-<version>.tar.gz
cd nebula-enterprise-<version>/
sha256sum -c checksums.txt

./nebula-enterprise init aws \
  --output-dir nebula-prod-install \
  --eks-cluster-name <cluster-name> \
  --ecr-registry <account-id>.dkr.ecr.us-east-1.amazonaws.com \
  --vpc-id <vpc-id> \
  --subnet-id <private-subnet-a> \
  --subnet-id <private-subnet-b> \
  --allowed-security-group <eks-node-or-pod-sg> \
  --s3-bucket <bucket-name> \
  --service-account-role-arn <nebula-irsa-role-arn> \
  --eso-secret-path <secrets-manager-path> \
  --domain nebula.<your-domain> \
  --ingress-certificate-arn <acm-cert-arn>

./nebula-prod-install/scripts/00-seed-app-secret.sh
./nebula-prod-install/scripts/01-provision-rds.sh
./nebula-prod-install/scripts/02-install-nebula.sh
./nebula-prod-install/scripts/03-verify.sh
The phase scripts create the AWS Secrets Manager app secret at --eso-secret-path, create or update RDS, append the generated Postgres connection settings to nebula-install.env, fetch the RDS-managed admin password into PGPASSWORD, bootstrap and verify the Nebula and Hatchet databases, run Helm, and verify the rollout.
Export provider keys such as OPENAI_API_KEY before running 00-seed-app-secret.sh. To update an existing secret, edit nebula-prod-install/app-secret.json and rerun with UPDATE_APP_SECRET=1.
For custom infrastructure or a bring-your-own database, use the lower-level wrapper directly:
./enterprise/install-k8s.sh --print-env-template > nebula-install.env
# Fill in nebula-install.env with EKS, ECR, Postgres, S3, IAM, secret, and ingress values.
./enterprise/install-k8s.sh --env-file nebula-install.env
The wrapper’s image-import phase verifies the same checksums before loading or pushing images; the manual check above catches a corrupted bundle before you edit the install env. The lower-level steps below are the same operations the shell wrapper performs.

1. Import images to ECR

The bundle’s images.tar contains every pinned image and bundle-manifest.json maps those images to Helm values. Use the bundled helper to load, retag, push, and generate digest-pinned image values:
./nebula-enterprise images import \
  --provider aws \
  --aws-region us-east-1 \
  --registry <account-id>.dkr.ecr.us-east-1.amazonaws.com \
  --repository-prefix nebula \
  --create-repositories \
  --output helm/values.images.generated.yaml
By default this imports Nebula first-party images (nebulaRuntime, graphEngine, and bundled Nebula postgres). For air-gapped EKS with no public-registry egress, add --scope all to mirror Hatchet, RabbitMQ, busybox, and Hatchet Postgres too. If you mirror images manually instead of using the helper, use the source references from bundle-manifest.json; BusyBox is shipped as public.ecr.aws/docker/library/busybox:1.37.0.
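If you do mirror by hand, the tag-and-push step can be sketched as below. This assumes the images are already in the local Docker store (for example after docker load -i images.tar), that the source references are tag-based rather than digest-based, and that you are logged in to ECR. The function names and the <registry>/<prefix>/<image> repository layout are illustrative; the bundled helper may name repositories differently, so the image values you pass to Helm must match whatever you actually push.

```shell
# Hypothetical helpers for manual mirroring (not part of the bundle).

# Map a source reference to a target reference in your registry.
# Args: <registry> <repository-prefix> <source-ref>
ecr_target_ref() {
  _registry=$1
  _prefix=$2
  _source=$3
  # Keep only the last path segment (image:tag) of the source reference.
  printf '%s/%s/%s\n' "$_registry" "$_prefix" "${_source##*/}"
}

# Retag a locally loaded image and push it to the target registry.
mirror_image() {
  _target=$(ecr_target_ref "$1" "$2" "$3")
  docker tag "$3" "$_target" && docker push "$_target"
}

# Example usage (requires Docker and ECR access, so it is left commented out):
#   aws ecr get-login-password --region us-east-1 |
#     docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
#   mirror_image <account-id>.dkr.ecr.us-east-1.amazonaws.com nebula \
#     public.ecr.aws/docker/library/busybox:1.37.0
```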

2. Seed secrets in AWS Secrets Manager [#secrets-bootstrap]

If you’re using ESO (recommended), put one JSON blob at the path you’ll reference under secrets.esoAws.awsSecretPath:
{
  "OPENAI_API_KEY": "sk-...",
  "NEBULA_SECRET_KEY": "<random 32 bytes hex>",
  "NEBULA_SERVICE_API_KEY": "<random 32 bytes hex>",
  "NEBULA_WEBHOOK_HMAC_SECRET": "<random 32 bytes hex>",
  "NEBULA_JWT_PRIVATE_KEY_PEM": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----",
  "NEBULA_JWT_KID": "<stable per-deployment value>",
  "NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON": "[]",
  "NEBULA_INTERNAL_WAKE_TOKEN": "<random 32 bytes hex>",
  "NEBULA_VECTOR_BUILD_HATCHET_TRIGGER_TOKEN": "<random 32 bytes hex>"
}
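One way to fill the <random 32 bytes hex> placeholders is sketched below. The function name gen_hex_secret is illustrative; it reads 32 bytes from /dev/urandom and prints them as 64 lowercase hex characters. Where OpenSSL is installed, openssl rand -hex 32 produces a value of the same shape.

```shell
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
gen_hex_secret() {
  # od prints the bytes as space-separated hex pairs; tr strips the separators.
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# Example usage: generate a distinct value for each key.
#   NEBULA_SECRET_KEY=$(gen_hex_secret)
#   NEBULA_SERVICE_API_KEY=$(gen_hex_secret)
```

Generate a separate value per key rather than reusing one, so that rotating a single secret does not force rotation of the others.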
NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON can stay [] on a fresh install. Populate it only during JWT signing-key rotation; see Service authentication.
When you run the guided PROVISION_POSTGRES=1 path, nebula-enterprise postgres provision creates the Postgres credential Kubernetes Secrets directly and the installer immediately runs nebula-enterprise postgres verify against the same contract. If you instead sync database credentials from AWS Secrets Manager, keep them separate from the app-secret JSON, preserve the chart’s two Secret shapes, and run nebula-enterprise postgres verify before Helm install:
  • postgres.credentialsSecret: Kubernetes Secret with username + password keys (those exact lowercase key names — the chart reads them via secretKeyRef.key: username / .key: password).
  • hatchetPostgres.credentialsSecret: Kubernetes Secret with a single database_url key (override the key name via hatchetPostgres.databaseUrlKey) holding the full pre-encoded DSN, including ?sslmode=... and &sslrootcert=/etc/ssl/hatchet-postgres-ca.crt when TLS verification is required. Hatchet v0.79 reads DATABASE_URL directly and does not URL-encode discrete fields itself, so the chart relies on the operator to provide a properly-encoded URL. If your secret backend stores discrete fields, derive database_url at sync time — the External Secrets Operator target.template.data directive is the standard pattern:
    spec:
      target:
        template:
          data:
            database_url: |
              postgresql://{{ urlquery .username }}:{{ urlquery .password }}@<hatchet-rds-endpoint>:5432/hatchet?sslmode=verify-full&sslrootcert=/etc/ssl/hatchet-postgres-ca.crt
      dataFrom:
        - extract:
            key: hatchet-postgres-credentials
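
If you build database_url by hand instead of through an ESO template, the username and password still need percent-encoding. The sketch below is one way to do that in POSIX shell. The function names urlencode and hatchet_dsn are illustrative, the database name and TLS parameters are copied from the template above, and the encoder only handles ASCII input; credentials with multi-byte characters need a real URL encoder.

```shell
# Hypothetical helper: percent-encode an ASCII string (RFC 3986 unreserved set passes through).
urlencode() {
  _s=$1
  _out=
  while [ -n "$_s" ]; do
    _rest=${_s#?}            # everything after the first character
    _c=${_s%"$_rest"}        # the first character
    case $_c in
      [a-zA-Z0-9.~_-]) _out=$_out$_c ;;
      *) _out=$_out$(printf '%%%02X' "'$_c") ;;   # "'c" yields the character code
    esac
    _s=$_rest
  done
  printf '%s' "$_out"
}

# Assemble the Hatchet DSN from discrete fields.
# Args: <username> <password> <host>
hatchet_dsn() {
  printf 'postgresql://%s:%s@%s:5432/hatchet?sslmode=verify-full&sslrootcert=/etc/ssl/hatchet-postgres-ca.crt\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" "$3"
}

# Example usage:
#   hatchet_dsn hatchet 'p@ss/w:rd' <hatchet-rds-endpoint>
```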
    

3. Copy + fill the reference values file

The bundle ships a reference values file at helm/examples/eks/values.yaml with every AWS-specific knob pre-wired (Karpenter, gp3, IRSA, ALB, RDS+S3, ESO). Copy it, fill in the <placeholder> markers (account ID, RDS endpoint, IRSA role ARN, ACM cert ARN, S3 bucket name), and save as your-values.yaml.

4. Install

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml \
  -f helm/values.images.generated.yaml
_common/production-sizing.yaml is the shared production-shape sizing block (replicas, CPU/memory requests + limits, persistence) used by all three cloud-managed K8s examples (EKS/AKS/GKE). Omit it to keep the chart’s minimal-dev defaults; override per-workload in your-values.yaml to fit your cluster.
The chart runs schema migrations and catalog-apply automatically via a per-revision Job (<release>-nebula-migrations-<revision>); API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row. The Job requires releaseContract.releaseId and releaseContract.gitSha to be set. Both are stamped by bundle.sh into the bundled values file and consumed automatically; no manual intervention is needed on a standard install. If you supply your own values file without those keys, the chart fails at template time rather than rendering a malformed migration.

5. Verify

kubectl -n nebula get pods
kubectl -n nebula get ingress nebula
# Once the ALB is provisioned:
curl -fsS https://nebula.<your-domain>/v1/health
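
Because the ALB and DNS can take a while to become reachable, a single curl may fail on a fresh install. The sketch below wraps the same health check in a retry loop. The function name wait_for_health and the default attempt count and delay are illustrative choices, not values from the bundle.

```shell
# Hypothetical helper: poll a health URL until it answers with a 2xx status.
# Args: <url> [attempts, default 30] [delay seconds, default 10]
wait_for_health() {
  _url=$1
  _attempts=${2:-30}
  _delay=${3:-10}
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if curl -fsS --max-time 5 "$_url" >/dev/null 2>&1; then
      echo "healthy after $_i attempt(s)"
      return 0
    fi
    sleep "$_delay"
    _i=$((_i + 1))
  done
  echo "not healthy after $_attempts attempts: $_url" >&2
  return 1
}

# Example usage:
#   wait_for_health "https://nebula.<your-domain>/v1/health" 30 10
```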

Upgrade

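Pull the new bundle, push the new images to your ECR (regenerating helm/values.images.generated.yaml), then run helm upgrade with the same values files you passed at install. When you pass any -f flag, Helm starts from the chart defaults and does not carry forward files you omit, so leaving out the sizing or image files reverts those settings:
helm upgrade nebula ./helm/nebula-<new-version>.tgz \
  -n nebula \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml \
  -f helm/values.images.generated.yaml
This is a rolling update; the API tier sees no downtime once the target database has already been migrated for the bundle version.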

Sizing reference

The example values file ships with production-shape defaults for a starter deployment. Scale from there based on measured throughput:
| Workload | Starter | When to scale |
| --- | --- | --- |
| API | 2 replicas, 1 CPU / 2-4 GB | HPA on CPU >70% sustained |
| Worker | 2 replicas, 2 CPU / 4-8 GB | HPA on queue depth (Hatchet metric) |
| Graph engine | 2 replicas, 2 CPU / 4-8 GB | Manual; restart-sensitive (WAL replay) |
| Compactor | 1 replica, 1 CPU / 2-4 GB | Single-writer; do not scale horizontally |
| RabbitMQ | 1 replica, 8 GB PVC | Single-broker is fine up to ~10k workflows/min |

Karpenter + long-running pods

The example values file sets karpenter.enabled=true, which adds karpenter.sh/do-not-disrupt: "true" to the API, worker, graph-engine, compactor, Hatchet engine, and TEI embedding pods. This prevents Karpenter consolidation or drift from killing pods mid-ingest, mid-graph-build, mid-snapshot, or during a warm embedding-cache window. Pods still drain on actual node lifecycle events (rolling update, manual kubectl drain).
If you’re on Cluster Autoscaler instead of Karpenter, leave karpenter.enabled=false. CA respects PDBs by default and doesn’t need the annotation.

Karpenter NodePools

The bundle ships ready-to-apply Karpenter resources under helm/examples/eks/karpenter/. That directory contains EC2NodeClass and NodePool manifests for the following pool shapes:
| Pool | Purpose |
| --- | --- |
| general | API and worker pods (m6i/m7i/c7i families) |
| argocd | ArgoCD controller (if cluster-level GitOps) |
| hatchet | Hatchet engine (burst-tolerant) |
| workers | Hatchet task workers (high-CPU) |
| graph-engine | Graph-engine memory profile (r7i family) |
| llm | Legacy SaaS pool: arm64 CPU only (c7g.large). Predates customer in-cluster inference; kept for backward compatibility. Do not target this pool for GPU vLLM. |
| embedding-gpu | TEI embedding pods (G6/L4, taint embedding-gpu=true:NoSchedule, label worker-pool: embedding-gpu) |

The directory also includes a legacy embedding.yaml for the old CPU vLLM embedding pool. Do not apply it for TEI embeddings.
These are pool-shape definitions only. Bringing up Karpenter itself (controller install, IAM role for Karpenter, SQS queue, EventBridge rules) is a separate prerequisite covered upstream at karpenter.sh/docs/getting-started/. Once the controller is running, apply only the pool manifests your deployment uses:
kubectl apply -f helm/examples/eks/karpenter/general.yaml
kubectl apply -f helm/examples/eks/karpenter/hatchet.yaml
kubectl apply -f helm/examples/eks/karpenter/workers.yaml
kubectl apply -f helm/examples/eks/karpenter/graph-engine.yaml
kubectl apply -f helm/examples/eks/karpenter/embedding-gpu.yaml
The embedding-gpu pool is needed when llm.embedding.mode: inCluster and teiEmbedding.enabled: true. If completions also run in-cluster, provision an x86_64 GPU pool for the vLLM instruction profile and set vllm.profiles[*].nodeSelector to match it. For fully external model deployments, skip the inference pools.

In-cluster inference (option B)

If you are air-gapped, have a strict latency budget, or cannot route embedding traffic to an external endpoint, Nebula can run model inference inside the cluster. On EKS, the recommended shape is vLLM for completions and TEI GPU embeddings for memory search. See In-cluster models for full architecture, sizing, and HuggingFace token provisioning.
To enable full in-cluster inference on EKS, stack the in-cluster overlay on top of your base values. The overlay sets llm.completion.mode: inCluster and llm.embedding.mode: inCluster, enables the vLLM instruction profile, and enables teiEmbedding behind the tei-embedding Service. Hybrid topologies are first-class: use external completions with in-cluster TEI embeddings when you only need the memory-search hot path inside the VPC.
helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/eks/values.yaml \
  -f helm/examples/eks/values-vllm-inCluster.yaml
The embedding-gpu Karpenter NodePool and NVIDIA device plugin must exist before TEI embedding pods can schedule. If completions also run in-cluster, the vLLM instruction profile needs a separate GPU NodePool that matches its node selector and tolerations.

Troubleshooting

Your postgres.credentialsSecret may be missing or may not have the expected keys. The Secret must contain username and password (those exact lowercase key names; the chart reads them via secretKeyRef.key: username / .key: password). If you’re using ESO, check that the ExternalSecret resource synced before the API and worker pods started.
Your hatchetPostgres.credentialsSecret is missing the database_url key (or whatever key name you set in hatchetPostgres.databaseUrlKey). Unlike the Nebula DB Secret, the Hatchet Secret must hold a single pre-encoded DSN — Hatchet’s binaries read DATABASE_URL directly and don’t URL-encode discrete fields. Verify with kubectl -n nebula get secret hatchet-postgres-credentials -o jsonpath='{.data.database_url}' | base64 -d and confirm the value parses as postgresql://user:pass@host:5432/db?sslmode=....
The wait container probes hatchetPostgres.host:port directly via nc -z. If you’re in mode=external you must set hatchetPostgres.host at values level — the chart cannot extract it from the DSN inside your credentialsSecret. Mismatch between hatchetPostgres.host and the host embedded in your database_url is a common misconfiguration; align them.
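To check for the host mismatch described above, you can pull the host out of the DSN and compare it with your values. The sketch below does this with plain parameter expansion. The function names are illustrative, and it assumes a well-formed, pre-encoded DSN with a hostname or IPv4 address (IPv6 literals are not handled).

```shell
# Hypothetical helper: print the host portion of a postgresql:// DSN.
dsn_host() {
  _rest=${1#*://}              # drop the scheme
  _authority=${_rest%%[/?]*}   # keep user:pass@host:port, drop path and query
  _hostport=${_authority##*@}  # drop the credentials
  printf '%s\n' "${_hostport%%:*}"
}

# Compare the DSN host with the value set at hatchetPostgres.host.
# Args: <dsn> <expected-host>
check_hatchet_host() {
  _actual=$(dsn_host "$1")
  if [ "$_actual" = "$2" ]; then
    echo "ok: $2"
  else
    echo "mismatch: DSN has $_actual, values has $2" >&2
    return 1
  fi
}

# Example usage (requires cluster access, so it is left commented out):
#   dsn=$(kubectl -n nebula get secret hatchet-postgres-credentials \
#     -o jsonpath='{.data.database_url}' | base64 -d)
#   check_hatchet_host "$dsn" <hatchetPostgres.host from your values>
```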
Either (a) the IRSA role isn’t attached to the ServiceAccount, or (b) the role’s policy doesn’t include the bucket ARN. Check kubectl -n nebula describe sa nebula-sa for the eks.amazonaws.com/role-arn annotation, and trace the IAM policy attached to that role.
The AWS Load Balancer Controller takes 30-60s to provision the ALB on first install. Check kubectl -n kube-system logs deploy/aws-load-balancer-controller for any IAM permission errors on the controller’s IRSA role.
RDS doesn’t create extensions just because they are available. vector, pg_partman, and pg_cron must be available to the master user; if your parameter group restricts extension installs with rds.allowed_extensions, include all three there. shared_preload_libraries must include pg_cron, and cron.database_name must match the Nebula database before bootstrap runs migrations.
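A quick way to see which of the three extensions the instance does not offer is sketched below. The function name missing_extensions is illustrative; it reads extension names, one per line, from standard input and prints any of vector, pg_partman, and pg_cron that are absent. The psql invocation in the usage comment queries the standard pg_available_extensions view and assumes your connection settings are already in the environment.

```shell
# Hypothetical helper: print required extensions missing from the list on stdin.
missing_extensions() {
  _available=$(cat)
  for _ext in vector pg_partman pg_cron; do
    # -x matches whole lines only, -F treats the name as a literal string.
    printf '%s\n' "$_available" | grep -qxF "$_ext" || printf '%s\n' "$_ext"
  done
}

# Example usage (requires database access, so it is left commented out):
#   psql -At -c "SELECT name FROM pg_available_extensions" | missing_extensions
#   psql -At -c "SHOW shared_preload_libraries"
#   psql -At -c "SHOW cron.database_name"
```

Empty output from missing_extensions means all three are available; the two SHOW queries cover the parameter-group settings.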