EKS - Nebula

This is the recommended production deployment for customers with an existing AWS footprint. The Helm chart is the same artifact we run in our own staging and production environments, and EKS with Karpenter is the deployment shape our release pipeline is tuned for.

Prereqs

Before helm install, the following must be in place on the cluster side. If you’ve never set these up on an EKS cluster, budget half a day; each is well-documented upstream.

Cluster

EKS 1.30+ (matches what we run internally)
OIDC provider associated with the cluster (eksctl utils associate-iam-oidc-provider --cluster <name> --approve) — required for IRSA
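
Before creating the IRSA role, it can help to confirm that the cluster’s OIDC issuer is actually registered as an IAM OIDC provider. The following is a minimal sketch, assuming the AWS CLI is already authenticated against the cluster’s account and region. The function names are illustrative and not part of the bundle.

```shell
# Sketch: confirm the cluster's OIDC issuer is registered as an IAM OIDC provider.
# Assumes an authenticated AWS CLI; both function names are hypothetical helpers.

# Print the issuer without its scheme, e.g. oidc.eks.<region>.amazonaws.com/id/<ID>.
oidc_issuer_path() {
  local cluster="$1" issuer
  issuer="$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.identity.oidc.issuer' --output text)" || return 1
  printf '%s\n' "${issuer#https://}"
}

# Succeed only if some IAM OIDC provider ARN ends with that issuer path.
oidc_provider_registered() {
  local issuer_path
  issuer_path="$(oidc_issuer_path "$1")" || return 1
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text \
    | tr '\t' '\n' | grep -q -- "/${issuer_path}\$"
}

# Example: oidc_provider_registered <name> && echo "OIDC provider present"
```

If the check fails, run the eksctl command above and try again.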

Addons + controllers

Component	Purpose	Install reference
Karpenter	Node autoscaling	karpenter.sh/docs
AWS Load Balancer Controller	ALB ingress	aws-load-balancer-controller
EBS CSI Driver	gp3 volumes for graph-engine / compactor / Queue	EKS addon: `aws-ebs-csi-driver`
Metrics Server	HPA metrics for API, workers, and optional CPU-backed workload HPAs	metrics-server
NVIDIA device plugin	Exposes GPU resources for vLLM and TEI embedding pods	nvidia/k8s-device-plugin
External Secrets Operator (recommended)	Sync from AWS Secrets Manager	external-secrets.io

Karpenter needs a NodePool covering the instance families Nebula will run on. Our staging clusters use m6i, m7i, and c7i families; production also includes r7i for the graph-engine memory profile. The chart’s resource requests on the example values file fit comfortably in m7i.large and up.

AWS-managed resources (recommended)

RDS Postgres 16 in the same VPC as the cluster, with the cluster’s node security group allowed inbound on :5432. The Enterprise bundle includes infra/aws-rds-postgres if you want Terraform/OpenTofu to create the RDS instance and parameter group. If you bring your own RDS instance, make vector, pg_partman, and pg_cron available to the master user, set shared_preload_libraries=pg_cron, and set cron.database_name to the Nebula database name.
S3 bucket in the same region as the cluster. Versioning + SSE-S3 (or SSE-KMS) recommended.
DynamoDB orchestration tables in the same region as the cluster. Run nebula-enterprise orchestration dynamodb ensure to create or verify the four pk / sk tables and writer-authority records.

See Managed AWS resources for the IAM policy + parameter-group settings.
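
If you create the S3 bucket by hand, the recommended versioning and SSE-S3 settings can be applied with the AWS CLI. This is a sketch under the assumption that the bucket already exists and the CLI is authenticated; harden_bucket is a hypothetical helper name.

```shell
# Sketch: enable versioning and SSE-S3 default encryption on an existing bucket.
# Assumes an authenticated AWS CLI; harden_bucket is a hypothetical helper.
harden_bucket() {
  local bucket="$1"
  aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled || return 1
  aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
}

# Example: harden_bucket <your-bucket>
```

For SSE-KMS, the encryption rule would use the aws:kms algorithm and a key ID instead of AES256.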

IAM role for IRSA

Create one IAM role with the cluster’s OIDC provider in its trust policy, scoped to the chart’s ServiceAccount. With helm install nebula …, the ServiceAccount is nebula-sa in the install namespace; with any other release name it is <release>-nebula-sa. Confirm with kubectl -n <ns> get sa after install. Attach an inline policy granting:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:ConditionCheckItem",
        "dynamodb:DeleteItem",
        "dynamodb:DescribeTable",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query",
        "dynamodb:UpdateItem",
        "dynamodb:BatchGetItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:<region>:<account-id>:table/nebula-orchestration-state",
        "arn:aws:dynamodb:<region>:<account-id>:table/nebula-orchestration-runtime",
        "arn:aws:dynamodb:<region>:<account-id>:table/nebula-orchestration-events",
        "arn:aws:dynamodb:<region>:<account-id>:table/nebula-orchestration-transitions"
      ]
    }
  ]
}

DynamoDB transactions are authorized through the underlying item permissions plus dynamodb:ConditionCheckItem; the DynamoDB service or compatible endpoint must still support the TransactWriteItems API. Reference the role ARN under serviceAccount.annotations.eks.amazonaws.com/role-arn in your values file.
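
The trust policy that scopes the role to the chart’s ServiceAccount follows the standard IRSA shape. The sketch below renders it from four operator-supplied values. render_trust_policy and the nebula-irsa role name in the usage comment are hypothetical, and the OIDC path is the cluster issuer URL without its https:// prefix.

```shell
# Sketch: render an IRSA trust policy for one ServiceAccount.
# All arguments are operator-supplied; render_trust_policy is a hypothetical helper.
render_trust_policy() {
  local account_id="$1" oidc_path="$2" namespace="$3" sa_name="$4"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/${oidc_path}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${oidc_path}:sub": "system:serviceaccount:${namespace}:${sa_name}",
          "${oidc_path}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF
}

# Example (placeholder values; the role name is an assumption):
# render_trust_policy <account-id> <oidc-path> nebula nebula-sa > trust.json
# aws iam create-role --role-name nebula-irsa --assume-role-policy-document file://trust.json
```

Pinning both the sub and aud claims means only tokens issued to that one ServiceAccount can assume the role.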

Install

The fastest path is the guided AWS workspace generator. It writes the RDS Terraform inputs, installer env, and phase scripts into one operator-owned directory:

tar -xzf nebula-enterprise-<version>.tar.gz
cd nebula-enterprise-<version>/
sha256sum -c checksums.txt

./nebula-enterprise init aws \
  --output-dir nebula-prod-install \
  --eks-cluster-name <cluster-name> \
  --ecr-registry <account-id>.dkr.ecr.us-east-1.amazonaws.com \
  --vpc-id <vpc-id> \
  --subnet-id <private-subnet-a> \
  --subnet-id <private-subnet-b> \
  --allowed-security-group <eks-node-or-pod-sg> \
  --s3-bucket <bucket-name> \
  --service-account-role-arn <nebula-irsa-role-arn> \
  --eso-secret-path <secrets-manager-path> \
  --domain nebula.<your-domain> \
  --ingress-certificate-arn <acm-cert-arn>

./nebula-prod-install/scripts/00-seed-app-secret.sh
./nebula-prod-install/scripts/01-provision-rds.sh
./nebula-prod-install/scripts/02-install-nebula.sh
./nebula-prod-install/scripts/03-verify.sh

The phase scripts create the AWS Secrets Manager app secret at --eso-secret-path, create or update RDS, append the generated Postgres connection settings to nebula-install.env, fetch the RDS-managed admin password into PGPASSWORD, bootstrap and verify the Nebula database, run Helm, and verify the rollout.

Export provider keys such as OPENAI_API_KEY before running 00-seed-app-secret.sh. To update an existing secret, edit nebula-prod-install/app-secret.json and rerun the script with UPDATE_APP_SECRET=1.

For custom infrastructure or a bring-your-own database, use the lower-level wrapper directly:

./enterprise/install-k8s.sh --print-env-template > nebula-install.env
# Fill in nebula-install.env with EKS, ECR, Postgres, S3, IAM, secret, and ingress values.
./enterprise/install-k8s.sh --env-file nebula-install.env

The wrapper’s image-import phase verifies the same checksums before loading or pushing images; the manual check above catches a corrupted bundle before you edit the install env. The lower-level steps below are the same operations the shell wrapper performs.

1. Import images to ECR

The bundle’s images.tar contains every pinned image and bundle-manifest.json maps those images to Helm values. Use the bundled helper to load, retag, push, and generate digest-pinned image values:

./nebula-enterprise images import \
  --provider aws \
  --aws-region us-east-1 \
  --registry <account-id>.dkr.ecr.us-east-1.amazonaws.com \
  --repository-prefix nebula \
  --create-repositories \
  --output helm/values.images.generated.yaml

By default this imports the Nebula first-party images (nebulaRuntime, graphEngine, and the bundled Nebula postgres). For air-gapped EKS with no public-registry egress, add --scope all to also mirror the Orchestration, Queue, BusyBox, and Orchestration Postgres images. If you mirror images manually instead of using the helper, use the source references from bundle-manifest.json; BusyBox is shipped as public.ecr.aws/docker/library/busybox:1.37.0.

2. Seed secrets in AWS Secrets Manager [#secrets-bootstrap]

If you’re using ESO (recommended), put one JSON blob at the path you’ll reference under secrets.esoAws.awsSecretPath:

{
  "OPENAI_API_KEY": "sk-...",
  "NEBULA_SECRET_KEY": "<random 32 bytes hex>",
  "NEBULA_SERVICE_API_KEY": "<random 32 bytes hex>",
  "NEBULA_WEBHOOK_HMAC_SECRET": "<random 32 bytes hex>",
  "NEBULA_JWT_PRIVATE_KEY_PEM": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----",
  "NEBULA_JWT_KID": "<stable per-deployment value>",
  "NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON": "[]",
  "NEBULA_INTERNAL_WAKE_TOKEN": "<random 32 bytes hex>",
  "NEBULA_VECTOR_BUILD_ORCHESTRATION_TRIGGER_TOKEN": "<random 32 bytes hex>"
}
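
If you seed the blob by hand, the `<random 32 bytes hex>` values can be generated from the kernel’s random source. This is a minimal sketch using only standard utilities; random_hex is a hypothetical helper. It prints to the terminal, so redirect or clear the output as your secret-handling policy requires.

```shell
# Sketch: generate the "<random 32 bytes hex>" values from /dev/urandom.
# random_hex is a hypothetical helper; it prints N random bytes as 2N lowercase hex characters.
random_hex() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Print one fresh value per key that takes a random hex secret.
for key in NEBULA_SECRET_KEY NEBULA_SERVICE_API_KEY NEBULA_WEBHOOK_HMAC_SECRET \
           NEBULA_INTERNAL_WAKE_TOKEN NEBULA_VECTOR_BUILD_ORCHESTRATION_TRIGGER_TOKEN; do
  printf '%s=%s\n' "$key" "$(random_hex 32)"
done
```

Where OpenSSL is installed, openssl rand -hex 32 produces a value of the same shape.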

NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON can stay [] on a fresh install. Populate it only during JWT signing-key rotation; see Service authentication.

When you run the guided PROVISION_POSTGRES=1 path, nebula-enterprise postgres provision creates the Postgres credential Kubernetes Secret directly and the installer immediately runs nebula-enterprise postgres verify against the same contract. If you instead sync database credentials from AWS Secrets Manager, keep them separate from the app-secret JSON, preserve the chart’s Secret shape, and run nebula-enterprise postgres verify before Helm install:

postgres.credentialsSecret: Kubernetes Secret with username + password keys (those exact lowercase key names — the chart reads them via secretKeyRef.key: username / .key: password).
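
For a bring-your-own database without ESO, a Secret of that shape can also be created directly. This is a sketch only: create_pg_credentials_secret is a hypothetical helper, the Secret name is whatever you reference in postgres.credentialsSecret, and the namespace must already exist.

```shell
# Sketch: create a Secret with the exact lowercase keys the chart reads.
# create_pg_credentials_secret is a hypothetical helper; all arguments are placeholders.
create_pg_credentials_secret() {
  local namespace="$1" secret_name="$2" pg_user="$3" pg_password="$4"
  kubectl -n "$namespace" create secret generic "$secret_name" \
    --from-literal=username="$pg_user" \
    --from-literal=password="$pg_password"
}

# Example: create_pg_credentials_secret nebula <secret-name> <db-user> "$PGPASSWORD"
```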

3. Copy + fill the reference values file

The bundle ships a reference values file at helm/examples/eks/values.yaml with every AWS-specific knob pre-wired (Karpenter, gp3, IRSA, ALB, RDS+S3, ESO). Copy it, fill in the <placeholder> markers (account ID, RDS endpoint, IRSA role ARN, ACM cert ARN, S3 bucket name), and save as your-values.yaml.

4. Install

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml \
  -f helm/values.images.generated.yaml

_common/production-sizing.yaml is the shared production-shape sizing block (replicas, CPU/memory requests + limits, persistence) used by all three cloud-managed K8s examples (EKS/AKS/GKE). Omit it to keep the chart’s minimal-dev defaults; override per-workload in your-values.yaml to fit your cluster.

The chart runs schema migrations and catalog-apply automatically via a per-revision Job (<release>-nebula-migrations-<revision>). API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row.

The Job requires releaseContract.releaseId and releaseContract.gitSha to be set. bundle.sh stamps both into the bundled values file and they are consumed automatically, so no manual intervention is needed on a standard install. If you supply your own values file without those keys, the chart fails at template time rather than rendering a malformed migration.
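
If API or worker pods sit in their init container, the migration Job is the first thing to inspect. The sketch below waits for a Job to complete and prints its recent logs on failure. wait_for_migrations is a hypothetical helper, and the Job name is deliberately an argument: look it up with kubectl -n <ns> get jobs rather than guessing the revision suffix.

```shell
# Sketch: block until a migration Job completes; on failure, show its recent logs.
# wait_for_migrations is a hypothetical helper; pass the Job name found via kubectl get jobs.
wait_for_migrations() {
  local namespace="$1" job_name="$2" timeout="${3:-600s}"
  if kubectl -n "$namespace" wait --for=condition=complete \
       "job/${job_name}" --timeout="$timeout"; then
    echo "migrations complete: ${job_name}"
  else
    kubectl -n "$namespace" logs "job/${job_name}" --tail=50
    return 1
  fi
}

# Example: wait_for_migrations nebula <job-name> 900s
```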

5. Verify

kubectl -n nebula get pods
kubectl -n nebula get ingress nebula
# Once the ALB is provisioned:
curl -fsS https://nebula.<your-domain>/v1/health
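
Because the ALB and its DNS record take time to appear, the health check is easier to script with a retry loop. This is a generic sketch, not something the bundle ships; retry and the NEBULA_HEALTH_URL variable are hypothetical names.

```shell
# Sketch: retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>  (retry is a hypothetical helper)
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the health endpoint for up to five minutes.
# retry 30 10 curl -fsS "$NEBULA_HEALTH_URL"
```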

Upgrade

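Pull the new bundle, import the new images into your ECR (regenerating helm/values.images.generated.yaml), then upgrade with the same values files you installed with. When -f is passed, Helm does not carry values over from the previous release, so any file you omit reverts to chart defaults:

helm upgrade nebula ./helm/nebula-<new-version>.tgz \
  -n nebula \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml \
  -f helm/values.images.generated.yaml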

The upgrade is a rolling update, with no downtime on the API tier once the target database has been migrated for the new bundle version.

Sizing reference

The example values file ships with production-shape defaults for a starter deployment. Scale from there based on measured throughput:

Workload	Starter	When to scale
API	2 replicas, 1 CPU / 2-4 GB	HPA on CPU >70% sustained
Worker	2 replicas, 2 CPU / 4-8 GB	HPA on queue depth (Orchestration metric)
Graph engine	2 replicas, 2 CPU / 4-8 GB	Manual; restart-sensitive (WAL replay)
Compactor	1 replica, 1 CPU / 2-4 GB	Single-writer; do not scale horizontally
Queue	1 replica, 8 GB PVC	Single-broker is fine up to ~10k workflows/min

Karpenter + long-running pods

The example values file sets karpenter.enabled=true, which adds karpenter.sh/do-not-disrupt: "true" to the API, worker, graph-engine, compactor, Orchestration engine, and TEI embedding pods. This prevents Karpenter consolidation or drift from killing pods mid-ingest, mid-graph-build, mid-snapshot, or during a warm embedding-cache window. Pods still drain on actual node lifecycle events (rolling update, manual kubectl drain). If you’re on Cluster Autoscaler instead of Karpenter, leave karpenter.enabled=false. CA respects PDBs by default and doesn’t need the annotation.

Karpenter NodePools

The bundle ships ready-to-apply Karpenter resources under helm/examples/eks/karpenter/. That directory contains EC2NodeClass and NodePool manifests for the following pool shapes:

Pool	Purpose
`general`	API and worker pods (m6i/m7i/c7i families)
`argocd`	ArgoCD controller (if cluster-level GitOps)
`orchestration`	Orchestration engine (burst-tolerant)
`workers`	Orchestration task workers (high-CPU)
`graph-engine`	Graph-engine memory profile (`r7i` family)
`llm`	Legacy SaaS pool, arm64 CPU only (`c7g.large`). It predates customer in-cluster inference and is kept for backward compatibility. Do not target this pool for GPU vLLM.
`embedding-gpu`	TEI embedding pods (G6/L4, taint `embedding-gpu=true:NoSchedule`, label `worker-pool: embedding-gpu`)

The directory also includes a legacy embedding.yaml for the old CPU vLLM embedding pool. Do not apply it for TEI embeddings.

These manifests are pool-shape definitions only. Bringing up Karpenter itself (controller install, IAM role for Karpenter, SQS queue, EventBridge rules) is a separate prerequisite covered upstream at karpenter.sh/docs/getting-started/. Apply only the pool manifests your deployment uses, after the controller is running:

kubectl apply -f helm/examples/eks/karpenter/general.yaml
kubectl apply -f helm/examples/eks/karpenter/orchestration.yaml
kubectl apply -f helm/examples/eks/karpenter/workers.yaml
kubectl apply -f helm/examples/eks/karpenter/graph-engine.yaml
kubectl apply -f helm/examples/eks/karpenter/embedding-gpu.yaml

The embedding-gpu pool is needed when teiEmbedding.enabled: true. If completions also run in-cluster, provision an x86_64 GPU pool for the vLLM instruction profile and set vllm.profiles[*].nodeSelector to match it.

In-cluster inference (option B)

If you are air-gapped or have a strict latency budget, Nebula can run model inference inside the cluster. On EKS, the recommended shape is vLLM for completions and TEI GPU embeddings for memory search. See In-cluster models for full architecture, sizing, and HuggingFace token provisioning.

Hybrid topologies are first-class: use external completions with in-cluster TEI embeddings when you only need the memory-search hot path inside the VPC.

To enable full in-cluster inference on EKS, stack the in-cluster overlay on top of your base values. The overlay sets llm.completion.mode: inCluster and llm.embedding.mode: inCluster, enables the vLLM instruction profile, and enables teiEmbedding behind the tei-embedding Service:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/eks/values.yaml \
  -f helm/examples/eks/values-vllm-inCluster.yaml

The embedding-gpu Karpenter NodePool and NVIDIA device plugin must exist before TEI embedding pods can schedule. If completions also run in-cluster, the vLLM instruction profile needs a separate GPU NodePool that matches its node selector and tolerations.
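
When TEI pods stay Pending, one quick check is whether any node in the pool advertises GPUs. The sketch below assumes the worker-pool: embedding-gpu label from the pool table; the function name is hypothetical. An empty result can simply mean Karpenter has not launched a node for the pool yet.

```shell
# Sketch: list embedding-gpu nodes with their allocatable nvidia.com/gpu count.
# Assumes the worker-pool=embedding-gpu node label; list_embedding_gpu_nodes is a hypothetical helper.
list_embedding_gpu_nodes() {
  kubectl get nodes -l worker-pool=embedding-gpu \
    -o 'jsonpath={range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Example: list_embedding_gpu_nodes
```

A node that appears with an empty GPU column usually points at the NVIDIA device plugin not running on it.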

Troubleshooting

API pods fail to connect to Postgres

Your postgres.credentialsSecret may be missing or may not have the expected keys. The Secret must contain username and password, with those exact lowercase key names, because the chart reads them via secretKeyRef.key: username / .key: password. If you’re using ESO, check that the ExternalSecret resource synced before the API and worker pods started.
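
The key names can be checked without printing the secret values. This is a sketch; check_pg_secret_keys is a hypothetical helper, and the Secret name is the one you set in postgres.credentialsSecret.

```shell
# Sketch: verify a Secret carries the exact lowercase keys the chart reads.
# check_pg_secret_keys is a hypothetical helper; it lists key names only, never values.
check_pg_secret_keys() {
  local namespace="$1" secret_name="$2" keys key missing=0
  keys="$(kubectl -n "$namespace" get secret "$secret_name" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}')" || return 1
  for key in username password; do
    if ! printf '%s\n' "$keys" | grep -qx "$key"; then
      echo "missing key: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_pg_secret_keys nebula <secret-name> && echo "secret shape OK"
```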

Graph-engine pod crashlooping with 'AccessDenied' on S3

Either (a) the IRSA role isn’t attached to the ServiceAccount, or (b) the role’s policy doesn’t include the bucket ARN. Check kubectl -n nebula describe sa nebula-sa for the eks.amazonaws.com/role-arn annotation, and trace the IAM policy attached to that role.
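
The annotation lookup can be scripted so that a missing role ARN fails loudly. This is a sketch; sa_role_arn is a hypothetical helper, and the follow-up command in the comment lists the inline policies on the role it finds.

```shell
# Sketch: print the IRSA role ARN annotated on a ServiceAccount; fail if it is absent.
# sa_role_arn is a hypothetical helper.
sa_role_arn() {
  local namespace="$1" sa_name="$2" arn
  arn="$(kubectl -n "$namespace" get sa "$sa_name" \
    -o 'jsonpath={.metadata.annotations.eks\.amazonaws\.com/role-arn}')" || return 1
  if [ -z "$arn" ]; then
    echo "no eks.amazonaws.com/role-arn annotation on ${sa_name}" >&2
    return 1
  fi
  printf '%s\n' "$arn"
}

# Example: list the inline policies on the annotated role.
# aws iam list-role-policies --role-name "$(sa_role_arn nebula nebula-sa | sed 's|.*/||')"
```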

ALB ingress shows 'no targets' after install

The AWS Load Balancer Controller takes 30-60s to provision the ALB on first install. Check kubectl -n kube-system logs deploy/aws-load-balancer-controller for any IAM permission errors on the controller’s IRSA role.

Postgres extension missing on first start

RDS doesn’t create extensions just because they are available. vector, pg_partman, and pg_cron must be available to the master user; if your parameter group restricts extension installs with rds.allowed_extensions, include all three there. shared_preload_libraries must include pg_cron, and cron.database_name must match the Nebula database before bootstrap runs migrations.
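
These settings can be inspected from any host that can reach the database. The sketch below assumes psql is installed and PGHOST, PGUSER, and PGPASSWORD are exported; check_pg_prereqs is a hypothetical helper. If pg_cron is not preloaded, the last statement errors with an unrecognized-parameter message, which is itself the diagnosis.

```shell
# Sketch: report extension availability and pg_cron settings on the target database.
# Assumes psql plus PGHOST/PGUSER/PGPASSWORD in the environment; check_pg_prereqs is a hypothetical helper.
check_pg_prereqs() {
  local dbname="$1"
  psql -d "$dbname" -v ON_ERROR_STOP=1 -At <<'SQL'
SELECT name, coalesce(installed_version, 'not installed')
  FROM pg_available_extensions
 WHERE name IN ('vector', 'pg_partman', 'pg_cron')
 ORDER BY name;
SHOW shared_preload_libraries;
SHOW cron.database_name;
SQL
}

# Example: check_pg_prereqs <nebula-database-name>
```

An extension missing from the first result set is not available on the instance at all; one listed as not installed is available but has not been created yet.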
