On-premises Kubernetes

This guide covers a production (or evaluation) Nebula install on on-premises Kubernetes — bare-metal clusters, VMware Tanzu, OpenStack, or any CNCF-conformant cluster without a public cloud identity layer (no IRSA, no Workload Identity). Secrets are managed inline or via a private vault. Storage is local-path, Longhorn, or TopoLVM.

Prereqs

Cluster

Kubernetes 1.26+ (matches the chart’s kubeVersion minimum)
kubectl access with permission to create namespaces, Deployments, StatefulSets, PVCs, and Ingresses

Addons + controllers

Component	Purpose	Notes
ingress-nginx	HTTP/HTTPS ingress	kubernetes.github.io/ingress-nginx
cert-manager	TLS from Let’s Encrypt or internal CA	cert-manager.io/docs
Local storage provisioner	PVCs for graph-engine, compactor, Postgres, Queue	local-path, Longhorn, or TopoLVM
DynamoDB-compatible service	Durable orchestration state	AWS DynamoDB or an internally operated DynamoDB-compatible endpoint with string `pk` / `sk` keys and support for `GetItem`, `PutItem`, `UpdateItem`, `DeleteItem`, `Query`, `BatchGetItem`, `BatchWriteItem`, `TransactWriteItems`, and `DescribeTable`; helper-managed tables also require `CreateTable`, `DescribeTimeToLive`, and `UpdateTimeToLive`
External Secrets Operator (optional)	Sync from HashiCorp Vault or other backend	Only needed if you have a private secrets store

Storage class: the chart defaults to the cluster’s default storage class when storageClass.name is empty. For on-prem clusters that ship with local-path (RKE2), leave storageClass.name unset. For Longhorn or TopoLVM, set storageClass.name to the provisioner’s class name (e.g. longhorn or topolvm-provisioner).

TLS: if your cluster is internal-only and you have a corporate CA, configure cert-manager with a ClusterIssuer backed by your CA’s private key. For clusters with internet access, use the standard Let’s Encrypt ACME ClusterIssuer.
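As a rough illustration of the corporate-CA path, the sketch below prints a cert-manager CA ClusterIssuer manifest. The issuer name corp-ca and the Secret name corp-ca-key-pair are placeholders chosen for this example, not names the chart expects. The Secret is assumed to hold the CA certificate and key (tls.crt and tls.key) in the namespace cert-manager uses for cluster-scoped resources, which is cert-manager by default.

```shell
# Print a cert-manager CA ClusterIssuer manifest.
# $1 = issuer name (placeholder), $2 = Secret holding the CA's tls.crt and tls.key.
render_ca_issuer() {
  issuer_name=$1
  ca_secret=$2
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${issuer_name}
spec:
  ca:
    secretName: ${ca_secret}
EOF
}

# Usage (requires kubectl and an installed cert-manager):
#   render_ca_issuer corp-ca corp-ca-key-pair | kubectl apply -f -
```

Printing the manifest rather than applying it directly makes it easy to review or commit to a GitOps repository first.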

Postgres

For evaluation, the chart ships a single-replica Postgres StatefulSet (postgres.mode: bundled). This is safe for testing but not for production: the bundled StatefulSet has no HA, no automated backup, and no streaming replication.

For production, use an external PostgreSQL 16 server. If you have an empty external server and admin credentials, the bundle can create the Nebula role, logical database, required extensions, and chart credential Secret:

./nebula-enterprise postgres provision \
  --namespace nebula \
  --admin-url "postgresql://postgres@pg.example.internal:5432/postgres?sslmode=require"

Set PGPASSWORD in the environment to the admin role's password before running the command. Pass --nebula-host only when application pods should connect through a hostname different from the --admin-url host.

If your platform team provisions databases separately, mirror the same contract: a Nebula user and database, the required extensions installed in the Nebula database, and a Kubernetes Secret with username and password keys. pg_cron also requires shared_preload_libraries=pg_cron and cron.database_name set to the Nebula database name before bootstrap.

Then run the read-only verifier before pointing postgres.mode: external at the server:

./nebula-enterprise postgres verify \
  --namespace nebula \
  --admin-url "postgresql://postgres@pg.example.internal:5432/postgres?sslmode=require"
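When the database is provisioned by a platform team, the pg_cron preload requirement is easy to miss. The helper below is a minimal sketch that checks whether a shared_preload_libraries value lists pg_cron; has_pg_cron is a hypothetical name and is not part of the nebula-enterprise CLI.

```shell
# Return 0 if a shared_preload_libraries value lists pg_cron.
# The value is a comma-separated list, e.g. "pg_stat_statements, pg_cron".
has_pg_cron() {
  printf '%s\n' "$1" | tr ',' '\n' | tr -d ' \t' | grep -qx 'pg_cron'
}

# Usage (requires psql and PGPASSWORD; the URL mirrors the example above):
#   libs=$(psql "postgresql://postgres@pg.example.internal:5432/postgres?sslmode=require" \
#            -Atc 'SHOW shared_preload_libraries')
#   has_pg_cron "$libs" || echo "pg_cron is not preloaded" >&2
```

Changing shared_preload_libraries requires a PostgreSQL restart, so it is worth confirming before the bootstrap window.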

Install

1. Load images from the bundle

tar -xzf nebula-enterprise-<version>.tar.gz
cd nebula-enterprise-<version>/
sha256sum -c checksums.txt
docker load -i images.tar

For an air-gapped cluster with a private registry, retag and push to your internal registry:

REGISTRY=registry.corp.example.com

docker tag nebula:enterprise-<version>              "${REGISTRY}/nebula/nebula-runtime:<version>"
docker tag nebula-graph-engine:enterprise-<version> "${REGISTRY}/nebula/graph-engine:<version>"
docker tag nebula-postgres:enterprise-<version>     "${REGISTRY}/nebula/postgres:<version>"
docker push "${REGISTRY}/nebula/nebula-runtime:<version>"
docker push "${REGISTRY}/nebula/graph-engine:<version>"
docker push "${REGISTRY}/nebula/postgres:<version>"
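The retag-and-push commands above can also be generated from a small table, which helps when the version changes on every upgrade. The sketch below only prints the commands; mirror_plan is a hypothetical helper name, and 1.2.3 in the usage comment is a made-up version, not a real release.

```shell
# Print the docker tag/push commands for the three Nebula images.
# $1 = target registry, $2 = bundle version. Nothing is executed here.
mirror_plan() {
  registry=$1
  version=$2
  # "<loaded image name> <target repository>" pairs, taken from the commands above.
  while read -r src dest; do
    echo "docker tag ${src}:enterprise-${version} ${registry}/nebula/${dest}:${version}"
    echo "docker push ${registry}/nebula/${dest}:${version}"
  done <<EOF
nebula nebula-runtime
nebula-graph-engine graph-engine
nebula-postgres postgres
EOF
}

# Usage: review the plan, then pipe it to sh to run it.
#   mirror_plan registry.corp.example.com 1.2.3
#   mirror_plan registry.corp.example.com 1.2.3 | sh -e
```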

For third-party images, push to the same registry:

docker tag public.ecr.aws/docker/library/busybox:1.37.0       "${REGISTRY}/busybox:1.37.0"
docker push "${REGISTRY}/busybox:1.37.0"

Then set the mirrored repositories in your values file:

image:
  busybox:
    repository: busybox

2. Provision secrets

Option A: inline Kubernetes Secrets (simplest, not recommended for production)

Use secrets.backend: raw and put secret values directly in your values file:

secrets:
  backend: raw
  values:
    OPENAI_API_KEY: "sk-..."
    NEBULA_SECRET_KEY: "<random 32 bytes hex>"
    NEBULA_SERVICE_API_KEY: "<random 32 bytes hex>"
    NEBULA_WEBHOOK_HMAC_SECRET: "<random 32 bytes hex>"
    NEBULA_JWT_PRIVATE_KEY_PEM: |
      -----BEGIN PRIVATE KEY-----
      ...
      -----END PRIVATE KEY-----
    NEBULA_JWT_KID: "<stable per-deployment value>"
    NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON: "[]"
    NEBULA_INTERNAL_WAKE_TOKEN: "<random 32 bytes hex>"
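One way to produce the "random 32 bytes hex" placeholders is sketched below. It assumes each of those values is expected as a 64-character lowercase hex string, which is how the placeholder reads; rand_hex32 is a hypothetical helper name. Where OpenSSL is installed, openssl rand -hex 32 produces the same shape of value.

```shell
# Print 32 random bytes as 64 lowercase hex characters.
rand_hex32() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Usage: generate one value per secret and paste each into the values file.
#   for name in NEBULA_SECRET_KEY NEBULA_SERVICE_API_KEY \
#               NEBULA_WEBHOOK_HMAC_SECRET NEBULA_INTERNAL_WAKE_TOKEN; do
#     printf '%s: "%s"\n' "$name" "$(rand_hex32)"
#   done
```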

NEBULA_JWT_RETIRED_PUBLIC_KEYS_JSON can stay [] on a fresh install. Populate it only during JWT signing-key rotation; see Service authentication.

Option B: HashiCorp Vault via ESO

Install ESO, configure a ClusterSecretStore pointing at your Vault instance, then use secrets.backend: eso-vault:

secrets:
  backend: eso-vault
  esoVault:
    secretStoreRef:
      name: vault-backend
      kind: ClusterSecretStore
    vaultPath: secret/data/nebula
    refreshInterval: 5m
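The values above reference a ClusterSecretStore named vault-backend. A minimal sketch of such a store follows, assuming a KV v2 engine mounted at secret (consistent with the secret/data/nebula path) and Vault's Kubernetes auth method. The Vault URL, role, auth mount path, and service account are placeholders, and the apiVersion depends on the ESO release you run (newer releases serve external-secrets.io/v1), so check it against your installed CRDs.

```shell
# Print a ClusterSecretStore manifest for a Vault KV v2 mount.
# $1 = Vault URL, $2 = Vault Kubernetes-auth role (both placeholders).
render_vault_store() {
  vault_url=$1
  vault_role=$2
  cat <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "${vault_url}"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "${vault_role}"
          serviceAccountRef:
            name: "external-secrets"
            namespace: "external-secrets"
EOF
}

# Usage (requires kubectl and an installed External Secrets Operator):
#   render_vault_store https://vault.corp.example.com:8200 nebula | kubectl apply -f -
```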

3. Copy + fill the reference values file

The bundle ships helm/examples/onprem/values.yaml with sensible on-prem defaults (bundled Postgres for evaluation, local-path storage, nginx ingress, raw secrets). Copy it, fill in the <placeholder> markers (domain name, object storage endpoint, LLM API base), and save as your-values.yaml.

For a production on-prem install with external Postgres:

Set postgres.mode: external and fill in .host, .port, .database, and .credentialsSecret
Remove or comment out the bundled Postgres persistence blocks

If you use ./nebula-enterprise postgres provision, set postgres.credentialsSecret: nebula-postgres-credentials unless you passed a custom secret name to the helper.

4. Object storage

On-premises S3-compatible object storage options:

MinIO (recommended for simplicity): run MinIO alongside the cluster or as a StatefulSet inside it. Set objectStorage.endpoint: http://minio.minio.svc:9000, forcePathStyle: true, and store MinIO root credentials in objectStorage.credentialsSecret.
Ceph RGW: configure Ceph’s Rados Gateway. Set the RGW endpoint, region (or empty string), and HMAC credentials.
Cloudflare R2 / Wasabi: external but S3-compatible. Set the appropriate endpoint; forcePathStyle depends on the provider.

5. Orchestration state and broker

The on-prem chart defaults to the local polling transport. Durable orchestration state lives in DynamoDB-compatible tables configured through NEBULA_ORCHESTRATION_DYNAMODB_STATE_TABLE, NEBULA_ORCHESTRATION_DYNAMODB_RUNTIME_TABLE, NEBULA_ORCHESTRATION_DYNAMODB_EVENTS_TABLE, NEBULA_ORCHESTRATION_DYNAMODB_TRANSITIONS_TABLE, and NEBULA_ORCHESTRATION_DYNAMODB_ENDPOINT_URL; workers claim runnable tasks through Nebula’s internal API.

Before Helm install, create or verify the four orchestration tables and writer-authority records. The helper uses the AWS CLI, so point the CLI at a DynamoDB-compatible endpoint and credentials reachable from the machine running the helper. If the service is cluster-internal, port-forward it for this command and keep the in-cluster service URL in Helm values.

./nebula-enterprise orchestration dynamodb ensure \
  --project-name nebula_default \
  --aws-region us-east-1 \
  --writer-region us-east-1 \
  --endpoint-url http://localhost:8000 \
  --state-table nebula-orchestration-state \
  --runtime-table nebula-orchestration-runtime \
  --events-table nebula-orchestration-events \
  --transitions-table nebula-orchestration-transitions
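After the helper finishes, the table list can be double-checked with the AWS CLI. The function below is a small sketch, not part of the bundle: it reads table names on stdin and reports any of the four example table names that are missing. check_orchestration_tables is a hypothetical name, and the table names are the ones used in the command above.

```shell
# Read table names (whitespace-separated) on stdin and report any of the
# four expected orchestration tables that are missing. Returns 1 if any are.
check_orchestration_tables() {
  found=$(tr -s '[:space:]' '\n')
  missing=0
  for t in nebula-orchestration-state nebula-orchestration-runtime \
           nebula-orchestration-events nebula-orchestration-transitions; do
    if ! printf '%s\n' "$found" | grep -qx "$t"; then
      echo "missing table: $t" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (requires the AWS CLI pointed at the same endpoint as the helper):
#   aws dynamodb list-tables --endpoint-url http://localhost:8000 \
#     --region us-east-1 --query 'TableNames[]' --output text \
#     | check_orchestration_tables
```

This only confirms that the tables exist; it does not inspect key schemas or writer-authority records, which remain the helper's job.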

Then set the same table names and endpoint in your-values.yaml:

secrets:
  values:
    NEBULA_ORCHESTRATION_DYNAMODB_STATE_TABLE: nebula-orchestration-state
    NEBULA_ORCHESTRATION_DYNAMODB_RUNTIME_TABLE: nebula-orchestration-runtime
    NEBULA_ORCHESTRATION_DYNAMODB_EVENTS_TABLE: nebula-orchestration-events
    NEBULA_ORCHESTRATION_DYNAMODB_TRANSITIONS_TABLE: nebula-orchestration-transitions
    NEBULA_ORCHESTRATION_DYNAMODB_ENDPOINT_URL: http://dynamodb.<namespace>.svc.cluster.local:8000

For larger HA clusters, use an externally managed RabbitMQ cluster as the delivery broker:

enterprise:
  orchestrationQueueTransport: rabbitmq
  rabbitmqExchangeName: nebula.orchestration
  rabbitmqQueuePrefix: nebula.orchestration
  rabbitmqQueueType: quorum
  rabbitmqPrefetchCount: 64
  # Optional mTLS Secret with ca.crt, tls.crt, and tls.key.
  rabbitmqTlsSecretName: nebula-rabbitmq-client-tls

secrets:
  values:
    NEBULA_ORCHESTRATION_RABBITMQ_URL: amqps://<user>:<password>@rabbitmq.<namespace>.svc.cluster.local:5671/<vhost>

RabbitMQ remains a broker only; DynamoDB-compatible orchestration tables stay authoritative for workflow state. The RabbitMQ principal must be allowed to declare durable direct exchanges, durable quorum queues, retry bucket queues with TTL and dead-letter routing, and publish persistent messages.

The bundle also includes helm/examples/onprem-rabbitmq-ha/values.yaml as a ready-to-layer overlay for this mode:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/onprem/values.yaml \
  -f helm/examples/onprem-rabbitmq-ha/values.yaml \
  -f your-values.yaml

6. Install

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f your-values.yaml

The chart runs schema migrations and catalog-apply automatically via a per-revision Job (<release>-nebula-migrations-<revision>); API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row. releaseContract.releaseId and releaseContract.gitSha are stamped by bundle.sh and consumed automatically.
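If API or worker pods sit in their init container for a long time, the migration Job is the first thing to inspect. The sketch below builds the Job name from the pattern above; migration_job_name is a hypothetical helper, and the assumption that the revision is the Helm release revision (1 on a first install) should be confirmed with helm history.

```shell
# Build the per-revision migration Job name: <release>-nebula-migrations-<revision>.
migration_job_name() {
  printf '%s-nebula-migrations-%s\n' "$1" "$2"
}

# Usage (requires kubectl; the 10m timeout is an arbitrary illustration):
#   job=$(migration_job_name nebula 1)
#   kubectl -n nebula wait --for=condition=complete "job/${job}" --timeout=10m
#   kubectl -n nebula logs "job/${job}"
```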

7. Verify

kubectl -n nebula get pods
kubectl -n nebula get ingress nebula
curl -fsS https://nebula.<your-domain>/v1/health
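Right after an install the health endpoint may take a while to answer, so a single curl can fail spuriously. A minimal polling sketch follows; wait_for_health, the HEALTH_PROBE override, and the default of 30 attempts five seconds apart are illustrative choices, not values taken from the chart.

```shell
# Poll a health URL until it answers successfully or attempts run out.
# $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 5).
# HEALTH_PROBE can override the probe command, which is handy for testing.
wait_for_health() {
  url=$1
  tries=${2:-30}
  delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if ${HEALTH_PROBE:-curl -fsS -o /dev/null} "$url"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage:
#   wait_for_health "https://nebula.<your-domain>/v1/health" || echo "not healthy yet" >&2
```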

Upgrade

Pull the new bundle, load/push new images, then:

helm upgrade nebula ./helm/nebula-<new-version>.tgz \
  -n nebula \
  -f your-values.yaml

Sizing reference

Workload	Starter	When to scale
API	2 replicas, 1 CPU / 2-4 GB	HPA on CPU >70% sustained
Worker	2 replicas, 2 CPU / 4-8 GB	HPA on queue depth (Orchestration metric)
Graph engine	2 replicas, 2 CPU / 4-8 GB	Manual; restart-sensitive (WAL replay)
Compactor	1 replica, 1 CPU / 2-4 GB	Single-writer; do not scale horizontally
Queue	1 replica, 8 GB PVC	Single-broker is fine up to ~10k workflows/min

For an evaluation single-node cluster, reducing to replicas: 1 on all workloads and using postgres.mode: bundled keeps the footprint under 16 GB RAM total.

For production deploys the bundle ships a shared sizing overlay at helm/examples/_common/production-sizing.yaml (the same overlay used by EKS/AKS/GKE). Stack it before your on-prem values file to get production-shape replicas and resource requests:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml

Pod Security Admission

The Nebula-built workloads (api, worker, graph-engine, graph-engine-compactor, the migration Job, and the vLLM sub-chart Deployments) comply with the restricted Pod Security Standard out of the box: non-root user, dropped capabilities, seccompProfile: RuntimeDefault, no privilege escalation.

The bundled third-party postgres-statefulset (postgres.mode: bundled) inherits its upstream image’s default security context and does not carry the restricted-required fields today. If you label the release namespace to enforce restricted before validating this pod, admission can reject it.

Recommended approach:

For production deployments, set postgres.mode: external so the bundled StatefulSet is not rendered at all.
If you need bundled Postgres for evaluation, label the namespace at baseline rather than restricted (or use PSA’s warn / audit modes to surface the issues without blocking install).
Only enable restricted enforcement after validating bundled Postgres against your cluster’s policy.

# Evaluation-friendly: enforce baseline; warn and audit on restricted violations.
kubectl label namespace nebula \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

The chart deliberately does not label the namespace itself: Helm’s --create-namespace does not own pre-existing namespaces reliably, and adding namespace ownership to the chart conflicts with operators who manage namespaces separately (GitOps, vCluster, kiosk, etc.).

Prometheus metrics

Pods expose Prometheus-compatible /metrics endpoints and carry prometheus.io/scrape: "true" annotations for clusters that use annotation-based scrape discovery. For clusters running prometheus-operator / kube-prometheus-stack, enable native ServiceMonitor objects:

monitoring:
  serviceMonitor:
    enabled: true
    # Many operator installs key off a `release` label on ServiceMonitors;
    # set it to match your prometheus-operator's serviceMonitorSelector.
    additionalLabels:
      release: kube-prometheus-stack

ServiceMonitor rendering is off by default because ServiceMonitor is a monitoring.coreos.com/v1 CRD: rendering it on a cluster without prometheus-operator fails helm install with no matches for kind "ServiceMonitor".
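Before turning on monitoring.serviceMonitor.enabled, it can help to confirm that the CRD is actually registered. The function below is a small sketch that inspects a list of resource names on stdin; has_servicemonitor_crd is a hypothetical helper name.

```shell
# Read resource names on stdin (one per line) and succeed if the
# ServiceMonitor CRD from prometheus-operator is among them.
has_servicemonitor_crd() {
  grep -qx 'servicemonitors\.monitoring\.coreos\.com'
}

# Usage (requires kubectl):
#   kubectl api-resources --api-group=monitoring.coreos.com -o name \
#     | has_servicemonitor_crd && echo "ServiceMonitor CRD is present"
```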

Troubleshooting

PVCs stuck in Pending — no storage class available

Check that a storage class exists and is set as default: kubectl get storageclass. If using local-path, the provisioner must be running: kubectl -n local-path-storage get pods. Set storageClass.name in your values file to the exact class name if there is no cluster default.
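The default class is the one kubectl marks with "(default)" next to its name. As a rough sketch, the helper below extracts those names from kubectl get storageclass output so the check can be scripted; default_storage_classes is a hypothetical name.

```shell
# Read `kubectl get storageclass` output on stdin and print the name of
# each class marked "(default)". Returns 1 if there is none.
default_storage_classes() {
  awk '$2 == "(default)" { print $1; found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage (requires kubectl):
#   kubectl get storageclass | default_storage_classes \
#     || echo "no default storage class; set storageClass.name explicitly" >&2
```

More than one line of output means several classes are marked default, which is also worth fixing because PVC binding then depends on cluster version and admission behavior.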

API pods fail to connect to bundled Postgres

On a fresh install with postgres.mode: bundled, the Postgres StatefulSet must be ready before the API Deployment. Check kubectl -n nebula get pods — the Postgres pod must be Running before API pods reach Ready. The chart renders a readiness probe on the API that retries for 5 minutes, which is usually enough for bundled Postgres to start. If the pod restarts before Postgres is ready, describe the pod for the specific connect error.

cert-manager fails to issue certificate — ACME challenge not reachable

The Let’s Encrypt ACME HTTP-01 challenge requires the domain to be publicly reachable. For internal-only clusters, either use a DNS-01 challenge (configure cert-manager with DNS provider credentials) or provision certificates from a corporate CA ClusterIssuer. The ingress.tls.secretName in your values file must match the Secret that cert-manager populates, that is, the spec.secretName of the Certificate resource.

Graph-engine startup slow after node restart

The graph-engine replays its WAL on startup — duration scales with segment count and is expected. A single-node cluster that reboots may take 30-120 seconds per replica before graph-engine is fully Ready. Add initialDelaySeconds: 120 to the graph-engine readiness probe via workloads.graphEngine overrides if the default timeouts are too tight for your node restart time.
