Skip to content

Deployment

This is the canonical guide for deploying Neksur to production AWS. It covers the commit data path in depth — the metadata graph, the catalog gateway, OpenLineage ingestion, and the in-process detection pool — running as a stateless cmd/neksur-server binary behind an Application Load Balancer. The same binary also serves the read-path SQL proxy, pgwire, XMLA, GraphQL, and MCP surfaces; the operational concerns below (HA, fail-closed alerting, health checks, backup/DR) apply to the whole process.

For the infrastructure-as-code, deployment modes (SaaS / self-managed / air-gapped), and the Spark writer-side integration, see the new sections near the end of this guide. The target audience is the SRE / platform team operating Neksur for a single tenant or a small fleet of tenants.

Editions. Neksur is a single binary; commercial-tier coordination (multi-engine, defense-in-depth, intelligence) is unlocked by a signed license verified at startup — there is no separate build to deploy. See Editions and tiers.

Companion docs. Read Architecture before this guide — the three-layer enforcement model and the AGE graph foundation set the context for every infrastructure choice below. After install, see Getting Started for the end-to-end dev flow and the REST API Reference for the endpoint surface the gateway exposes.

Operators bringing up a new Neksur deployment need the following before they start:

  • AWS account with permissions to provision RDS (Postgres 16 + AGE 1.6), an Application Load Balancer, an EC2 Auto Scaling Group, ACM certificates, Secrets Manager entries, and (optionally) SNS + SQS for S3 ObjectCreated event delivery.
  • Terraform CLI 1.6+ for driving the private neksur-com/infra repository that holds the production Terraform modules.
  • Iceberg catalog endpoint — a reachable Apache Polaris OR Project Nessie instance. Phase 1 ships live adapters for both; Glue and Unity Catalog adapters are stub-only and rejected at startup until Phase 3.
  • OIDC identity provider — WorkOS is the Phase 0.5 / Phase 1 default; WORKOS_API_KEY, WORKOS_CLIENT_ID, and WORKOS_WEBHOOK_SECRET must be available in Secrets Manager before the gateway will start in SaaS mode.
  • Go 1.24+ on the build host if you are baking your own AMI rather than consuming a published artifact.

Optional but strongly recommended for production:

  • PagerDuty integration key — populates ${PAGERDUTY_INTEGRATION_KEY} in alertmanager.yml and routes severity=page alerts (most importantly PolicyEngineUnavailableAlert) to the on-call rotation.
  • Slack incoming webhook URL — populates ${SLACK_WEBHOOK_URL} for severity=warn alerts (e.g., PolicyDeniedSpikeAlert) and, separately, NEKSUR_SLACK_WEBHOOK_URL for L3 detection findings at confidence ≥ 0.85.
  • AWS Secrets Manager for all of the above; cmd/neksur-server/main.go reads them as plain env vars, and Phase 1 production deployments inject them via the Terraform layer.

A staging tenant with a single replica is acceptable for evaluation; production tenants MUST run with ≥ 2 replicas behind the ALB — see HA + scaling below for the rationale.

The Phase 1 production topology is a single-region AWS deployment with a stateless gateway tier in front of a shared metadata RDS and an optional SQS event source. The diagram below is the operator’s-eye map; for the per-commit pipeline that runs inside each replica, see Architecture.

graph TD
Client["Iceberg client<br/>(Spark / Trino / pyiceberg)"]
ALB["AWS ALB<br/>TLS termination<br/>health check: GET /readyz"]
R1["neksur-server replica 1<br/>L1 gateway + L3 pool<br/>(in-process)"]
R2["neksur-server replica 2<br/>L1 gateway + L3 pool"]
R3["neksur-server replica N<br/>(ASG: min 2, max 4)"]
RDS["Postgres 16 + AGE 1.6<br/>(Pool A, RDS Multi-AZ<br/>for production)"]
Upstream["Upstream Iceberg catalog<br/>(Polaris / Nessie)"]
S3["S3 bucket<br/>(Iceberg data + metadata)"]
SNS["SNS topic<br/>(S3 ObjectCreated, optional)"]
SQS["SQS queue<br/>(shared, competitive consume)"]
Prom["Prometheus<br/>scrape :9100/metrics"]
AM["AlertManager<br/>severity-tier routing"]
PD["PagerDuty<br/>severity=page"]
Slack["Slack<br/>severity=warn + L3 findings"]
Client -->|HTTPS 443| ALB
ALB -->|HTTP 8080| R1
ALB -->|HTTP 8080| R2
ALB -->|HTTP 8080| R3
R1 --> RDS
R2 --> RDS
R3 --> RDS
R1 -->|forward commits| Upstream
R2 -->|forward commits| Upstream
R3 -->|forward commits| Upstream
S3 --> SNS
SNS --> SQS
SQS -->|competitive consume| R1
SQS -->|competitive consume| R2
SQS -->|competitive consume| R3
R1 -.->|/metrics :9100| Prom
R2 -.->|/metrics :9100| Prom
R3 -.->|/metrics :9100| Prom
Prom --> AM
AM -->|severity=page| PD
AM -->|severity=warn| Slack

Component notes:

  • ALB terminates TLS on port 443 with an ACM certificate. Target group health check probes GET /readyz (NOT /healthz — see Health-check endpoints below). Routes POST /v1/iceberg/*, POST /v1/lineage, POST /v1/webhooks/polaris, plus /api/*, /admin/*, /callback, /webhooks/workos, and /webhooks/stripe to the gateway target group.
  • neksur-server ECS / EC2 replicas are stateless from the operator’s perspective for traffic routing — any replica can serve any commit. The L3 detection dedup map is in-process per replica, but cross-replica dedup is enforced at the database layer by the detection_runs.snapshot_metadata_location UNIQUE constraint, so a replica restart is safe. Auto-Scaling Group sized 2 → 4.
  • Postgres 16 + AGE 1.6 RDS is the canonical metadata store (“Pool A”). Single-AZ is acceptable for non-prod; Multi-AZ is required for production so a single-AZ outage does not stop commit traffic. Each gateway replica opens its own pgxpool with the BeforeAcquire DISCARD ALL hook applied; do not add a second pool inside the gateway process.
  • S3 + SNS + SQS is optional and only required when the L3 detection trigger needs to catch direct-to-S3 writes that bypass the gateway. When NEKSUR_S3_EVENTS_QUEUE_URL is unset, only the 30-second poller and the Polaris webhook fire and the SQS goroutine is not started. The SQS queue is shared across replicas and consumed competitively; the first replica to call ReceiveMessage + DeleteMessage wins each message.
  • Upstream Iceberg catalog is your existing Polaris or Nessie deployment. Neksur does not replace it — the gateway proxies commits through to it after policy evaluation. Glue and Unity Catalog adapters are stub-only in Phase 1 and the gateway rejects configurations with catalog_kind = "glue" or "unity" at startup.
  • Prometheus + AlertManager scrape each replica’s /metrics endpoint on port :9100 and apply severity-tier routing (severity=page → PagerDuty; severity=warn → Slack). See Observability for the alert definitions.

Production-specific infrastructure details — Terraform module names, VPC subnet layout, security group rules, IAM role bundles, DNS records — live in the private neksur-com/infra repository and are not enumerated here.

Every variable below is read by cmd/neksur-server/main.go (the Phase 1 entry point). Production deployments source these from AWS Secrets Manager or env injection at the orchestrator layer; local dev uses .env-style overrides.

# Core toggle and observability
NEKSUR_SAAS_AUTH: "1" # REQUIRED in prod — enables runWithSaasAuth (gateway + L3)
NEKSUR_OBSERVABILITY: "1" # REQUIRED in prod — enables OTLP exporter + Prom /metrics
NEKSUR_METRICS_ADDR: ":9100" # Optional override; must match prometheus.yml scrape target
NEKSUR_LISTEN_ADDR: ":8080" # Optional override for the HTTP listener
# Database (Pool A — Postgres 16 + AGE 1.6)
DATABASE_URL: "postgres://neksur:...@..." # REQUIRED — pgxpool DSN; BeforeAcquire DISCARD ALL applied
# WorkOS auth (Phase 0.5 stack)
WORKOS_API_KEY: "sk_live_..." # REQUIRED
WORKOS_CLIENT_ID: "client_..." # REQUIRED
WORKOS_WEBHOOK_SECRET: "whsec_..." # REQUIRED — verified BEFORE SCIM_ENABLED check
WORKOS_INTERNAL_ADMIN_ORG_ID: "org_..." # REQUIRED for /admin/* surface
# L3 detection pool (in-process)
NEKSUR_L3_WORKERS: "4" # Optional; default 4. Raise to 8 if scan lag accumulates
NEKSUR_POLARIS_WEBHOOK_ENABLED: "1" # Optional; "0" returns 410 Gone on /v1/webhooks/polaris
NEKSUR_S3_EVENTS_QUEUE_URL: "" # Optional; when set, starts the SQS long-poll goroutine
NEKSUR_S3_EVENTS_TENANT_ID: "" # CONDITIONAL — required iff NEKSUR_S3_EVENTS_QUEUE_URL is set
NEKSUR_SLACK_WEBHOOK_URL: "" # Optional; when unset, alerts.Slack.Post is a silent no-op
# Billing (only when BILLING_ENABLED=true)
BILLING_ENABLED: "false" # Optional toggle
STRIPE_API_KEY: "" # CONDITIONAL when BILLING_ENABLED=true
STRIPE_WEBHOOK_SECRET: "" # REQUIRED — verified BEFORE BILLING_ENABLED check
# AWS integration
AWS_REGION: "us-east-1" # REQUIRED when NEKSUR_S3_EVENTS_QUEUE_URL is set
PAGERDUTY_SERVICE_ID: "P000000" # Optional; embedded in /admin/incidents UI

Per-variable reference:

VariableRequiredDefaultControls
DATABASE_URLyesPool A RDS DSN. pgxpool config receives BeforeAcquire DISCARD ALL — the load-bearing session-bleed prevention. Never spawn a second pool.
NEKSUR_SAAS_AUTHyes (prod)0Master switch for the SaaS auth + gateway stack. Set to 1 to mount /api/*, /v1/iceberg/*, /v1/lineage, /v1/webhooks/polaris, /admin/*.
NEKSUR_OBSERVABILITYyes (prod)0Enables the OTLP gRPC trace exporter (defaults to localhost:4317) and the Prometheus /metrics server. Must be 1 for the alert rules in Observability to have data.
NEKSUR_METRICS_ADDRno:9100Listener for /metrics. Must match the neksur-graph scrape target in observability/prometheus.yml.
NEKSUR_LISTEN_ADDRno:8080HTTP listener for the gateway + admin routes. ALB target group forwards to this port.
WORKOS_API_KEYyesWorkOS API key (Phase 0.5). Used by workosauth.NewClient for session validation in TenantMiddleware.
WORKOS_CLIENT_IDyesWorkOS OAuth client ID.
WORKOS_WEBHOOK_SECRETyesHMAC secret for POST /webhooks/workos. Signature is verified BEFORE the SCIM_ENABLED check.
WORKOS_INTERNAL_ADMIN_ORG_IDyesWorkOS Org ID for Neksur staff. /admin/* returns 403 unless the session’s org matches.
NEKSUR_L3_WORKERSno4Goroutine-pool size for the in-process L3 detection workers. Raise to 8 if commit traffic is healthy but the detection trigger queue is backing up.
NEKSUR_POLARIS_WEBHOOK_ENABLEDno1Set to 0 to return 410 Gone on POST /v1/webhooks/polaris when the upstream Polaris lacks webhook signing.
NEKSUR_S3_EVENTS_QUEUE_URLnounsetWhen set, starts the SQS long-poll goroutine that drains S3 ObjectCreated events. Format: https://sqs.<region>.amazonaws.com/<acct>/<queue>. Leave unset to rely on the 30-second poller + Polaris webhook only.
NEKSUR_S3_EVENTS_TENANT_IDconditionalunsetREQUIRED iff NEKSUR_S3_EVENTS_QUEUE_URL is set. Phase 1 simplification: one queue per tenant.
NEKSUR_SLACK_WEBHOOK_URLnounsetWhen unset, the Slack client is a silent no-op so the gateway can ship before Slack is configured. Set to https://hooks.slack.com/services/... to enable L3-finding alerts at confidence ≥ 0.85.
STRIPE_WEBHOOK_SECRETyes when BILLING_ENABLED=truePhase 0.5 — verified BEFORE the BILLING_ENABLED branch so spoofed Stripe webhooks are rejected even when billing is off.
BILLING_ENABLEDnofalseToggles billing.NewBilling between the real Stripe path and a no-op stub.
STRIPE_API_KEYconditionalStripe secret key; only consumed when BILLING_ENABLED=true.
AWS_REGIONconditionalREQUIRED when NEKSUR_S3_EVENTS_QUEUE_URL is set OR when CloudWatch metric posting is enabled.
PAGERDUTY_SERVICE_IDnoP000000Embedded into the /admin/incidents page as a deep-link to the PagerDuty dashboard.
ICEBERG_REST_BEARERconditionalOAuth bearer token used by scripts/ci/phase1-gateway-overhead.sh for the nightly overhead drill against both the direct-Polaris baseline and the gateway. Production runtime does not read this directly.

Per-tenant catalog credentials (the upstream Polaris / Nessie URL, OAuth client_id/client_secret, webhook HMAC secret) are stored in the per-tenant catalog_credentials table — they are NOT environment variables. See the saas-onboarding runbook below for the provisioning workflow.

For the local-dev / single-tenant flow — provisioning a tenant, pointing Spark at the gateway, authoring a first policy — see Getting Started.

For the production multi-tenant flow — the 12-step idempotent provisioning script that creates the per-tenant schema, seeds catalog credentials, and registers the tenant in WorkOS — see the source-repo runbook:

  • runbooks/saas-onboarding.md — drives ./scripts/provision-tenant.sh end-to-end. Validation tests live at tests/integration/provisioning_test.go::TestProvisioningIdempotent.

The provisioning script is idempotent at the per-step level: re-running it after a failure resumes from the failed step rather than restarting from scratch.

Phase 1 ships two Prometheus alert rules in observability/rules/phase1-commit-rejected.yml. Both fire off the commit_rejected_total{reason} CounterVec emitted by internal/observability/metrics.go and incremented by the gateway on every rejection.

PolicyEngineUnavailableAlert (severity=page)

Section titled “PolicyEngineUnavailableAlert (severity=page)”

This is the most critical alert in Phase 1. It fires when the L1 Catalog Gateway is rejecting customer Iceberg commits with HTTP 503 fail-closed at a non-zero rate sustained one minute. Per the fail-closed contract, when the policy engine cannot prove a commit is allowed, the commit is REJECTED rather than allowed-by-default — customers writing via Spark / Trino / Snowflake see their commit fail and cannot work around it.

groups:
- name: phase1.commit_rejected
interval: 30s
rules:
- alert: PolicyEngineUnavailableAlert
expr: sum(rate(commit_rejected_total{reason="policy_engine_unavailable"}[1m])) > 0
for: 1m
labels:
severity: page
phase: "1"
component: gateway
contract: D-1.09
annotations:
summary: "Neksur L1 Catalog Gateway is rejecting commits with 503 policy-engine-unavailable (fail-closed)"
runbook_url: "https://github.com/neksur-com/neksur/blob/main/runbooks/commit-rejected-503-rate.md"

Wakes the on-call rotation within one minute of breach. Two fail-closed sub-paths both increment the same label:

  1. Policy fetch failurestore.AGEStore.LoadPoliciesForTable returned a non-nil error. Usually a DB transport error (pool exhausted, RDS failover in flight, RLS GUC unset because TenantMiddleware was not in the chain).
  2. Policy evaluation failurecel.Evaluator.Evaluate returned a non-nil error. Compile error (rare in steady state), eval error, non-bool return, or eval-time panic recovered by the defer/recover in internal/policy/cel/eval.go.

The runbook distinguishes the sub-paths via slog log messages and routes the page accordingly:

  • runbooks/commit-rejected-503-rate.md — full incident response procedure with a four-branch diagnosis tree (DB unreachable, CEL compiler uninitialised, customer policy malformed, RLS GUC missing).

Trend signal — routes to Slack #oncall-warn, does NOT wake on-call. Fires when commit_rejected_total{reason="policy_denied"} rate exceeds 10/sec sustained five minutes. Normal 403 denials are business-as-usual traffic (customer policies correctly blocking non-conforming commits); a sustained spike suggests a recent policy push tightened too aggressively, a misconfigured Spark client is hammering the gateway, or a malicious actor is probing.

- alert: PolicyDeniedSpikeAlert
expr: sum(rate(commit_rejected_total{reason="policy_denied"}[5m])) > 10
for: 5m
labels:
severity: warn
phase: "1"
component: gateway

Routing is severity-tier-driven in observability/alertmanager.yml:

route:
receiver: pagerduty # default — fail-safe if label match misses
routes:
- matchers: [severity = "page"]
receiver: pagerduty
- matchers: [severity = "warn"]
receiver: slack

severity=page alerts inject ${PAGERDUTY_INTEGRATION_KEY} (per-deployment Secrets Manager value). severity=warn alerts post to ${SLACK_WEBHOOK_URL} and land in #oncall-warn. Phase 0 sibling alerts (CypherP99LatencyBreach, ClockSkewPage, ClockSkewWarn) inherit the same routing tree.

The same /metrics endpoint also exposes:

  • commit_rejected_total{reason="policy_denied"} — the 403 counter (drives PolicyDeniedSpikeAlert).
  • cypher_duration_ms_bucket — Phase 0 histogram of every AGE Cypher query’s duration; drives the Phase 0 CypherP99LatencyBreach alert (see runbooks/cypher-latency-breach.md).
  • OTel trace spans via the OTLP gRPC exporter (localhost:4317 by default) — landed into the OTel collector for distributed tracing.

To verify the rule file is mounted into Prometheus after a deploy:

Terminal window
curl -sS http://prometheus:9090/api/v1/rules \
| jq '.data.groups[].rules[].name' \
| grep PolicyEngineUnavailableAlert
# Expected: "PolicyEngineUnavailableAlert"

If the rule is not present, infra/prometheus/prometheus.yml is not loading rules/phase1-commit-rejected.yml — re-mount the file and reload Prometheus.

Phase 1 ships with a hard performance contract: the L1 gateway adds no more than 5% P95 commit latency overhead compared to a direct write to the upstream Polaris catalog (ADR-003 §3.3). The contract is mechanically verified by scripts/ci/phase1-gateway-overhead.sh, which runs nightly via GitHub Actions (.github/workflows/phase1-overhead-nightly.yml).

The driver script:

Terminal window
go run ./tests/load/cmd/gateway_overhead \
-assert-overhead-under=0.05 \
-commit-count=1000 \
-warmup=100 \
-baseline-url="$POLARIS_BASELINE_URL" \
-gateway-url="$GATEWAY_URL" \
-bearer="$ICEBERG_REST_BEARER" \
-baseline-out=tests/load/gateway-overhead-baseline.json

A miss (overhead > 5%) exits non-zero and fires the nightly-cron alert. The wall-clock duration is posted to CloudWatch as Neksur/Phase1/GatewayOverheadDrillDurationSec regardless of pass/fail so the dashboard always reflects the latest observed drill duration.

Operators should expect this gate to run on every nightly CI invocation. A regression here is a Phase 1 acceptance contract violation; bumping iceberg-go past v0.5.0 invalidates the baseline JSON and requires a fresh nightly baseline before the bump can land. The script is at scripts/ci/phase1-gateway-overhead.sh and the baseline JSON at tests/load/gateway-overhead-baseline.json.

Production tenants run ≥ 2 neksur-server replicas behind a single ALB. The HA topology is the operational mitigation for the fail-closed semantics of the policy engine: a single-replica deployment that fail-closes pages on every commit, because there is no sibling replica to absorb traffic while the broken one is replaced. ASG sized 2 → 4.

Two endpoints — operators MUST wire the ALB target group health check to /readyz, not /healthz:

EndpointReturns 200 iff …Use for
GET /healthzpgxpool reachable AND SELECT 1 against AGE LOAD succeedsLiveness probe — ASG instance-replacement trigger
GET /readyz/healthz AND CEL compiler initialised AND detection pool startedALB target group health check

The distinction is load-bearing: when the CEL compiler env fails to initialise, the replica’s /healthz still returns 200 (Postgres is fine) but /readyz returns 503. The ALB removes the replica from rotation; ASG eventually replaces it. The sibling replica handles 100% of commit traffic without paging on fail-closed 503s.

ALB target-group health-check config:

health_check:
enabled: true
path: /readyz # NOT /healthz
protocol: HTTP
matcher: "200"
interval: 15
timeout: 5
healthy_threshold: 2
unhealthy_threshold: 2

The gateway is stateless for routing purposes — any replica can serve any commit — but two pieces of state live in-process per replica:

  1. L3 detection dedup map — a sync.Map keyed by metadata_location deduplicates scans within a single replica. A replica restart wipes the map; cross-replica dedup is enforced at the DB layer by the detection_runs.snapshot_metadata_location UNIQUE constraint, so a duplicated scan request between two replicas results in one successful insert and one ON CONFLICT DO NOTHING.
  2. CEL compiled-program cache — the cel.Compiler LRU cache keys by (policy_id, policy_text). A restart re-warms the cache on first eval of each policy.

Neither piece of state is durable; both are tolerant of restart.

The Phase 1 deployment uses Pool A — a single shared Postgres + AGE RDS cluster for tenant metadata. “Pool B” is the planned per-tenant or sharded-pool model introduced in Phase 2+ for tenants whose workload exceeds the Pool A envelope (10M nodes / 50M edges per the Phase 0 capacity contract). Phase 1 deployments are Pool A only; Pool B migration is documented in the source repo at runbooks/pool-b-migration.md and is not exercised today.

The gateway is I/O bound (most time waiting on upstream Polaris + Postgres); the dominant scaling signal is the rejection rate, NOT CPU:

MetricThresholdASG action
sum(rate(commit_rejected_total{reason="policy_engine_unavailable"}[1m])) > 0 sustained 30sscale-out 2 → 4 replicas
histogram_quantile(0.95, sum(rate(neksur_gateway_commit_duration_ms_bucket[5m])) by (le)) > 250scale-out +1 replica
Replica count < 2 for any reasonpage — manual investigation

Auto-scaling rules beyond these basic triggers (predictive scaling, target-tracking on multiple metrics) are a Phase 4+ roadmap item.

The full deploy procedure — AMI bake, ASG instance refresh with MinHealthyPercentage=50 and InstanceWarmup=60, mid-deploy /readyz verification, and rollback to the last-known-good AMI — is documented in the source repo at runbooks/gateway-deploy.md. The rolling refresh guarantees ≥ 2 replicas during the entire deploy window.

A successful deploy ends with a smoke test that:

  1. Curls /readyz across the entire fleet, expecting 200 from every replica.
  2. Posts a no-op commit to the seeded health-probe table (/v1/iceberg/test-polaris/namespaces/health/tables/probe), expecting 200 (or 403 if a deny-all health-probe policy is in place — both are healthy outcomes; 503 indicates fail-closed).
  3. Verifies commit_rejected_total{reason="policy_engine_unavailable"} is flat.
  4. Verifies the Polaris webhook route is mounted (a signature-less POST /v1/webhooks/polaris returns 401, NOT 410).

If any smoke step fails, capture journalctl -u neksur-server -n 200 on the affected replica and either re-resolve or roll back per the runbook’s §5.4.

The metadata graph is the only durable state Neksur owns. Backup and disaster-recovery procedures live in the source repo:

  • runbooks/restore-pitr.md — pgBackRest point-in-time restore. Contract: RTO 1 hour / RPO 15 minutes (unified across all metadata stores, locking ADR-001 D-001.13). Validation test: tests/dr/restore_pit.sh wrapped by TestRTO_RestorePIT in tests/dr/dr_targets_test.go under build tag dr.
  • runbooks/dr-drill.md — monthly drill that exercises the full restore + chaos-restore RPO arm. Validation test: tests/dr/run_monthly_restore_drill.sh wrapped by TestMonthlyRestoreDrill. The DR drill is the single attestation that the RTO/RPO contract still holds after any infra change.
  • runbooks/failover.md — Patroni / RDS failover procedure for the Pool A cluster.

The pgBackRest repository, WAL streaming configuration, and S3 bucket for backup storage live in the private neksur-com/infra Terraform.

Operators rotate the Phase 0.5 / Phase 1 secrets on a quarterly cadence (or ad-hoc on known compromise). The full rotation procedure for every secret — including the WorkOS API key, the Stripe webhook secret, the database master credential, and per-tenant catalog credentials — is documented in the source repo at runbooks/secret-rotation.md.

Specific secrets operators should rotate:

  • Per-tenant catalog_credentials.client_secret — the OAuth client_secret used by the gateway when forwarding commits to the upstream Polaris / Nessie. Rotated per tenant; the rotation procedure issues a new credential, updates the DB row, and verifies a smoke-test commit before invalidating the old credential.
  • Per-tenant Polaris webhook HMAC secretcatalog_credentials.config_json[webhook_secret]. Used by POST /v1/webhooks/polaris to verify incoming webhook signatures. Rotated coordinated with the upstream Polaris instance so the signature contract holds.
  • NEKSUR_SLACK_WEBHOOK_URL — the Slack incoming webhook for L3 detection findings. Replace via Slack admin UI and update the Secrets Manager entry; the gateway picks up the new value on next restart.
  • ${PAGERDUTY_INTEGRATION_KEY} in alertmanager.yml — the PagerDuty service integration key. Rotated via the PagerDuty admin UI.
  • WORKOS_API_KEY, WORKOS_CLIENT_ID, WORKOS_WEBHOOK_SECRET — the auth-layer secrets. Coordinated with WorkOS admin UI; the gateway picks up the new values on next restart.
  • STRIPE_API_KEY, STRIPE_WEBHOOK_SECRET — the billing-layer secrets, when BILLING_ENABLED=true.

The rotation runbook contains the per-secret step-by-step procedure and the validation commands operators run after each rotation.

The production AWS infrastructure lives in the private neksur-com/neksur-infra Terraform repository (Terraform ≥ 1.7, AWS provider ~> 5.86). It provisions a single-region (us-east-1) topology:

ModuleProvisions
vpc10.0.0.0/16 across 3 AZs, public + private subnets, single NAT (cost-optimized), DNS hostnames enabled (required for customer VPC peering).
rds-pool-aPostgres-on-EC2 HA cluster — Patroni-managed (etcd leader election), Graviton instances + EBS, Postgres 16 + Apache AGE + pgaudit, pgBackRest to S3 (7-day full + 35-day WAL → 35-day PITR window). Postgres-on-EC2 rather than managed RDS because AGE is a non-standard extension.
rds-pool-bDedicated standby pool for tenants whose workload exceeds the Pool A envelope; activated when the first dedicated-pool tenant arrives.
ec2-hostApplication Auto Scaling Group (Docker Compose runtime, IMDSv2 enforced) across private subnets.
alb-nlbPublic ALB (HTTPS 443) + internal NLB (for the pgwire 5432 read path); ACM cert pinning the app / admin / sql / lineage / status hostnames.
private-caSelf-signed root CA + per-customer subordinate CAs for mTLS; keys in Secrets Manager (KMS).
observabilityCloudWatch Alarms → SNS → PagerDuty bridge (Lambda) → StatusPage.
billing-secretsKMS-encrypted Secrets Manager slots (WorkOS, Stripe, PagerDuty, StatusPage, Slack), populated post-apply.
customer-peeringVPC peering (Neksur side); customers consume a shared peering module on their side.

State is in S3 (neksur-tfstate) with a DynamoDB lock table; secrets are passed via TF_VAR_* env vars, never committed to terraform.tfvars. Deploy is the standard terraform init && terraform plan -out && terraform apply, gated in CI by terraform-plan.yml (PR) and terraform-apply.yml (protected merge).

Older revisions of this guide referred to this repository as neksur-com/infra; the repository is neksur-com/neksur-infra.

Neksur runs in three modes from the same binary:

  • SaaS (AWS). The topology documented above — pooled multi-tenancy, WorkOS auth, the neksur-infra Terraform. This is the reference deployment.
  • Self-managed. Operators run neksur-server against their own Postgres 16 + AGE and their own Iceberg catalog, behind their own ingress. The HA, health-check, backup/DR, and observability guidance above applies unchanged.
  • Air-gapped / on-prem. A zero-internet deployment: no CDN dependencies in the web build, an offline signed license file (no license-server callout), OIDC-based governance authz (admin role injected from the OIDC roles claim), and a self-contained demo/bootstrap path. The build integrity gate ensures the distributed artifact has no outbound network requirements.

For the Defense-in-Depth write path, Spark applies Neksur Access transforms before writes hit object storage, via the neksur-com/neksur-spark-policy library (Scala 2.12 / Spark 3.5 / Iceberg 1.6.x / JDK 17). Two equivalent frontends call the same transform library:

Silent — Catalyst extension (intercepts all governed writes):

Terminal window
spark-submit \
--conf spark.sql.extensions=com.neksur.spark.policy.NeksurEnforcementExtension \
--conf spark.neksur.policy.endpoint='https://api.neksur.example.com/v1/policy' \
--conf spark.neksur.policy.token='<bearer_token>' \
app.jar

Explicit — SDK (opt-in per write):

import com.neksur.spark.policy.NeksurDataFrameWriter
new NeksurDataFrameWriter(df, "sales.orders").append()

Both paths are fail-closed: any failure fetching the transform plan raises a SparkException and aborts the write — no governed write proceeds without a successfully applied policy. A CI parity gate (ExtensionVsSdkParitySpec) proves the two paths produce byte-identical Parquet. Iceberg connector version drift matters: a mismatched iceberg-spark-runtime silently disables write interception, so pin the runtime jar to the supported version.

Future phases extend the deployment surface in the following directions:

  • Auto-scaling rules — Phase 4+ ships target-tracking auto-scaling on multiple Prometheus metrics (commit rate, detection-pool depth, per-tenant cardinality), replacing the simple binary scale-out triggers documented above.
  • Multi-region deployment — Phase 6+ adds a multi-region topology with cross-region RDS replication, a global ALB / Route 53 latency-based routing layer, and per-region detection pools. The Phase 1 deployment is single-region by design.
  • Managed-service offering — Phase 7+ packages Neksur Core as a hosted managed service (“Neksur Cloud”) with operator-side onboarding, multi-tenant quota enforcement at the infrastructure layer, and a published Helm chart for self-managed Kubernetes deployments. See neksur-com/neksur BSL → Apache 2.0 license transition (2030-05-10) for the long-term distribution model.

Operators evaluating Neksur today should target a single-region production deployment with the Phase 1 topology and accept that auto-scaling, multi-region, and managed-service capabilities are roadmap items.

  • Architecture — the three-layer enforcement model and the per-commit pipeline that runs inside each replica.
  • Getting Started — point Spark at a freshly deployed gateway and author your first policy.
  • REST API Reference — full endpoint surface, request/response shapes, HTTP status codes.

Deployment guide. Updated 2026-06-06 to cover the full Core surface, infrastructure-as-code, deployment modes, and the Spark integration.