Deployment
This is the canonical guide for deploying Neksur to production AWS. It covers the commit data path in depth — the metadata graph, the catalog gateway, OpenLineage ingestion, and the in-process detection pool — running as a stateless cmd/neksur-server binary behind an Application Load Balancer. The same binary also serves the read-path SQL proxy, pgwire, XMLA, GraphQL, and MCP surfaces; the operational concerns below (HA, fail-closed alerting, health checks, backup/DR) apply to the whole process.
For the infrastructure-as-code, deployment modes (SaaS / self-managed / air-gapped), and the Spark writer-side integration, see the new sections near the end of this guide. The target audience is the SRE / platform team operating Neksur for a single tenant or a small fleet of tenants.
Editions. Neksur is a single binary; commercial-tier coordination (multi-engine, defense-in-depth, intelligence) is unlocked by a signed license verified at startup — there is no separate build to deploy. See Editions and tiers.
Companion docs. Read Architecture before this guide — the three-layer enforcement model and the AGE graph foundation set the context for every infrastructure choice below. After install, see Getting Started for the end-to-end dev flow and the REST API Reference for the endpoint surface the gateway exposes.
Prerequisites
Section titled “Prerequisites”Operators bringing up a new Neksur deployment need the following before they start:
- AWS account with permissions to provision RDS (Postgres 16 + AGE 1.6), an Application Load Balancer, an EC2 Auto Scaling Group, ACM certificates, Secrets Manager entries, and (optionally) SNS + SQS for S3 ObjectCreated event delivery.
- Terraform CLI 1.6+ for driving the private
neksur-com/infrarepository that holds the production Terraform modules. - Iceberg catalog endpoint — a reachable Apache Polaris OR Project Nessie instance. Phase 1 ships live adapters for both; Glue and Unity Catalog adapters are stub-only and rejected at startup until Phase 3.
- OIDC identity provider — WorkOS is the Phase 0.5 / Phase 1 default;
WORKOS_API_KEY,WORKOS_CLIENT_ID, andWORKOS_WEBHOOK_SECRETmust be available in Secrets Manager before the gateway will start in SaaS mode. - Go 1.24+ on the build host if you are baking your own AMI rather than consuming a published artifact.
Optional but strongly recommended for production:
- PagerDuty integration key — populates
${PAGERDUTY_INTEGRATION_KEY}inalertmanager.ymland routesseverity=pagealerts (most importantlyPolicyEngineUnavailableAlert) to the on-call rotation. - Slack incoming webhook URL — populates
${SLACK_WEBHOOK_URL}forseverity=warnalerts (e.g.,PolicyDeniedSpikeAlert) and, separately,NEKSUR_SLACK_WEBHOOK_URLfor L3 detection findings at confidence ≥ 0.85. - AWS Secrets Manager for all of the above;
cmd/neksur-server/main.goreads them as plain env vars, and Phase 1 production deployments inject them via the Terraform layer.
A staging tenant with a single replica is acceptable for evaluation; production tenants MUST run with ≥ 2 replicas behind the ALB — see HA + scaling below for the rationale.
Reference architecture
Section titled “Reference architecture”The Phase 1 production topology is a single-region AWS deployment with a stateless gateway tier in front of a shared metadata RDS and an optional SQS event source. The diagram below is the operator’s-eye map; for the per-commit pipeline that runs inside each replica, see Architecture.
graph TD Client["Iceberg client<br/>(Spark / Trino / pyiceberg)"] ALB["AWS ALB<br/>TLS termination<br/>health check: GET /readyz"] R1["neksur-server replica 1<br/>L1 gateway + L3 pool<br/>(in-process)"] R2["neksur-server replica 2<br/>L1 gateway + L3 pool"] R3["neksur-server replica N<br/>(ASG: min 2, max 4)"] RDS["Postgres 16 + AGE 1.6<br/>(Pool A, RDS Multi-AZ<br/>for production)"] Upstream["Upstream Iceberg catalog<br/>(Polaris / Nessie)"] S3["S3 bucket<br/>(Iceberg data + metadata)"] SNS["SNS topic<br/>(S3 ObjectCreated, optional)"] SQS["SQS queue<br/>(shared, competitive consume)"] Prom["Prometheus<br/>scrape :9100/metrics"] AM["AlertManager<br/>severity-tier routing"] PD["PagerDuty<br/>severity=page"] Slack["Slack<br/>severity=warn + L3 findings"]
Client -->|HTTPS 443| ALB ALB -->|HTTP 8080| R1 ALB -->|HTTP 8080| R2 ALB -->|HTTP 8080| R3 R1 --> RDS R2 --> RDS R3 --> RDS R1 -->|forward commits| Upstream R2 -->|forward commits| Upstream R3 -->|forward commits| Upstream S3 --> SNS SNS --> SQS SQS -->|competitive consume| R1 SQS -->|competitive consume| R2 SQS -->|competitive consume| R3 R1 -.->|/metrics :9100| Prom R2 -.->|/metrics :9100| Prom R3 -.->|/metrics :9100| Prom Prom --> AM AM -->|severity=page| PD AM -->|severity=warn| SlackComponent notes:
- ALB terminates TLS on port 443 with an ACM certificate. Target group health check probes
GET /readyz(NOT/healthz— see Health-check endpoints below). RoutesPOST /v1/iceberg/*,POST /v1/lineage,POST /v1/webhooks/polaris, plus/api/*,/admin/*,/callback,/webhooks/workos, and/webhooks/stripeto the gateway target group. neksur-serverECS / EC2 replicas are stateless from the operator’s perspective for traffic routing — any replica can serve any commit. The L3 detection dedup map is in-process per replica, but cross-replica dedup is enforced at the database layer by thedetection_runs.snapshot_metadata_locationUNIQUE constraint, so a replica restart is safe. Auto-Scaling Group sized 2 → 4.- Postgres 16 + AGE 1.6 RDS is the canonical metadata store (“Pool A”). Single-AZ is acceptable for non-prod; Multi-AZ is required for production so a single-AZ outage does not stop commit traffic. Each gateway replica opens its own
pgxpoolwith theBeforeAcquire DISCARD ALLhook applied; do not add a second pool inside the gateway process. - S3 + SNS + SQS is optional and only required when the L3 detection trigger needs to catch direct-to-S3 writes that bypass the gateway. When
NEKSUR_S3_EVENTS_QUEUE_URLis unset, only the 30-second poller and the Polaris webhook fire and the SQS goroutine is not started. The SQS queue is shared across replicas and consumed competitively; the first replica to callReceiveMessage+DeleteMessagewins each message. - Upstream Iceberg catalog is your existing Polaris or Nessie deployment. Neksur does not replace it — the gateway proxies commits through to it after policy evaluation. Glue and Unity Catalog adapters are stub-only in Phase 1 and the gateway rejects configurations with
catalog_kind = "glue"or"unity"at startup. - Prometheus + AlertManager scrape each replica’s
/metricsendpoint on port:9100and apply severity-tier routing (severity=page→ PagerDuty;severity=warn→ Slack). See Observability for the alert definitions.
Production-specific infrastructure details — Terraform module names, VPC subnet layout, security group rules, IAM role bundles, DNS records — live in the private neksur-com/infra repository and are not enumerated here.
Environment variables
Section titled “Environment variables”Every variable below is read by cmd/neksur-server/main.go (the Phase 1 entry point). Production deployments source these from AWS Secrets Manager or env injection at the orchestrator layer; local dev uses .env-style overrides.
# Core toggle and observabilityNEKSUR_SAAS_AUTH: "1" # REQUIRED in prod — enables runWithSaasAuth (gateway + L3)NEKSUR_OBSERVABILITY: "1" # REQUIRED in prod — enables OTLP exporter + Prom /metricsNEKSUR_METRICS_ADDR: ":9100" # Optional override; must match prometheus.yml scrape targetNEKSUR_LISTEN_ADDR: ":8080" # Optional override for the HTTP listener
# Database (Pool A — Postgres 16 + AGE 1.6)DATABASE_URL: "postgres://neksur:...@..." # REQUIRED — pgxpool DSN; BeforeAcquire DISCARD ALL applied
# WorkOS auth (Phase 0.5 stack)WORKOS_API_KEY: "sk_live_..." # REQUIREDWORKOS_CLIENT_ID: "client_..." # REQUIREDWORKOS_WEBHOOK_SECRET: "whsec_..." # REQUIRED — verified BEFORE SCIM_ENABLED checkWORKOS_INTERNAL_ADMIN_ORG_ID: "org_..." # REQUIRED for /admin/* surface
# L3 detection pool (in-process)NEKSUR_L3_WORKERS: "4" # Optional; default 4. Raise to 8 if scan lag accumulatesNEKSUR_POLARIS_WEBHOOK_ENABLED: "1" # Optional; "0" returns 410 Gone on /v1/webhooks/polarisNEKSUR_S3_EVENTS_QUEUE_URL: "" # Optional; when set, starts the SQS long-poll goroutineNEKSUR_S3_EVENTS_TENANT_ID: "" # CONDITIONAL — required iff NEKSUR_S3_EVENTS_QUEUE_URL is setNEKSUR_SLACK_WEBHOOK_URL: "" # Optional; when unset, alerts.Slack.Post is a silent no-op
# Billing (only when BILLING_ENABLED=true)BILLING_ENABLED: "false" # Optional toggleSTRIPE_API_KEY: "" # CONDITIONAL when BILLING_ENABLED=trueSTRIPE_WEBHOOK_SECRET: "" # REQUIRED — verified BEFORE BILLING_ENABLED check
# AWS integrationAWS_REGION: "us-east-1" # REQUIRED when NEKSUR_S3_EVENTS_QUEUE_URL is setPAGERDUTY_SERVICE_ID: "P000000" # Optional; embedded in /admin/incidents UIPer-variable reference:
| Variable | Required | Default | Controls |
|---|---|---|---|
DATABASE_URL | yes | — | Pool A RDS DSN. pgxpool config receives BeforeAcquire DISCARD ALL — the load-bearing session-bleed prevention. Never spawn a second pool. |
NEKSUR_SAAS_AUTH | yes (prod) | 0 | Master switch for the SaaS auth + gateway stack. Set to 1 to mount /api/*, /v1/iceberg/*, /v1/lineage, /v1/webhooks/polaris, /admin/*. |
NEKSUR_OBSERVABILITY | yes (prod) | 0 | Enables the OTLP gRPC trace exporter (defaults to localhost:4317) and the Prometheus /metrics server. Must be 1 for the alert rules in Observability to have data. |
NEKSUR_METRICS_ADDR | no | :9100 | Listener for /metrics. Must match the neksur-graph scrape target in observability/prometheus.yml. |
NEKSUR_LISTEN_ADDR | no | :8080 | HTTP listener for the gateway + admin routes. ALB target group forwards to this port. |
WORKOS_API_KEY | yes | — | WorkOS API key (Phase 0.5). Used by workosauth.NewClient for session validation in TenantMiddleware. |
WORKOS_CLIENT_ID | yes | — | WorkOS OAuth client ID. |
WORKOS_WEBHOOK_SECRET | yes | — | HMAC secret for POST /webhooks/workos. Signature is verified BEFORE the SCIM_ENABLED check. |
WORKOS_INTERNAL_ADMIN_ORG_ID | yes | — | WorkOS Org ID for Neksur staff. /admin/* returns 403 unless the session’s org matches. |
NEKSUR_L3_WORKERS | no | 4 | Goroutine-pool size for the in-process L3 detection workers. Raise to 8 if commit traffic is healthy but the detection trigger queue is backing up. |
NEKSUR_POLARIS_WEBHOOK_ENABLED | no | 1 | Set to 0 to return 410 Gone on POST /v1/webhooks/polaris when the upstream Polaris lacks webhook signing. |
NEKSUR_S3_EVENTS_QUEUE_URL | no | unset | When set, starts the SQS long-poll goroutine that drains S3 ObjectCreated events. Format: https://sqs.<region>.amazonaws.com/<acct>/<queue>. Leave unset to rely on the 30-second poller + Polaris webhook only. |
NEKSUR_S3_EVENTS_TENANT_ID | conditional | unset | REQUIRED iff NEKSUR_S3_EVENTS_QUEUE_URL is set. Phase 1 simplification: one queue per tenant. |
NEKSUR_SLACK_WEBHOOK_URL | no | unset | When unset, the Slack client is a silent no-op so the gateway can ship before Slack is configured. Set to https://hooks.slack.com/services/... to enable L3-finding alerts at confidence ≥ 0.85. |
STRIPE_WEBHOOK_SECRET | yes when BILLING_ENABLED=true | — | Phase 0.5 — verified BEFORE the BILLING_ENABLED branch so spoofed Stripe webhooks are rejected even when billing is off. |
BILLING_ENABLED | no | false | Toggles billing.NewBilling between the real Stripe path and a no-op stub. |
STRIPE_API_KEY | conditional | — | Stripe secret key; only consumed when BILLING_ENABLED=true. |
AWS_REGION | conditional | — | REQUIRED when NEKSUR_S3_EVENTS_QUEUE_URL is set OR when CloudWatch metric posting is enabled. |
PAGERDUTY_SERVICE_ID | no | P000000 | Embedded into the /admin/incidents page as a deep-link to the PagerDuty dashboard. |
ICEBERG_REST_BEARER | conditional | — | OAuth bearer token used by scripts/ci/phase1-gateway-overhead.sh for the nightly overhead drill against both the direct-Polaris baseline and the gateway. Production runtime does not read this directly. |
Per-tenant catalog credentials (the upstream Polaris / Nessie URL, OAuth client_id/client_secret, webhook HMAC secret) are stored in the per-tenant catalog_credentials table — they are NOT environment variables. See the saas-onboarding runbook below for the provisioning workflow.
Tenant onboarding workflow
Section titled “Tenant onboarding workflow”For the local-dev / single-tenant flow — provisioning a tenant, pointing Spark at the gateway, authoring a first policy — see Getting Started.
For the production multi-tenant flow — the 12-step idempotent provisioning script that creates the per-tenant schema, seeds catalog credentials, and registers the tenant in WorkOS — see the source-repo runbook:
runbooks/saas-onboarding.md— drives./scripts/provision-tenant.shend-to-end. Validation tests live attests/integration/provisioning_test.go::TestProvisioningIdempotent.
The provisioning script is idempotent at the per-step level: re-running it after a failure resumes from the failed step rather than restarting from scratch.
Observability
Section titled “Observability”Phase 1 ships two Prometheus alert rules in observability/rules/phase1-commit-rejected.yml. Both fire off the commit_rejected_total{reason} CounterVec emitted by internal/observability/metrics.go and incremented by the gateway on every rejection.
PolicyEngineUnavailableAlert (severity=page)
Section titled “PolicyEngineUnavailableAlert (severity=page)”This is the most critical alert in Phase 1. It fires when the L1 Catalog Gateway is rejecting customer Iceberg commits with HTTP 503 fail-closed at a non-zero rate sustained one minute. Per the fail-closed contract, when the policy engine cannot prove a commit is allowed, the commit is REJECTED rather than allowed-by-default — customers writing via Spark / Trino / Snowflake see their commit fail and cannot work around it.
groups: - name: phase1.commit_rejected interval: 30s rules: - alert: PolicyEngineUnavailableAlert expr: sum(rate(commit_rejected_total{reason="policy_engine_unavailable"}[1m])) > 0 for: 1m labels: severity: page phase: "1" component: gateway contract: D-1.09 annotations: summary: "Neksur L1 Catalog Gateway is rejecting commits with 503 policy-engine-unavailable (fail-closed)" runbook_url: "https://github.com/neksur-com/neksur/blob/main/runbooks/commit-rejected-503-rate.md"Wakes the on-call rotation within one minute of breach. Two fail-closed sub-paths both increment the same label:
- Policy fetch failure —
store.AGEStore.LoadPoliciesForTablereturned a non-nil error. Usually a DB transport error (pool exhausted, RDS failover in flight, RLS GUC unset becauseTenantMiddlewarewas not in the chain). - Policy evaluation failure —
cel.Evaluator.Evaluatereturned a non-nil error. Compile error (rare in steady state), eval error, non-bool return, or eval-time panic recovered by thedefer/recoverininternal/policy/cel/eval.go.
The runbook distinguishes the sub-paths via slog log messages and routes the page accordingly:
runbooks/commit-rejected-503-rate.md— full incident response procedure with a four-branch diagnosis tree (DB unreachable, CEL compiler uninitialised, customer policy malformed, RLS GUC missing).
PolicyDeniedSpikeAlert (severity=warn)
Section titled “PolicyDeniedSpikeAlert (severity=warn)”Trend signal — routes to Slack #oncall-warn, does NOT wake on-call. Fires when commit_rejected_total{reason="policy_denied"} rate exceeds 10/sec sustained five minutes. Normal 403 denials are business-as-usual traffic (customer policies correctly blocking non-conforming commits); a sustained spike suggests a recent policy push tightened too aggressively, a misconfigured Spark client is hammering the gateway, or a malicious actor is probing.
- alert: PolicyDeniedSpikeAlert expr: sum(rate(commit_rejected_total{reason="policy_denied"}[5m])) > 10 for: 5m labels: severity: warn phase: "1" component: gatewayAlert routing
Section titled “Alert routing”Routing is severity-tier-driven in observability/alertmanager.yml:
route: receiver: pagerduty # default — fail-safe if label match misses routes: - matchers: [severity = "page"] receiver: pagerduty - matchers: [severity = "warn"] receiver: slackseverity=page alerts inject ${PAGERDUTY_INTEGRATION_KEY} (per-deployment Secrets Manager value). severity=warn alerts post to ${SLACK_WEBHOOK_URL} and land in #oncall-warn. Phase 0 sibling alerts (CypherP99LatencyBreach, ClockSkewPage, ClockSkewWarn) inherit the same routing tree.
Other Phase 1 metrics
Section titled “Other Phase 1 metrics”The same /metrics endpoint also exposes:
commit_rejected_total{reason="policy_denied"}— the 403 counter (drivesPolicyDeniedSpikeAlert).cypher_duration_ms_bucket— Phase 0 histogram of every AGE Cypher query’s duration; drives the Phase 0CypherP99LatencyBreachalert (seerunbooks/cypher-latency-breach.md).- OTel trace spans via the OTLP gRPC exporter (
localhost:4317by default) — landed into the OTel collector for distributed tracing.
To verify the rule file is mounted into Prometheus after a deploy:
curl -sS http://prometheus:9090/api/v1/rules \ | jq '.data.groups[].rules[].name' \ | grep PolicyEngineUnavailableAlert# Expected: "PolicyEngineUnavailableAlert"If the rule is not present, infra/prometheus/prometheus.yml is not loading rules/phase1-commit-rejected.yml — re-mount the file and reload Prometheus.
The ≤5% gateway overhead invariant
Section titled “The ≤5% gateway overhead invariant”Phase 1 ships with a hard performance contract: the L1 gateway adds no more than 5% P95 commit latency overhead compared to a direct write to the upstream Polaris catalog (ADR-003 §3.3). The contract is mechanically verified by scripts/ci/phase1-gateway-overhead.sh, which runs nightly via GitHub Actions (.github/workflows/phase1-overhead-nightly.yml).
The driver script:
go run ./tests/load/cmd/gateway_overhead \ -assert-overhead-under=0.05 \ -commit-count=1000 \ -warmup=100 \ -baseline-url="$POLARIS_BASELINE_URL" \ -gateway-url="$GATEWAY_URL" \ -bearer="$ICEBERG_REST_BEARER" \ -baseline-out=tests/load/gateway-overhead-baseline.jsonA miss (overhead > 5%) exits non-zero and fires the nightly-cron alert. The wall-clock duration is posted to CloudWatch as Neksur/Phase1/GatewayOverheadDrillDurationSec regardless of pass/fail so the dashboard always reflects the latest observed drill duration.
Operators should expect this gate to run on every nightly CI invocation. A regression here is a Phase 1 acceptance contract violation; bumping iceberg-go past v0.5.0 invalidates the baseline JSON and requires a fresh nightly baseline before the bump can land. The script is at scripts/ci/phase1-gateway-overhead.sh and the baseline JSON at tests/load/gateway-overhead-baseline.json.
HA + scaling
Section titled “HA + scaling”Topology
Section titled “Topology”Production tenants run ≥ 2 neksur-server replicas behind a single ALB. The HA topology is the operational mitigation for the fail-closed semantics of the policy engine: a single-replica deployment that fail-closes pages on every commit, because there is no sibling replica to absorb traffic while the broken one is replaced. ASG sized 2 → 4.
Health-check endpoints
Section titled “Health-check endpoints”Two endpoints — operators MUST wire the ALB target group health check to /readyz, not /healthz:
| Endpoint | Returns 200 iff … | Use for |
|---|---|---|
GET /healthz | pgxpool reachable AND SELECT 1 against AGE LOAD succeeds | Liveness probe — ASG instance-replacement trigger |
GET /readyz | /healthz AND CEL compiler initialised AND detection pool started | ALB target group health check |
The distinction is load-bearing: when the CEL compiler env fails to initialise, the replica’s /healthz still returns 200 (Postgres is fine) but /readyz returns 503. The ALB removes the replica from rotation; ASG eventually replaces it. The sibling replica handles 100% of commit traffic without paging on fail-closed 503s.
ALB target-group health-check config:
health_check: enabled: true path: /readyz # NOT /healthz protocol: HTTP matcher: "200" interval: 15 timeout: 5 healthy_threshold: 2 unhealthy_threshold: 2Statefulness caveats
Section titled “Statefulness caveats”The gateway is stateless for routing purposes — any replica can serve any commit — but two pieces of state live in-process per replica:
- L3 detection dedup map — a
sync.Mapkeyed bymetadata_locationdeduplicates scans within a single replica. A replica restart wipes the map; cross-replica dedup is enforced at the DB layer by thedetection_runs.snapshot_metadata_locationUNIQUE constraint, so a duplicated scan request between two replicas results in one successful insert and oneON CONFLICT DO NOTHING. - CEL compiled-program cache — the
cel.CompilerLRU cache keys by(policy_id, policy_text). A restart re-warms the cache on first eval of each policy.
Neither piece of state is durable; both are tolerant of restart.
Pool A vs Pool B
Section titled “Pool A vs Pool B”The Phase 1 deployment uses Pool A — a single shared Postgres + AGE RDS cluster for tenant metadata. “Pool B” is the planned per-tenant or sharded-pool model introduced in Phase 2+ for tenants whose workload exceeds the Pool A envelope (10M nodes / 50M edges per the Phase 0 capacity contract). Phase 1 deployments are Pool A only; Pool B migration is documented in the source repo at runbooks/pool-b-migration.md and is not exercised today.
Scaling triggers
Section titled “Scaling triggers”The gateway is I/O bound (most time waiting on upstream Polaris + Postgres); the dominant scaling signal is the rejection rate, NOT CPU:
| Metric | Threshold | ASG action |
|---|---|---|
sum(rate(commit_rejected_total{reason="policy_engine_unavailable"}[1m])) > 0 sustained 30s | scale-out 2 → 4 replicas | |
histogram_quantile(0.95, sum(rate(neksur_gateway_commit_duration_ms_bucket[5m])) by (le)) > 250 | scale-out +1 replica | |
| Replica count < 2 for any reason | page — manual investigation |
Auto-scaling rules beyond these basic triggers (predictive scaling, target-tracking on multiple metrics) are a Phase 4+ roadmap item.
Deploy procedure (rolling blue-green)
Section titled “Deploy procedure (rolling blue-green)”The full deploy procedure — AMI bake, ASG instance refresh with MinHealthyPercentage=50 and InstanceWarmup=60, mid-deploy /readyz verification, and rollback to the last-known-good AMI — is documented in the source repo at runbooks/gateway-deploy.md. The rolling refresh guarantees ≥ 2 replicas during the entire deploy window.
A successful deploy ends with a smoke test that:
- Curls
/readyzacross the entire fleet, expecting200from every replica. - Posts a no-op commit to the seeded health-probe table (
/v1/iceberg/test-polaris/namespaces/health/tables/probe), expecting200(or403if a deny-all health-probe policy is in place — both are healthy outcomes;503indicates fail-closed). - Verifies
commit_rejected_total{reason="policy_engine_unavailable"}is flat. - Verifies the Polaris webhook route is mounted (a signature-less
POST /v1/webhooks/polarisreturns401, NOT410).
If any smoke step fails, capture journalctl -u neksur-server -n 200 on the affected replica and either re-resolve or roll back per the runbook’s §5.4.
Backup, recovery, DR
Section titled “Backup, recovery, DR”The metadata graph is the only durable state Neksur owns. Backup and disaster-recovery procedures live in the source repo:
runbooks/restore-pitr.md— pgBackRest point-in-time restore. Contract: RTO 1 hour / RPO 15 minutes (unified across all metadata stores, locking ADR-001 D-001.13). Validation test:tests/dr/restore_pit.shwrapped byTestRTO_RestorePITintests/dr/dr_targets_test.gounder build tagdr.runbooks/dr-drill.md— monthly drill that exercises the full restore + chaos-restore RPO arm. Validation test:tests/dr/run_monthly_restore_drill.shwrapped byTestMonthlyRestoreDrill. The DR drill is the single attestation that the RTO/RPO contract still holds after any infra change.runbooks/failover.md— Patroni / RDS failover procedure for the Pool A cluster.
The pgBackRest repository, WAL streaming configuration, and S3 bucket for backup storage live in the private neksur-com/infra Terraform.
Secret rotation
Section titled “Secret rotation”Operators rotate the Phase 0.5 / Phase 1 secrets on a quarterly cadence (or ad-hoc on known compromise). The full rotation procedure for every secret — including the WorkOS API key, the Stripe webhook secret, the database master credential, and per-tenant catalog credentials — is documented in the source repo at runbooks/secret-rotation.md.
Specific secrets operators should rotate:
- Per-tenant
catalog_credentials.client_secret— the OAuth client_secret used by the gateway when forwarding commits to the upstream Polaris / Nessie. Rotated per tenant; the rotation procedure issues a new credential, updates the DB row, and verifies a smoke-test commit before invalidating the old credential. - Per-tenant Polaris webhook HMAC secret —
catalog_credentials.config_json[webhook_secret]. Used byPOST /v1/webhooks/polaristo verify incoming webhook signatures. Rotated coordinated with the upstream Polaris instance so the signature contract holds. NEKSUR_SLACK_WEBHOOK_URL— the Slack incoming webhook for L3 detection findings. Replace via Slack admin UI and update the Secrets Manager entry; the gateway picks up the new value on next restart.${PAGERDUTY_INTEGRATION_KEY}inalertmanager.yml— the PagerDuty service integration key. Rotated via the PagerDuty admin UI.WORKOS_API_KEY,WORKOS_CLIENT_ID,WORKOS_WEBHOOK_SECRET— the auth-layer secrets. Coordinated with WorkOS admin UI; the gateway picks up the new values on next restart.STRIPE_API_KEY,STRIPE_WEBHOOK_SECRET— the billing-layer secrets, whenBILLING_ENABLED=true.
The rotation runbook contains the per-secret step-by-step procedure and the validation commands operators run after each rotation.
Infrastructure as code (neksur-infra)
Section titled “Infrastructure as code (neksur-infra)”The production AWS infrastructure lives in the private neksur-com/neksur-infra Terraform repository (Terraform ≥ 1.7, AWS provider ~> 5.86). It provisions a single-region (us-east-1) topology:
| Module | Provisions |
|---|---|
vpc | 10.0.0.0/16 across 3 AZs, public + private subnets, single NAT (cost-optimized), DNS hostnames enabled (required for customer VPC peering). |
rds-pool-a | Postgres-on-EC2 HA cluster — Patroni-managed (etcd leader election), Graviton instances + EBS, Postgres 16 + Apache AGE + pgaudit, pgBackRest to S3 (7-day full + 35-day WAL → 35-day PITR window). Postgres-on-EC2 rather than managed RDS because AGE is a non-standard extension. |
rds-pool-b | Dedicated standby pool for tenants whose workload exceeds the Pool A envelope; activated when the first dedicated-pool tenant arrives. |
ec2-host | Application Auto Scaling Group (Docker Compose runtime, IMDSv2 enforced) across private subnets. |
alb-nlb | Public ALB (HTTPS 443) + internal NLB (for the pgwire 5432 read path); ACM cert pinning the app / admin / sql / lineage / status hostnames. |
private-ca | Self-signed root CA + per-customer subordinate CAs for mTLS; keys in Secrets Manager (KMS). |
observability | CloudWatch Alarms → SNS → PagerDuty bridge (Lambda) → StatusPage. |
billing-secrets | KMS-encrypted Secrets Manager slots (WorkOS, Stripe, PagerDuty, StatusPage, Slack), populated post-apply. |
customer-peering | VPC peering (Neksur side); customers consume a shared peering module on their side. |
State is in S3 (neksur-tfstate) with a DynamoDB lock table; secrets are passed via TF_VAR_* env vars, never committed to terraform.tfvars. Deploy is the standard terraform init && terraform plan -out && terraform apply, gated in CI by terraform-plan.yml (PR) and terraform-apply.yml (protected merge).
Older revisions of this guide referred to this repository as
neksur-com/infra; the repository isneksur-com/neksur-infra.
Deployment modes
Section titled “Deployment modes”Neksur runs in three modes from the same binary:
- SaaS (AWS). The topology documented above — pooled multi-tenancy, WorkOS auth, the
neksur-infraTerraform. This is the reference deployment. - Self-managed. Operators run
neksur-serveragainst their own Postgres 16 + AGE and their own Iceberg catalog, behind their own ingress. The HA, health-check, backup/DR, and observability guidance above applies unchanged. - Air-gapped / on-prem. A zero-internet deployment: no CDN dependencies in the web build, an offline signed license file (no license-server callout), OIDC-based governance authz (admin role injected from the OIDC
rolesclaim), and a self-contained demo/bootstrap path. The build integrity gate ensures the distributed artifact has no outbound network requirements.
Spark writer-side integration
Section titled “Spark writer-side integration”For the Defense-in-Depth write path, Spark applies Neksur Access transforms before writes hit object storage, via the neksur-com/neksur-spark-policy library (Scala 2.12 / Spark 3.5 / Iceberg 1.6.x / JDK 17). Two equivalent frontends call the same transform library:
Silent — Catalyst extension (intercepts all governed writes):
spark-submit \ --conf spark.sql.extensions=com.neksur.spark.policy.NeksurEnforcementExtension \ --conf spark.neksur.policy.endpoint='https://api.neksur.example.com/v1/policy' \ --conf spark.neksur.policy.token='<bearer_token>' \ app.jarExplicit — SDK (opt-in per write):
import com.neksur.spark.policy.NeksurDataFrameWriternew NeksurDataFrameWriter(df, "sales.orders").append()Both paths are fail-closed: any failure fetching the transform plan raises a SparkException and aborts the write — no governed write proceeds without a successfully applied policy. A CI parity gate (ExtensionVsSdkParitySpec) proves the two paths produce byte-identical Parquet. Iceberg connector version drift matters: a mismatched iceberg-spark-runtime silently disables write interception, so pin the runtime jar to the supported version.
What’s planned (roadmap)
Section titled “What’s planned (roadmap)”Future phases extend the deployment surface in the following directions:
- Auto-scaling rules — Phase 4+ ships target-tracking auto-scaling on multiple Prometheus metrics (commit rate, detection-pool depth, per-tenant cardinality), replacing the simple binary scale-out triggers documented above.
- Multi-region deployment — Phase 6+ adds a multi-region topology with cross-region RDS replication, a global ALB / Route 53 latency-based routing layer, and per-region detection pools. The Phase 1 deployment is single-region by design.
- Managed-service offering — Phase 7+ packages Neksur Core as a hosted managed service (“Neksur Cloud”) with operator-side onboarding, multi-tenant quota enforcement at the infrastructure layer, and a published Helm chart for self-managed Kubernetes deployments. See
neksur-com/neksurBSL → Apache 2.0 license transition (2030-05-10) for the long-term distribution model.
Operators evaluating Neksur today should target a single-region production deployment with the Phase 1 topology and accept that auto-scaling, multi-region, and managed-service capabilities are roadmap items.
Where to next
Section titled “Where to next”- Architecture — the three-layer enforcement model and the per-commit pipeline that runs inside each replica.
- Getting Started — point Spark at a freshly deployed gateway and author your first policy.
- REST API Reference — full endpoint surface, request/response shapes, HTTP status codes.
Deployment guide. Updated 2026-06-06 to cover the full Core surface, infrastructure-as-code, deployment modes, and the Spark integration.