Under the Hood: The Engineering Behind NOBA

2026-04-01

NOBA is an infrastructure command center — a server, a fleet of distributed agents, a mobile companion, and 94,000 lines of Python and Vue tying them together. 44 API routers, 431 endpoints, a 6-layer self-healing pipeline, a zero-dependency cross-platform agent, and implementations of WebAuthn, SAML 2.0, and SCIM 2.0 that were built from scratch. It deploys on bare metal, Docker, or as a .deb package. Agents fan out to remote sites over encrypted WebSocket. A React Native app puts alerts and healing approvals in your pocket.

This post walks through the engineering.

The healing pipeline

Self-healing isn't a marketing term here. It's a 4,700-line subsystem spanning 25 Python modules with six discrete layers:

Correlation — First-event-immediate with a 60-second absorption window. The first alert for a target fires a heal request instantly. Subsequent alerts for the same target within the window are absorbed to prevent duplicate work.
Dependency analysis — A directed acyclic graph of monitored targets. When a target fails, the pipeline walks its ancestors. If a parent is also failing, the child's healing is suppressed — you fix the root cause, not the symptoms. External nodes (things NOBA can't heal) are tracked separately.
Site isolation — When an agent goes unreachable, its entire site is marked connectivity-suspect. All healing for targets at that site is suppressed until the agent reconnects. This prevents a network blip from triggering a cascade of false restarts.
Trust governor — Five graduated trust levels: observation, dry-run, notify, approve, execute. Promotion requires 10+ successful executions, an 85%+ verified success rate, and 7+ days at the current level. A circuit breaker trips after 3 failed heals within an hour, demoting the rule back to notify.
Planning — Walks escalation chains up to 6 steps deep. Actions with less than 30% historical success rate for the same condition and target are skipped automatically.
Execution + verification — Capped at 10 concurrent heals via semaphore. After executing an action, the pipeline waits a settle time (15 seconds for a container restart, 120 seconds for a playbook), then re-evaluates the original condition against fresh metrics. If the condition clears, the heal is verified.
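The trust governor's thresholds are concrete enough to sketch. The following is a minimal illustration, not NOBA's actual code: the names Trust, RuleStats, and next_level are hypothetical, the success rate is assumed to be verified successes divided by attempts, and the breaker is assumed never to raise a rule that already sits below notify.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    OBSERVATION = 0
    DRY_RUN = 1
    NOTIFY = 2
    APPROVE = 3
    EXECUTE = 4


@dataclass
class RuleStats:
    level: Trust
    successes: int        # verified successful executions at this level
    attempts: int         # all executions at this level
    days_at_level: float


def eligible_for_promotion(stats: RuleStats) -> bool:
    """10+ successes, 85%+ verified success rate, 7+ days at the level."""
    if stats.level is Trust.EXECUTE or stats.attempts == 0:
        return False
    rate = stats.successes / stats.attempts
    return stats.successes >= 10 and rate >= 0.85 and stats.days_at_level >= 7


def breaker_tripped(failure_times, now, window=3600.0, limit=3) -> bool:
    """True when `limit` or more failed heals fall inside the last hour."""
    recent = [t for t in failure_times if 0 <= now - t <= window]
    return len(recent) >= limit


def next_level(stats: RuleStats, failure_times, now) -> Trust:
    if breaker_tripped(failure_times, now):
        # Assumption: demotion caps the level at NOTIFY, it never raises it.
        return min(stats.level, Trust.NOTIFY)
    if eligible_for_promotion(stats):
        return Trust(stats.level + 1)
    return stats.level
```

A rule with 12 verified successes out of 13 attempts after 8 days at notify would move to approve; three failures inside an hour would send an execute-level rule back to notify.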

On top of this: predictive healing evaluates capacity trends with 24-hour and 72-hour horizons. Auto-discovery detects co-failure patterns (targets that fail within 120 seconds of each other, 3+ times) and suggests dependency edges — but requires operator confirmation before acting. Chaos testing covers 12 scenarios including dependency cascade suppression, heal storm circuit breakers, and manual-fix race conditions. Pre-heal snapshots capture container state, systemd unit config, and disk usage for rollback.
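As a rough illustration of the co-failure rule (failures within 120 seconds of each other, 3 or more times), here is a small sketch. The function name suggest_edges and the (timestamp, target) input shape are assumptions made for the example. It returns undirected candidate pairs only, since choosing a direction is left to the operator.

```python
from collections import Counter


def suggest_edges(failures, window=120.0, min_count=3):
    """failures: iterable of (timestamp_seconds, target_name) tuples.

    Returns sorted (a, b) pairs whose failures landed within `window`
    seconds of each other at least `min_count` times.
    """
    events = sorted(failures)
    counts = Counter()
    for i, (t_a, a) in enumerate(events):
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are further away
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return sorted(pair for pair, n in counts.items() if n >= min_count)


history = [
    (0, "db"), (30, "api"), (40, "cache"),
    (1000, "db"), (1050, "api"),
    (2000, "db"), (2100, "api"),
]
candidates = suggest_edges(history)
```

Here only the api/db pair crosses the threshold; cache co-failed once and is ignored. A production version would likely also deduplicate repeated alerts inside one incident, which this sketch does not attempt.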

The integration registry backs this with 30 categories covering 635 handler methods across 110+ platforms. 55 remediation action types, each with defined risk levels, timeouts, settle times, and reversibility flags. 23 default escalation chains for common scenarios.

The agent

The remote agent is a single Python zipapp (.pyz) with zero external dependencies. It runs on Python 3.6+ across Linux, Windows, macOS, and FreeBSD. On Linux it reads /proc and /sys directly. The WebSocket client is a hand-built RFC 6455 implementation using only the Python standard library.
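To give a sense of what a standard-library-only RFC 6455 client involves, here is a sketch of two of its pieces: checking the server's Sec-WebSocket-Accept value and building a masked client text frame. The function names are illustrative and not taken from the agent's source.

```python
import base64
import hashlib
import os
import struct

# Fixed GUID defined by RFC 6455 for the opening handshake.
_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(client_key: str) -> str:
    """Value the server must return in Sec-WebSocket-Accept."""
    digest = hashlib.sha1((client_key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def encode_text_frame(payload: bytes, mask_key: bytes = None) -> bytes:
    """Build a single unfragmented text frame; clients must mask payloads."""
    if mask_key is None:
        mask_key = os.urandom(4)
    header = bytearray([0x81])  # FIN bit set, opcode 0x1 (text)
    n = len(payload)
    if n < 126:
        header.append(0x80 | n)          # 0x80 marks the frame as masked
    elif n < 65536:
        header.append(0x80 | 126)
        header += struct.pack("!H", n)   # 16-bit extended length
    else:
        header.append(0x80 | 127)
        header += struct.pack("!Q", n)   # 64-bit extended length
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return bytes(header) + mask_key + masked
```

Both functions reproduce the worked examples in RFC 6455: the sample nonce dGhlIHNhbXBsZSBub25jZQ== yields s3pPLMBiTxaQ9kYGzzhZRbK+xOo=, and "Hello" masked with 37 fa 21 3d yields the frame 81 85 37 fa 21 3d 7f 9f 4d 51 58.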

Remote desktop

The remote desktop implementation is 1,540 lines supporting five capture backends, tried in priority order:

GNOME/Mutter (Wayland) — Creates a persistent D-Bus session to org.gnome.Mutter.RemoteDesktop and org.gnome.Mutter.ScreenCast. Captures frames via a PipeWire/GStreamer pipeline. Input injection uses Mutter's D-Bus methods with Linux evdev keycodes (not X11). A custom binary protocol (NOBR magic header + dimensions + raw RGB) streams frames from the subprocess.
GNOME Screenshot API — D-Bus fallback for GNOME without PipeWire.
grim (wlr-screencopy) — For Sway and other wlroots compositors. Auto-discovers the WAYLAND_DISPLAY socket.
X11 via ctypes — Loads libX11.so.6 and libXtst.so.6 at runtime. Captures via XGetImage on the root window with a 4-second timeout guard (some compositors block X11 capture). BGRA-to-RGB conversion with stride handling for non-contiguous scan lines. Input injection via XTest.
Windows GDI / macOS CoreGraphics — Platform-native screen capture with byte-order correction.
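The stride handling mentioned for the X11 backend can be sketched as follows. This is an illustration under assumptions: the function name bgra_to_rgb is hypothetical, and the input is assumed to be 32-bit pixels stored as B, G, R, A bytes with each row padded to `stride` bytes.

```python
def bgra_to_rgb(buf: bytes, width: int, height: int, stride: int) -> bytes:
    """Convert padded BGRA rows to tightly packed RGB.

    `stride` is the number of bytes per scan line in `buf`, which may be
    larger than width * 4 when rows are padded.
    """
    if stride < width * 4:
        raise ValueError("stride is smaller than one row of pixels")
    if len(buf) < (height - 1) * stride + width * 4:
        raise ValueError("buffer too short for the given dimensions")
    out = bytearray(width * height * 3)
    for y in range(height):
        row = buf[y * stride: y * stride + width * 4]  # drop row padding
        o = y * width * 3
        end = o + width * 3
        out[o:end:3] = row[2::4]      # R channel
        out[o + 1:end:3] = row[1::4]  # G channel
        out[o + 2:end:3] = row[0::4]  # B channel
    return bytes(out)
```

Slice assignment keeps the per-pixel work out of the Python loop, which matters for an agent that cannot lean on NumPy.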

Keyboard input uses a 76-key lookup table mapping W3C KeyboardEvent.code (physical key positions) to X11 hardware keycodes, avoiding the layout-dependent bugs that come from using keyCode. Mouse events are coalesced on the Mutter backend — stale position events are drained from stdin, keeping only the most recent, to prevent input lag.
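A few entries show the shape such a table might take. This is a sketch, not the agent's actual 76-key table: the five entries and the function names are illustrative. It relies on the usual X.Org convention that an X11 keycode equals the Linux evdev code plus 8, so one table can serve both the X11 and Mutter backends.

```python
# KeyboardEvent.code -> Linux evdev key code (from linux/input-event-codes.h).
# Only a handful of keys are listed here for illustration.
EVDEV_BY_CODE = {
    "Escape": 1,
    "Enter": 28,
    "KeyA": 30,
    "ShiftLeft": 42,
    "Space": 57,
}

# X.Org keycodes are offset from evdev codes by 8.
X11_KEYCODE_OFFSET = 8


def evdev_keycode(code: str):
    """Keycode for backends that take evdev codes; None if unmapped."""
    return EVDEV_BY_CODE.get(code)


def x11_keycode(code: str):
    """Hardware keycode for XTest injection; None if unmapped."""
    evdev = EVDEV_BY_CODE.get(code)
    return None if evdev is None else evdev + X11_KEYCODE_OFFSET
```

Because KeyboardEvent.code names the physical key, "KeyA" maps to the same keycode whether the browser's layout is QWERTY or AZERTY; the remote machine's own layout then decides which character is produced.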

X11 auth discovery probes 10 patterns including Mutter Xwayland auth cookies, and when running as root, discovers the display owner from socket file ownership in /tmp/.X11-unix/.

Terminal and commands

Terminal sessions use real PTY allocation via pty.openpty() on Linux with xterm-256color and dynamic resize via SIGWINCH. Access is role-based: admins get a full shell, operators get a restricted shell. On Windows, operators get PowerShell in ConstrainedLanguage mode.
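The resize path can be sketched with the standard library alone. This is a Linux-only illustration with hypothetical function names. It covers PTY allocation and window-size changes, and leaves out spawning the shell and making the slave its controlling terminal.

```python
import fcntl
import pty
import struct
import termios


def set_size(fd: int, rows: int, cols: int) -> None:
    """Set the terminal window size.

    The kernel delivers SIGWINCH to the foreground process group of the
    terminal when the size changes.
    """
    winsize = struct.pack("HHHH", rows, cols, 0, 0)
    fcntl.ioctl(fd, termios.TIOCSWINSZ, winsize)


def get_size(fd: int):
    """Read back (rows, cols) from the terminal."""
    raw = fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8)
    rows, cols, _, _ = struct.unpack("HHHH", raw)
    return rows, cols


def open_terminal(rows: int = 24, cols: int = 80):
    """Allocate a PTY pair and give it an initial size."""
    master, slave = pty.openpty()
    set_size(master, rows, cols)
    return master, slave
```

When the browser terminal is resized, calling set_size on the master side with the new dimensions is enough for full-screen programs in the session to redraw.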

42 command types across 3 risk tiers cover everything from service management and log streaming to network diagnostics, package updates, system info, and process control. All cross-platform.

Self-update downloads the new agent.pyz from the server, validates it, atomically replaces via os.replace(), and restarts via systemd.
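The atomic-replace step might look roughly like this. The post does not say how validation is done, so the SHA-256 comparison and the zip-archive check below are assumptions, as is the function name install_update. Restarting through systemd is left out.

```python
import hashlib
import os
import tempfile
import zipfile


def install_update(data: bytes, expected_sha256: str, target: str) -> None:
    """Validate a downloaded agent.pyz and swap it into place atomically."""
    # Assumed validation step 1: checksum supplied by the server.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    # Write next to the target so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        # Assumed validation step 2: a .pyz must be a readable zip archive.
        if not zipfile.is_zipfile(tmp):
            raise ValueError("not a zip archive")
        os.replace(tmp, target)   # atomic: readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Writing the temporary file in the target's own directory is the important detail: os.replace is only atomic when source and destination live on the same filesystem.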

Security from scratch

WebAuthn

The WebAuthn implementation is 717 lines across the router and database layer. At its core is a hand-built CBOR decoder: 70 lines that parse all 8 CBOR major types, including float16/32/64 reinterpretation and variable-length integer encoding. This exists because the attestationObject in WebAuthn registration is CBOR-encoded, and the Python standard library does not ship a CBOR decoder, so using an existing one would mean adding a third-party dependency.
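For readers who have not seen CBOR up close, a decoder of about this size is feasible. The sketch below is an independent illustration, not NOBA's decoder. It handles definite-length items only, returns tagged values without their tag, and all names are made up for the example.

```python
import struct

_FLOATS = {25: (">e", 2), 26: (">f", 4), 27: (">d", 8)}


def decode(data: bytes):
    """Decode the first CBOR item in `data`."""
    value, _ = _item(data, 0)
    return value


def _argument(data, pos, info):
    """Read the variable-length unsigned argument that follows the head."""
    if info < 24:
        return info, pos
    if 24 <= info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4 or 8 bytes
        return int.from_bytes(data[pos:pos + size], "big"), pos + size
    raise ValueError("indefinite or reserved length not supported")


def _item(data, pos):
    head = data[pos]
    pos += 1
    major, info = head >> 5, head & 0x1F
    if major == 7:  # simple values and floats
        if info in (20, 21):
            return info == 21, pos
        if info in (22, 23):
            return None, pos
        if info in _FLOATS:
            fmt, size = _FLOATS[info]
            return struct.unpack(fmt, data[pos:pos + size])[0], pos + size
        raise ValueError("unsupported simple value")
    n, pos = _argument(data, pos, info)
    if major == 0:  # unsigned integer
        return n, pos
    if major == 1:  # negative integer
        return -1 - n, pos
    if major == 2:  # byte string
        return data[pos:pos + n], pos + n
    if major == 3:  # text string
        return data[pos:pos + n].decode("utf-8"), pos + n
    if major == 4:  # array of n items
        items = []
        for _ in range(n):
            value, pos = _item(data, pos)
            items.append(value)
        return items, pos
    if major == 5:  # map of n key/value pairs
        result = {}
        for _ in range(n):
            key, pos = _item(data, pos)
            result[key], pos = _item(data, pos)
        return result, pos
    return _item(data, pos)  # major 6: drop the tag, decode its content
```

Applied to an attestationObject, decode would yield a dict with the "fmt", "attStmt", and "authData" keys, the last of which is a byte string.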

COSE key parsing handles three key types: EC2/P-256 (ES256, the most common), RSA (RS256, Windows Hello), and OKP/Ed25519 (EdDSA, YubiKey 5+). Authenticator data is parsed byte-by-byte — RP ID hash, flags, sign count, AAGUID, credential ID, and the COSE public key. Sign count replay detection rejects assertions where the counter hasn't incremented.
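The authenticator data layout is fixed by the WebAuthn specification, so the byte-by-byte parse can be sketched directly. The function and key names below are illustrative. The COSE key is returned as raw bytes for a CBOR decoder to handle, and any extension data trailing the key is not split off.

```python
import struct

FLAG_USER_PRESENT = 0x01
FLAG_USER_VERIFIED = 0x04
FLAG_ATTESTED_DATA = 0x40


def parse_auth_data(data: bytes) -> dict:
    """Split authenticator data into its fixed and optional fields."""
    if len(data) < 37:
        raise ValueError("authenticator data too short")
    flags = data[32]
    parsed = {
        "rp_id_hash": data[:32],                       # SHA-256 of the RP ID
        "user_present": bool(flags & FLAG_USER_PRESENT),
        "user_verified": bool(flags & FLAG_USER_VERIFIED),
        "sign_count": struct.unpack(">I", data[33:37])[0],
    }
    if flags & FLAG_ATTESTED_DATA:
        if len(data) < 55:
            raise ValueError("attested credential data truncated")
        cred_len = struct.unpack(">H", data[53:55])[0]
        parsed["aaguid"] = data[37:53]
        parsed["credential_id"] = data[55:55 + cred_len]
        parsed["cose_key_bytes"] = data[55 + cred_len:]  # CBOR-encoded key
    return parsed


def counter_ok(stored: int, received: int) -> bool:
    """Replay check: the counter must strictly increase.

    The WebAuthn spec skips the check when both values are zero, which is
    how authenticators without a counter report themselves.
    """
    if stored == 0 and received == 0:
        return True
    return received > stored
```

A failed counter check suggests a replayed assertion or a cloned authenticator, which is why the assertion is rejected rather than merely logged.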

SAML 2.0

835 lines implementing the recommendations of the OWASP SAML Security Cheat Sheet. Signed AuthnRequests with RSA-SHA256. Certificate validation enforces 2-year maximum lifetime (NIST SP 800-57), minimum key sizes (RSA 2048, ECC 256), and required Key Usage extensions. SHA-1 and MD5 signature algorithms are explicitly rejected. Encrypted assertion decryption supports AES-128/256-CBC and AES-128/256-GCM with RSA-OAEP key transport. Certificate pinning prevents signing key substitution. All XML parsing uses defusedxml.

OWASP hardening

Webhook URLs are resolved, and any that point into RFC 1918 private ranges are rejected to prevent SSRF. Shell commands go through an allowlist. Agent service names are validated against a strict regex. Error messages are sanitized so that internal details never reach the client. Passwords use PBKDF2 with 600,000 iterations. Secrets are Fernet-encrypted at rest.
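A resolve-then-check guard of this kind can be sketched with the standard library. The names are hypothetical, and the sketch goes slightly beyond the text above by also refusing loopback, link-local, and other non-routable addresses, which is an assumption about what a reasonable check would cover.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_forbidden_ip(addr: str) -> bool:
    """True for private (RFC 1918), loopback, link-local and similar."""
    ip = ipaddress.ip_address(addr.split("%")[0])  # drop any IPv6 zone id
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def validate_webhook_url(url: str) -> list:
    """Resolve the host and return its addresses, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("unsupported webhook URL")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    addrs = sorted({info[4][0] for info in infos})
    for addr in addrs:
        if is_forbidden_ip(addr):
            raise ValueError("webhook resolves to a blocked address")
    return addrs
```

To stay safe against DNS rebinding, the request that follows should connect to one of the returned addresses rather than resolving the hostname a second time.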

The numbers

Subsystem	Lines
Server (Python backend)	46,500
Frontend (Vue 3 + Vite)	33,150
Tests	37,500
Agent	5,000
Shell scripts	7,160

431 API endpoints. 3,567 tests. 49 DB modules. 195 Vue components. 55 remediation actions. 42 agent command types. 12 chaos test scenarios. 6 themes. 3 database backends.

Try it

Docker

git clone https://github.com/raizenica/noba-enterprise.git
cd noba-enterprise
docker compose up -d

Bare metal

curl -fsSL https://raw.githubusercontent.com/raizenica/noba-enterprise/main/install.sh | bash

Free during the open beta.