
VM Monitor

Infrastructure Monitoring Platform

[Screenshot: VM Monitor dashboard]

Overview

Infrastructure monitoring platform for managing Oracle Cloud VMs without SSH. Three components: Go agent on each VM, control plane API for polling and storage, Next.js dashboard for operations.

Needed to manage multiple VMs running MySpendo, Weather Insight, and other apps. SSH workflow was tedious: check status, read logs, edit .env, restart. Multiply that across two VMs with 3-4 apps each. Built this to do all of that from the browser.

3 components · Real-time SSE streaming · Demo mode included

Architecture

[Diagram: VM Monitor architecture]

Browser hits Next.js API routes that proxy to control plane. API key stays server-side. Never in browser bundle.

Control plane polls agents every 30 seconds over private Oracle VCN network. Agents listen on localhost:9000. Only accessible to control plane's private IP. No public exposure.

Agents are static Go binaries. Zero runtime dependencies. They read systemd status, Docker container state, journald logs, and CPU/memory from /proc, and parse .env files.

Control plane stores history in PostgreSQL. Fires webhook alerts on crashes. Proxies requests to agents. Tracks 30-day uptime per app.

Tech Stack

Agent

Go 1.24 · chi router · Static binary

Control Plane

Go 1.24 · chi router · pgx

Database

Neon PostgreSQL · JSONB config storage

Dashboard

Next.js 16 · React 19 · TypeScript · Tailwind v4

CI/CD

GitHub Actions · Automated releases · Multi-arch builds

Deployment

Oracle Cloud (2 VMs) · Vercel · systemd services

Technical Challenges

SSE Streaming with Demo Mode

Problem

Log streaming uses Server-Sent Events from journalctl -f. SSE needs long-lived connections to real agents. Demo mode has no agents. Cannot stream fake data over SSE.

Solution

Return 204 in demo mode. EventSource errors immediately. Log viewer detects failure and falls back to HTTP polling (5s interval). Polling hits /logs route that returns demo data. No client changes needed.

Impact

Demo mode runs with zero infrastructure. Users browse a live demo, and SSE and polling look identical in the UI.

Atomic .env File Writes

Problem

Editing .env on running services is risky. Partial write or crash mid-update corrupts config. App breaks on next restart.

Solution

Write sequence: backup original → write to .env.tmp → mv .env.tmp .env. The mv is atomic (rename syscall on same filesystem). No partial write window. Crash leaves original .env untouched.

Impact

Zero partial writes. Config changes safe. Backup always available if something breaks.

Status Transition Detection

Problem

Poller fetched status from the agent, called UpdateStatus (writing the DB), then fetched the app to check for changes. By then the old status was already overwritten, so old vs new could not be compared. That comparison drives alerts and uptime history.

Solution

Fetch app BEFORE UpdateStatus. Capture oldStatus. Compare oldStatus != newStatus after update. Fire webhook only on transition. Write status_history row (close old, open new).
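The fetch-before-update ordering can be sketched like this; `pollOnce`, `statusStore`, and `memStore` are illustrative names, and the real control plane talks to Postgres via pgx rather than a map:

```go
package main

import "fmt"

// statusStore is the minimal store surface the poller needs
// (hypothetical interface for illustration).
type statusStore interface {
	GetStatus(appID int) (string, error)
	UpdateStatus(appID int, status string) error
}

// pollOnce writes the freshly polled status and reports whether it was
// a transition. The old status is read BEFORE UpdateStatus so the two
// can be compared; the webhook fires only when they differ.
func pollOnce(s statusStore, appID int, newStatus string, alert func(from, to string)) (bool, error) {
	oldStatus, err := s.GetStatus(appID)
	if err != nil {
		return false, err
	}
	if err := s.UpdateStatus(appID, newStatus); err != nil {
		return false, err
	}
	if oldStatus == newStatus {
		return false, nil // steady state: no alert, no history row
	}
	alert(oldStatus, newStatus)
	// here the real poller also closes the open status_history row
	// and inserts a new one for newStatus
	return true, nil
}

// memStore is an in-memory stand-in for the database.
type memStore struct{ status map[int]string }

func (m *memStore) GetStatus(id int) (string, error)    { return m.status[id], nil }
func (m *memStore) UpdateStatus(id int, s string) error { m.status[id] = s; return nil }

func main() {
	s := &memStore{status: map[int]string{1: "running"}}
	alert := func(from, to string) { fmt.Printf("webhook: %s -> %s\n", from, to) }

	changed, _ := pollOnce(s, 1, "crashed", alert) // transition: webhook fires once
	fmt.Println(changed)                           // true
	changed, _ = pollOnce(s, 1, "crashed", alert) // same status: silent
	fmt.Println(changed)                          // false
}
```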

Impact

Alerts fire once per status change. Uptime history accurate. No duplicate notifications.

Key Features

Real-time status polling every 30 seconds
Live log streaming via SSE with polling fallback
Environment editor with diff preview and atomic writes
One-click restart with confirmation
Deploy via git pull from dashboard
Webhook alerts on crash/recovery (Slack + generic JSON)
Auto-restart with flap protection (max 3 per 10 minutes)
CPU and memory metrics per app
VM-level system metrics (memory, load, disk, uptime)
30-day uptime history with incident timeline
Full audit log (env changes, restarts, deploys)
Demo mode for public sharing
One-liner agent installer
Multi-arch releases (amd64/arm64)

What I Learned

Go Static Binaries: Zero runtime dependencies. Just scp and run. No Node, Python, or JVM. Learned atomic file operations (rename, temp files, backups).

SSE vs WebSocket: SSE simpler for one-way streams. journalctl -f pipes perfectly. Always need fallback (HTTP polling) for edge cases.

Next.js API Proxy Pattern: Keeps API keys server-side. Never in browser bundle. Makes demo mode easy to guard in one place.

Private Networking: Agents don't need public IPs. Oracle VCN handles private routing. Control plane uses 10.0.0.x addresses. Only control plane needs public endpoint with TLS.

JSONB Flexibility: App config stored as JSONB. Added new features (deploy_dir, auto_restart) without migrations. Just new JSON keys. Structs sync by convention.

What I'd Do Differently: Integration tests for poller and agent handlers. Job queue like Asynq instead of goroutines. Prometheus metrics. Mobile app instead of mobile web.