Problem
I wanted a real-time view across multiple machines: CPU spikes, disk pressure, and runaway processes, all without standing up something heavyweight like Prometheus/Grafana or paying for a SaaS tool. I also wanted the system to actually do something when things went wrong, not just alert me.
What I Built
System Autopilot is a distributed monitoring platform with two main components:
- Central control plane — a Spring Boot REST API backed by PostgreSQL that receives metric snapshots, stores history, evaluates alert rules, and exposes a web dashboard.
- Device agents — lightweight Java processes running on each monitored machine that collect system metrics, POST them to the control plane, and poll for remote commands.
The agents run without any direct connection between them; all communication flows through the control plane's REST API.
Architecture
┌─────────────────────────────────────────┐
│          Central Control Plane          │
│                                         │
│   Spring Boot REST API → PostgreSQL     │
│   Alert Engine                          │
│   Web Dashboard (JS + Fetch API)        │
└──────────┬──────────────────────────────┘
           │ HTTP
    ┌──────┴──────┐
    │             │
 Agent 1       Agent N
(metrics)     (metrics)
CPU / Mem /   CPU / Mem /
Disk / Procs  Disk / Procs
Each agent runs a fixed-rate scheduler (sketched below) that:
- Collects CPU, memory, disk, and process-level metrics
- Packages them as a JSON snapshot
- POSTs the snapshot to /api/agents/{id}/snapshot
- GETs /api/agents/{id}/commands to check for pending remote actions
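Here's a minimal sketch of that loop, using java.net.http and a single-threaded scheduler. Names like AgentLoop, BASE_URL, and collectSnapshotJson() are illustrative, not the actual identifiers from the project:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AgentLoop {
    // Illustrative values; the real agent reads these from configuration.
    private static final String BASE_URL = "http://control-plane:8080/api/agents/";
    private final String agentId = "agent-1";
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(3))
            .build();

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fixed-rate tick: collect -> POST snapshot -> poll for commands.
        scheduler.scheduleAtFixedRate(this::tick, 0, 2, TimeUnit.SECONDS);
    }

    private void tick() {
        try {
            String json = collectSnapshotJson(); // CPU / mem / disk / processes as JSON

            HttpRequest post = HttpRequest.newBuilder()
                    .uri(URI.create(BASE_URL + agentId + "/snapshot"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();
            http.send(post, HttpResponse.BodyHandlers.discarding());

            HttpRequest poll = HttpRequest.newBuilder()
                    .uri(URI.create(BASE_URL + agentId + "/commands"))
                    .GET()
                    .build();
            String commands = http.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            executePending(commands);
        } catch (Exception e) {
            // Network failures are expected under the polling model; just try again next tick.
            System.err.println("tick failed: " + e.getMessage());
        }
    }

    private String collectSnapshotJson() { /* placeholder for metric collection */ return "{}"; }
    private void executePending(String commandsJson) { /* placeholder for command handling */ }

    public static void main(String[] args) {
        new AgentLoop().start();
    }
}
```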
Key Design Decisions
REST over WebSockets for agent communication. Agents poll the control plane rather than maintaining a persistent connection. This makes agents simpler (no reconnect logic) and the control plane stateless with respect to agent connections — any agent can be restarted or scaled independently.
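For a sense of what that contract looks like server-side, here's a hypothetical Spring controller. The endpoint shapes match the ones above, but the in-memory command queue is a stand-in for whatever the real control plane persists to PostgreSQL:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/agents")
public class AgentController {

    // No per-agent connection state: commands simply wait in a queue until
    // the agent's next poll, so the server stays stateless w.r.t. connections.
    private final Map<String, Queue<String>> pendingCommands = new ConcurrentHashMap<>();

    @PostMapping("/{id}/snapshot")
    public void ingestSnapshot(@PathVariable String id, @RequestBody String snapshotJson) {
        // Persist the snapshot and evaluate alert rules (omitted here).
    }

    @GetMapping("/{id}/commands")
    public List<String> pollCommands(@PathVariable String id) {
        // Drain whatever was queued since the agent's last poll.
        Queue<String> queue = pendingCommands.getOrDefault(id, new ConcurrentLinkedQueue<>());
        List<String> drained = new ArrayList<>();
        String cmd;
        while ((cmd = queue.poll()) != null) {
            drained.add(cmd);
        }
        return drained;
    }
}
```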
PostgreSQL with time-series indexing. Metric history is stored in PostgreSQL rather than a dedicated TSDB. At the scale I'm targeting (~300 metrics/min across 10 agents), a well-indexed relational table is fast enough that a separate time-series database isn't worth the operational overhead. The schema is partitioned on agent_id and collected_at for efficient range scans.
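As a sketch of that layout, here's roughly what the table could look like as a JPA entity. The mapping and column names are assumptions for illustration; the point is the composite (agent_id, collected_at) access path:

```java
import java.time.Instant;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;

// Hypothetical mapping: the composite index on (agent_id, collected_at) is
// what keeps per-agent time-range scans cheap without a dedicated TSDB.
@Entity
@Table(name = "metric_snapshots",
       indexes = @Index(name = "idx_agent_collected",
                        columnList = "agent_id, collected_at"))
public class MetricSnapshotEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "agent_id", nullable = false)
    private String agentId;

    @Column(name = "collected_at", nullable = false)
    private Instant collectedAt;

    @Column(name = "cpu_percent")
    private Double cpuPercent;

    @Column(name = "mem_used_mb")
    private Double memUsedMb;

    @Column(name = "disk_used_pct")
    private Double diskUsedPct;

    protected MetricSnapshotEntity() { } // required by JPA
}
```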
Autonomous process management. Agents maintain a configurable blocklist of process names. When a blocked process is detected, or a process exceeds a CPU/memory threshold, the agent can autonomously send a SIGTERM without waiting for the control plane to issue a command. This keeps response latency sub-second for critical events even if the network is degraded.
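A simplified version of that enforcement pass, built on Java's ProcessHandle API. The blocklist contents and the overThreshold() helper are made up for illustration; ProcessHandle.destroy() requests graceful termination, which maps to SIGTERM on POSIX systems:

```java
import java.util.Set;

public class ProcessEnforcer {

    // Illustrative blocklist; the real agent loads this from configuration.
    private final Set<String> blocklist = Set.of("cryptominer", "stress-ng");

    public void enforce() {
        ProcessHandle.allProcesses().forEach(proc -> {
            String cmd = proc.info().command().orElse("");
            boolean blocked = blocklist.stream().anyMatch(cmd::contains);
            if (blocked || overThreshold(proc)) {
                // Local decision: no round trip to the control plane,
                // which keeps response latency sub-second.
                proc.destroy();
            }
        });
    }

    private boolean overThreshold(ProcessHandle proc) {
        // Placeholder: CPU/memory sampling would come from the agent's collectors.
        return false;
    }
}
```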
Results & Metrics
- ~300 metrics/min ingested across 10 agents during sustained testing
- ~0.5s ingest latency from collection on the agent to storage in PostgreSQL
- Full REST API coverage: metrics history, alert configuration, remote commands, agent registration
- Web dashboard with per-agent sparklines, alert log, and manual command execution
Challenges
Schema evolution without downtime. Adding new metric types (disk I/O, network counters) required schema migrations while agents were actively writing. I solved this with additive-only migrations and nullable columns, so older agent versions keep working against the new schema.
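As an illustration (not the project's actual migration), an additive change of this kind boils down to DDL like the following, where the new columns are nullable so inserts from older agents keep succeeding:

```java
// Hypothetical additive migration, shown as the SQL a migration script might
// contain. New columns are nullable (PostgreSQL's default), so rows written
// by older agents, which never send these fields, remain valid.
public final class AddDiskAndNetworkCounters {
    static final String SQL = """
        ALTER TABLE metric_snapshots ADD COLUMN disk_read_kb double precision;
        ALTER TABLE metric_snapshots ADD COLUMN net_rx_bytes bigint;
        """;
}
```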
Agent heartbeat vs. metric freshness. The control plane needs to distinguish between an agent that's healthy but quiet and one that's genuinely offline. I implemented a separate heartbeat endpoint that agents call every 5 seconds, independent of the metric collection interval. Alerts fire if the last heartbeat is older than 15 seconds.
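A minimal sketch of how that separation could be wired in Spring. The 5-second heartbeat and 15-second cutoff match the numbers above; class and method names are illustrative:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Requires @EnableScheduling on the application class for @Scheduled to run.
@RestController
@RequestMapping("/api/agents")
public class HeartbeatController {

    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    // Agents call this every 5 seconds, independent of metric collection.
    @PostMapping("/{id}/heartbeat")
    public void heartbeat(@PathVariable String id) {
        lastSeen.put(id, Instant.now());
    }

    // Liveness sweep: an agent is offline if its last heartbeat is > 15s old.
    @Scheduled(fixedRate = 5000)
    public void checkLiveness() {
        Instant cutoff = Instant.now().minus(Duration.ofSeconds(15));
        lastSeen.forEach((id, seen) -> {
            if (seen.isBefore(cutoff)) {
                raiseOfflineAlert(id);
            }
        });
    }

    private void raiseOfflineAlert(String agentId) {
        // Placeholder: the real alert engine would record and notify here.
        System.err.println("agent offline: " + agentId);
    }
}
```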
Lessons Learned
Building your own monitoring system is a great way to develop opinions about schema design and API ergonomics. The biggest insight was how much work the polling model saves — it eliminates a whole class of connection management bugs at the cost of some latency that's acceptable for this use case.
What's Next
- API key + JWT authentication for the control plane
- HTTPS everywhere (currently HTTP for development)
- Role-based access control (read-only vs. admin)
- Docker Compose deployment for one-command setup
- Metric aggregation to prevent unbounded storage growth