Automated System Health Monitoring with n8n — Save Reports to S3 and Alert on Slack
In this guide you’ll learn how to build a lightweight, reliable system health monitoring workflow in n8n that:
- Runs periodic health checks
- Collects system metrics and service status
- Stores full JSON snapshots in S3
- Sends Slack alerts for normal and critical states
Why this approach?
This pattern is ideal for small teams and self-hosted infrastructure where you want a simple, auditable trail of system snapshots plus fast Slack notifications when something goes wrong. Offloading full reports to S3 keeps Slack messages concise while preserving rich data for forensic analysis or compliance.
Overview of the n8n workflow
The workflow (visualized above) follows a straightforward pipeline:
- Schedule trigger — run every 5 minutes.
- Collect metrics — Node.js function that gathers CPU, memory, disk, and service state.
- Convert to binary — prepare the JSON snapshot as a base64-encoded file to upload.
- Upload to S3 — write the snapshot to an S3 bucket with tags and a timestamped filename.
- Check for alerts — if alerts exist, prepare a critical Slack message; otherwise, send a normal notification.
- Backup & error handling — store problematic snapshots in an alerts bucket and notify on-call when uploads fail.
Key nodes explained
1. schedule-health-check
Frequency: every 5 minutes (configurable). This node triggers the workflow on a set interval; adjust the interval to balance how much visibility you need against storage cost and API rate limits.
2. collect-system-metrics (Function)
This node runs JavaScript (Node.js) and uses the os module to gather host details and memory info. The provided template simulates CPU and disk usage; in production you can replace those simulations with calls to system utilities or APIs (e.g., iostat, vmstat, or an agent).
```js
// Simplified example for the n8n Function node
const os = require('os');

const timestamp = new Date().toISOString();
const cpuUsage = Math.round(Math.random() * 100); // simulated — replace with a real measurement
const memoryTotal = os.totalmem();
const memoryFree = os.freemem();
const memoryUsage = Math.round(((memoryTotal - memoryFree) / memoryTotal) * 100);

const healthData = {
  timestamp,
  hostname: os.hostname(),
  metrics: { cpuUsage, memoryUsage /* plus disk */ },
  services: [ /* { name, status } entries */ ],
  alerts: []
};

// Threshold checks -> push to alerts
if (cpuUsage > 90) healthData.alerts.push(`CPU usage critical: ${cpuUsage}%`);
if (memoryUsage > 90) healthData.alerts.push(`Memory usage critical: ${memoryUsage}%`);

return [{ json: healthData }];
```
3. convert-to-binary
n8n requires binary data to upload files to S3. This Function node base64-encodes the JSON snapshot and sets a proper fileName and mimeType (application/json).
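A minimal sketch of that conversion, assuming the legacy Function node API (the items array), a binary property named data, and a hypothetical filename scheme — adjust all three to match your workflow:

```js
// Convert the JSON snapshot into a binary file the S3 node can upload
const health = items[0].json;
const content = JSON.stringify(health, null, 2);

// Timestamped filename, e.g. health-my-host-2024-01-01T00-00-00-000Z.json (naming scheme is an assumption)
const fileName = `health-${health.hostname}-${health.timestamp.replace(/[:.]/g, '-')}.json`;

return [{
  json: health,
  binary: {
    data: {
      data: Buffer.from(content).toString('base64'),
      mimeType: 'application/json',
      fileName
    }
  }
}];
```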
4. upload-to-s3
Upload the snapshot to a production S3 bucket (example name: system-health-snapshots). Use tags for environment and type so lifecycle rules and searches are easy. Set up an IAM user or role in n8n with a least-privilege policy limited to the target buckets.
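If you prefer to derive the object key and tag values upstream of the S3 node, a small Function node can do it and the S3 node can reference the fields via expressions. The key layout and tag names below are assumptions; adapt them to your bucket conventions:

```js
// Hypothetical key/tag scheme for the S3 upload
const health = items[0].json;

return [{
  json: {
    ...health,
    s3Key: `snapshots/${health.hostname}/${health.timestamp}.json`, // timestamped, grouped by host
    s3Tags: 'environment=production&type=health-snapshot'           // URL-encoded tag string (as the S3 x-amz-tagging header expects)
  },
  binary: items[0].binary // keep the file prepared in the previous node
}];
```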
5. check-for-alerts
A simple If node branches on whether any alerts were accumulated. When alerts exist, the workflow prepares a more urgent Slack message (red attachment), and when everything is nominal it sends a green “All systems nominal” message to a logging channel.
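The branch condition itself can be a single n8n expression on the alerts array coming from the metrics node (a sketch; adjust the field path if your data is shaped differently):

```
{{ $json.alerts.length > 0 }}
```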
Slack message design
Keep Slack messages short and actionable. The workflow uses attachments with:
- Title with hostname and timestamp
- One-line metrics summary (CPU / Memory / Disk)
- Service states list
- Bullet list of critical alerts (if any)
- Link to full JSON report in S3
Example structure produced by the function node:
```js
{
  text: '🚨 Critical System Health Alert',
  attachments: [
    {
      color: 'danger',
      title: 'Health Check - my-host',
      text: '*Metrics:*\nCPU: 92% ...',
      footer: 'System Health Monitor'
    }
  ]
}
```
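For the nominal branch, a matching payload might look like the sketch below; the title_link URL pattern is an assumption, so substitute the object URL (or a console link) produced by your upload step:

```js
{
  text: '✅ All systems nominal',
  attachments: [
    {
      color: 'good',
      title: 'Health Check - my-host',
      // Link to the full snapshot in S3 (URL pattern is an assumption)
      title_link: 'https://system-health-snapshots.s3.amazonaws.com/snapshots/my-host/2024-01-01T00:00:00.000Z.json',
      text: '*Metrics:*\nCPU: 34% | Memory: 58% | Disk: 41%',
      footer: 'System Health Monitor'
    }
  ]
}
```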
Storage, retention, and cost considerations
- Apply S3 lifecycle policies: move snapshots older than 30–90 days to Glacier or delete them after the retention period (see the example rule after this list).
- Tag snapshots by environment and type to support analytics and billing.
- Limit frequency and sample size to balance observability with storage cost.
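For the first bullet, a lifecycle rule of that shape can be applied with aws s3api put-bucket-lifecycle-configuration; the prefix, rule ID, and day counts below are assumptions to adapt:

```json
{
  "Rules": [
    {
      "ID": "health-snapshot-retention",
      "Filter": { "Prefix": "snapshots/" },
      "Status": "Enabled",
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 90 }
    }
  ]
}
```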
Security and IAM best practices
- Create an IAM user or role for n8n with scoped S3 permissions (s3:PutObject and s3:PutObjectTagging on the specific buckets); see the example policy after this list.
- Store Slack tokens and AWS credentials in n8n credentials store (not plaintext in nodes).
- Restrict S3 bucket policies to accept uploads only from known VPC endpoints or IPs if possible.
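A least-privilege policy along these lines is a reasonable starting point; the snapshot bucket name comes from this guide's example, the alerts bucket name is an assumption, so replace both with your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:PutObjectTagging"],
      "Resource": [
        "arn:aws:s3:::system-health-snapshots/*",
        "arn:aws:s3:::system-health-alerts/*"
      ]
    }
  ]
}
```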
Resilience and error handling
The template already includes a backup-to-alerts-bucket and on-call notification if something in the upload branch fails. Consider adding:
- Retries with exponential backoff for S3/Slack failures
- Deduplication logic to avoid alert storms (e.g., only alert once per incident window; see the sketch after this list)
- Rate limiting and aggregation for Slack notifications
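One way to implement the dedupe idea is with n8n's workflow static data inside a Function node. A minimal sketch, assuming a 30-minute incident window per host (note that static data only persists for active, scheduled executions, not manual test runs):

```js
// Hypothetical dedupe guard: pass alerts through at most once per 30-minute window per host
const WINDOW_MS = 30 * 60 * 1000;
const staticData = getWorkflowStaticData('global');
const health = items[0].json;
const now = Date.now();

const lastAlertAt = staticData[`lastAlert_${health.hostname}`] || 0;
const shouldNotify = health.alerts.length > 0 && now - lastAlertAt > WINDOW_MS;

if (shouldNotify) {
  staticData[`lastAlert_${health.hostname}`] = now; // start a new incident window
}

return [{ json: { ...health, shouldNotify } }];
```

Downstream, branch on shouldNotify instead of alerts.length so repeated checks inside the window stay quiet.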
How to customize and extend
Ideas for productionizing this workflow:
- Replace simulated CPU/disk values with real measurements (a shell-execution node calling top or df, or a small agent that exposes a JSON endpoint); see the load-average sketch after this list.
- Integrate with PagerDuty or Opsgenie for incident escalation.
- Aggregate metrics to a TSDB like Prometheus and use n8n only for snapshotting and alert fanout.
- Add historical alerts dashboard by querying S3 snapshots or feeding events to an ELK stack.
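For the first idea, if you'd rather not shell out at all, the 1-minute load average gives a rough CPU signal straight from the Function node (a sketch; Linux/macOS only, and it approximates saturation rather than exact per-core utilization):

```js
// Rough CPU approximation from the 1-minute load average, normalized by core count
const os = require('os');

const load1m = os.loadavg()[0];          // returns [0, 0, 0] on Windows
const cores = os.cpus().length || 1;
const cpuApprox = Math.min(100, Math.round((load1m / cores) * 100));

return [{ json: { cpuApprox } }];
```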
Security checklist before enabling
- Verify IAM policies only allow intended buckets and keys.
- Ensure sensitive fields are not leaked in Slack messages (link to the S3 report instead of pasting raw data).
- Confirm Slack channel permissions and token scopes.
Wrap-up and next steps
This n8n workflow gives you a lightweight, auditable system health pipeline: regular snapshots to S3 for long-term analysis and concise Slack alerts for fast remediation. It’s an excellent foundation that you can extend to meet enterprise needs.
Call to action: Clone this workflow into your n8n instance, update the AWS and Slack credentials, set your desired thresholds, and enable the schedule. If you’d like, export your current workflow and I can suggest specific improvements (alert dedupe, real CPU measurement, or PagerDuty integration).
Want help customizing thresholds or adding Prometheus integration? Reply with your environment details and I’ll provide step-by-step changes.