This document describes the monitoring and observability strategy for the homelab. It covers real-time insights, historical analysis, and proactive alerting across all infrastructure and services.
┌─────────────────────────────────────────────────────────────┐
│                   Monitoring Architecture                   │
│                                                             │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │
│    │  lucille4   │    │  lucille5   │    │    nas02    │    │
│    │   (Agent)   │    │   (Agent)   │    │   (Agent)   │    │
│    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    │
│           │                  │                  │           │
│           └──────────────────┼──────────────────┘           │
│                              │                              │
│                       ┌──────▼──────┐                       │
│                       │ loose-seal  │                       │
│                       │ (Hub/Logs)  │                       │
│                       │             │                       │
│                       │ • Beszel    │                       │
│                       │ • Seq       │                       │
│                       │ • Grafana   │                       │
│                       └─────────────┘                       │
└─────────────────────────────────────────────────────────────┘
Purpose (Beszel): Real-time system monitoring and alerting
## System Resources
- CPU usage (%)
- Memory usage (GB/%)
- Disk usage (GB/%)
- Network I/O (MB/s)
- Load average
## Container Health
- Running containers
- Container resource usage
- Service availability
- Docker daemon status
## Custom Metrics
- Temperature monitoring
- Power consumption
- Service-specific metrics
Purpose (Seq): Centralized structured logging and analysis
## Application Logs
- Docker containers (via GELF driver)
- Nginx/Caddy access logs
- Application-specific logs
- System journal (systemd)
## Security Logs
- Authentication events
- Failed login attempts
- Admin actions
- Certificate renewals
## Infrastructure Logs
- Ansible playbook execution
- Backup job status
- Update installations
- Configuration changes
-- Failed login attempts in last hour
@Timestamp > Now() - 1h and @Level = 'Warning' and @Message like '%failed%'
-- High memory usage events
@Fields.memory_percent > 90 and @Timestamp > Now() - 24h
-- Container restart events
@Message like '%container%' and @Message like '%restart%'
-- Certificate expiration warnings
@Level = 'Warning' and @Message like '%certificate%' and @Message like '%expir%'
Purpose (Grafana): Advanced visualization and historical analysis
## Infrastructure Health
- System Uptime: >99.5% monthly
- Service Availability: >99% for critical services
- Response Time: <500ms for web applications
- Disk Usage: <80% on all systems
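
To make the uptime target concrete, the sketch below converts an availability percentage into a downtime budget. The function name `downtime_budget` and the 30-day month are assumptions chosen for illustration, not part of the existing tooling.

```python
from datetime import timedelta


def downtime_budget(target_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return period * (1 - target_percent / 100)


# Assumption: a 30-day month.
month = timedelta(days=30)
print("99.5% monthly uptime allows:", downtime_budget(99.5, month))
print("99.0% monthly availability allows:", downtime_budget(99.0, month))
```

Under these assumptions, the 99.5% monthly target leaves roughly 3 hours 36 minutes of downtime per month, and the 99% target for critical services leaves roughly 7 hours 12 minutes.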
## Resource Utilization
- CPU Usage: <70% average
- Memory Usage: <80% average
- Network Utilization: <50% of available bandwidth
- Storage Growth: <5GB/month increase
## Service Performance
- Jellyfin Streaming: 0% buffering events
- Home Assistant Response: <200ms
- Authentication Speed: <1s login time
- Backup Success Rate: 100% completion
## Service Accessibility
- WiFi Connectivity: 100% home coverage
- Service Response: <3s page load times
- Mobile App Performance: <2s startup
- Voice Assistant Response: <1s
## Content and Media
- Media Library Growth: Monthly additions tracked
- Recipe Database: Usage and additions
- Document Processing: OCR accuracy >95%
- Smart Home Response: <500ms automation triggers
## Home Assistant Mobile App
- Service outages affecting daily use
- Security events (door/camera alerts)
- Internet connectivity issues
- Smart home automation failures
## Discord Family Server
- Weekly system health summary
- New service deployments
- Planned maintenance notifications
- Achievement milestones (uptime, etc.)
## Email Alerts
- Critical system failures
- Security incidents
- Backup failures
- Certificate expiration warnings
## SMS/Text (High Priority)
- Complete system outages
- Security breaches
- Data loss events
- Infrastructure failures
## Slack/Discord (Technical)
- Resource utilization warnings
- Performance degradation
- Update notifications
- Automation results
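
The channel lists above amount to a routing table from event category to destination. A minimal sketch of that idea follows; the category keys and the fallback to the technical channel are hypothetical choices for illustration, and no real notification API is called.

```python
from enum import Enum


class Channel(Enum):
    MOBILE_APP = "Home Assistant mobile app"
    FAMILY_DISCORD = "Discord family server"
    EMAIL = "Email"
    SMS = "SMS/Text"
    TECHNICAL_CHAT = "Slack/Discord (technical)"


# Hypothetical category names; each maps to the channel that lists it above.
ROUTES = {
    "service_outage": [Channel.MOBILE_APP],
    "security_event": [Channel.MOBILE_APP],
    "weekly_summary": [Channel.FAMILY_DISCORD],
    "planned_maintenance": [Channel.FAMILY_DISCORD],
    "backup_failure": [Channel.EMAIL],
    "certificate_expiring": [Channel.EMAIL],
    "complete_outage": [Channel.SMS],
    "data_loss": [Channel.SMS],
    "resource_warning": [Channel.TECHNICAL_CHAT],
    "update_notification": [Channel.TECHNICAL_CHAT],
}


def channels_for(category: str) -> list:
    """Return the channels for a category.

    Unknown categories fall back to the technical chat (an assumption of this sketch).
    """
    return ROUTES.get(category, [Channel.TECHNICAL_CHAT])
```

Keeping the mapping in one table makes it easy to review which events can reach the family-facing channels and which stay technical.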
## System Resource Alerts
- CPU > 80% for 5 minutes
- Memory > 90% for 3 minutes
- Disk > 90% usage
- Service unavailable for 1 minute
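
Rules such as "CPU > 80% for 5 minutes" only fire when the breach is sustained. The sketch below shows one way to evaluate that condition over a series of samples; the `Rule` type and `is_firing` function are illustrative names, not Beszel's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float   # percent
    hold_seconds: int  # how long the breach must persist before alerting


def is_firing(rule: Rule, samples: Iterable[Tuple[float, float]]) -> bool:
    """Return True if the value has stayed above the threshold for at least
    `hold_seconds`, measured up to the most recent sample.

    `samples` are (unix_timestamp, value) pairs in ascending time order.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any dip below the threshold resets the timer
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= rule.hold_seconds


cpu_rule = Rule(metric="cpu", threshold=80.0, hold_seconds=5 * 60)
memory_rule = Rule(metric="memory", threshold=90.0, hold_seconds=3 * 60)
disk_rule = Rule(metric="disk", threshold=90.0, hold_seconds=0)
```

A hold time of zero, as in `disk_rule`, fires on the first sample above the threshold, which matches the "Disk > 90% usage" rule having no duration.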
## Network Alerts
- High network errors (>5% packet loss)
- Unusual bandwidth usage (>80% capacity)
- Connection timeout to external services
-- Critical Error Rate
@Level = 'Error' and @Timestamp > Now() - 5m
group by @Source having count(*) > 10
-- Authentication Failures
@Message like '%authentication%' and @Message like '%failed%'
and @Timestamp > Now() - 1h
group by @Fields.source_ip having count(*) > 5
-- Service Restart Detection
@Message like '%container%' and (@Message like '%restart%' or @Message like '%stopped%')
and @Timestamp > Now() - 5m
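
The grouping logic in the authentication-failure query can be checked offline against exported events. The sketch below applies the same filter, grouping, and count threshold in plain Python; the event dictionary keys (`timestamp`, `message`, `source_ip`) are assumptions about the export shape, and the case-insensitive match is a choice of this sketch rather than a statement about Seq's `like` semantics.

```python
from collections import Counter
from datetime import datetime, timedelta


def noisy_sources(events, now, window=timedelta(hours=1), limit=5):
    """Count authentication failures per source IP inside the window and
    return only the IPs whose count exceeds `limit`.

    `events` is an iterable of dicts with 'timestamp' (datetime),
    'message' (str) and 'source_ip' (str).
    """
    cutoff = now - window
    counts = Counter(
        event["source_ip"]
        for event in events
        if event["timestamp"] > cutoff
        and "authentication" in event["message"].lower()
        and "failed" in event["message"].lower()
    )
    return {ip: count for ip, count in counts.items() if count > limit}
```

Calling `noisy_sources(events, datetime.now())` returns a dictionary of offending IPs and their failure counts, which is the same shape of result the `group by ... having count(*) > 5` clause describes.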
## Deploy Beszel hub on loose-seal
ansible-playbook -i inventory.yml deploy-beszel-hub.yml
## Install agents on all servers
ansible-playbook -i inventory.yml deploy-beszel-agents.yml
## Setup Seq logging
ansible-playbook -i inventory.yml deploy-seq.yml
## Configure Grafana
ansible-playbook -i inventory.yml deploy-grafana.yml
## Docker compose logging configuration
logging:
  driver: gelf
  options:
    gelf-address: udp://seq.speicher.family:12201
    tag: "{{.Name}}"
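
To check the GELF path without waiting for a container to log, a hand-built test message can be sent to the same address. This is a minimal sketch using only the standard library: it assumes a GELF UDP input is actually listening at `seq.speicher.family:12201` and accepts uncompressed JSON datagrams, and the helper names are hypothetical.

```python
import json
import socket
import time


def build_gelf_message(short_message: str, host: str, level: int = 6, **extra) -> bytes:
    """Encode a GELF 1.1 payload as UTF-8 JSON.

    Additional fields are given the underscore prefix that GELF requires.
    """
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity; 6 = informational
    }
    payload.update({f"_{key}": value for key, value in extra.items()})
    return json.dumps(payload).encode("utf-8")


def send_gelf_udp(message: bytes, address: str = "seq.speicher.family", port: int = 12201) -> None:
    """Send one uncompressed GELF datagram (assumes the input accepts plain JSON)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (address, port))
```

For example, `send_gelf_udp(build_gelf_message("GELF pipeline test", host="lucille4", service="smoke-test"))` should produce one event in Seq if the input is reachable. UDP gives no delivery confirmation, so a missing event may mean either a network problem or a missing GELF input.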
## Import Grafana dashboards
curl -X POST https://grafana.speicher.family/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @dashboard-system-overview.json
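
Grafana's dashboard endpoint expects the dashboard model wrapped in a `dashboard` key rather than the bare exported JSON. The sketch below builds such a request with the standard library; the function name is hypothetical, and resetting `id` to null is an assumption that the dashboard is being imported as new rather than updated in place.

```python
import json
import urllib.request


def build_import_request(dashboard: dict, base_url: str, token: str,
                         overwrite: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/dashboards/db."""
    # Clear the numeric id so an export from another instance is treated as new.
    body = {"dashboard": {**dashboard, "id": None}, "overwrite": overwrite}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)`. Building it separately keeps the payload easy to inspect before anything is posted to the server.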
## Configure Beszel monitoring
## Access https://beszel.speicher.family
## Add agents and configure thresholds
#!/bin/bash
## Weekly monitoring maintenance
## Check Seq log retention
curl -X GET "https://seq.speicher.family/api/events?count=1" | jq '.TotalEvents'
## Verify Beszel agent connectivity
curl -X GET "https://beszel.speicher.family/api/agents" | jq '.agents[].status'
## Clean old Grafana snapshots
curl -X DELETE "https://grafana.speicher.family/api/snapshots/old"
## Review alert accuracy and adjust thresholds
echo "Review false positives and missed alerts"
#!/bin/bash
## Monthly monitoring review
## Generate system health report
./scripts/generate-health-report.sh
## Review and optimize alert rules
## Check alert fatigue and effectiveness
## Update monitoring dashboards
## Add new services and metrics
## Capacity planning analysis
## Review growth trends and resource projections
## Check monitoring system performance
docker stats --no-stream beszel seq grafana
## Review data retention settings
## Adjust collection intervals
## Optimize dashboard queries
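
For the capacity planning step, a simple linear projection is often enough to see when a disk will cross the 80% usage target. The sketch below is one such estimate; the function name and the example figures are illustrative assumptions, and real growth is rarely perfectly linear.

```python
def months_until_threshold(used_gb: float, capacity_gb: float,
                           growth_gb_per_month: float,
                           threshold_percent: float = 80.0):
    """Estimate months until usage reaches the threshold under linear growth.

    Returns 0.0 if the threshold is already reached, and None if usage is
    not growing (the threshold would never be reached).
    """
    limit_gb = capacity_gb * threshold_percent / 100
    if used_gb >= limit_gb:
        return 0.0
    if growth_gb_per_month <= 0:
        return None
    return (limit_gb - used_gb) / growth_gb_per_month


# Hypothetical example: a 1000 GB volume with 600 GB used, growing 5 GB/month.
print(months_until_threshold(600, 1000, 5))
```

With those example figures there are 200 GB of headroom below the 80% line, so growth at the 5 GB/month target leaves 40 months before the threshold is reached.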
## Verify agent connectivity
curl -X GET "https://beszel.speicher.family/api/agents"
## Check log forwarding
docker inspect --format '{{.HostConfig.LogConfig.Type}}' container-name
## Test network connectivity
ping -c 4 seq.speicher.family
## Review alert frequency
- Increase thresholds for noisy alerts
- Add time-based conditions (business hours)
- Implement alert grouping and deduplication
- Create escalation policies
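
Deduplication can be as simple as a per-alert cooldown. The sketch below suppresses repeats of the same alert key inside a fixed window; the class name and the choice of key are assumptions for illustration, not a feature of Beszel, Seq, or Grafana.

```python
class AlertDeduplicator:
    """Suppress repeats of the same alert key inside a cooldown window."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of the last delivered alert

    def should_send(self, key: str, now: float) -> bool:
        """Return True if the alert should be delivered at time `now` (unix seconds)."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown; drop the duplicate
        self._last_sent[key] = now
        return True
```

A key such as `"lucille4:cpu"` groups alerts by host and metric, so a flapping CPU alert on one host does not hide a new alert on another. Suppressed alerts do not extend the window in this sketch, so a persistent problem is re-announced once per cooldown period.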