15.2. Scripts, Logs & Alerts#
15.2.1. Building the Core Components#
In this section, we implement the heart of our monitoring system: the scripts that collect data, process it, and trigger alerts.
15.2.1.1. Part 1: Metrics Collector#
The metrics collector runs periodically (via cron or systemd timer) to gather system statistics.
Key Features:
Collects CPU, memory, disk, network metrics
Stores in SQLite database
Handles multi-core systems
Captures context (running processes, open connections)
Graceful error handling if database unavailable
Implementation Strategy:
#!/bin/bash
# metrics-collector.sh - Gather system metrics
source /opt/monitoring-system/lib/logging.sh
source /opt/monitoring-system/lib/db-functions.sh
source /opt/monitoring-system/config/monitoring-system.conf
collect_cpu_metrics() {
# Parse /proc/stat for CPU usage
# Calculate percentage, core count
# Store in database
}
collect_memory_metrics() {
# Read /proc/meminfo
# Calculate available, used, cached
# Detect memory pressure
}
collect_disk_metrics() {
# Use df for all mounted filesystems
# Calculate usage percentage
# Flag critical filesystems
}
collect_network_metrics() {
# Parse /proc/net/dev for interface stats
# Calculate throughput and errors
# Detect failed interfaces
}
main() {
initialize_database
collect_cpu_metrics
collect_memory_metrics
collect_disk_metrics
collect_network_metrics
cleanup_old_metrics
}
main "$@"
15.2.1.2. Part 2: Log Aggregator#
The log aggregator parses system logs and extracts structured events.
Log Sources:
/var/log/auth.log (authentication)
/var/log/syslog (system messages)
/var/log/kern.log (kernel messages)
Application logs (configurable paths)
Remote syslog servers (optional)
Parsing Strategy:
Use regex patterns to extract severity, timestamp, component
Normalize timestamps to UTC
Group related messages (e.g., failed login attempts)
Deduplicate repeated messages
Store in searchable index
Implementation:
parse_syslog_entry() {
local line="$1"
# Extract: timestamp, host, component, severity, message
# Use awk or sed for parsing
# Validate and normalize
}
detect_patterns() {
local message="$1"
# Check against pattern library
# Return severity level
# Return pattern name for alerting
}
aggregate_events() {
# Group similar events by pattern
# Calculate frequency and trend
# Return summary for alerting
}
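As a concrete starting point, `parse_syslog_entry` might use bash's built-in regex matching against the classic `Mon DD HH:MM:SS host tag[pid]: message` layout; the tab-separated output and field order are illustrative choices for this sketch, not the only reasonable ones:

```shell
#!/bin/bash
# Sketch: parse a traditional syslog line into tab-separated fields.
# Assumes the classic "Mon DD HH:MM:SS host tag[pid]: message" layout;
# RFC 5424-style lines would need a different pattern.
parse_syslog_entry() {
    local line="$1"
    local re='^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) ([^ ]+) ([^:[]+)(\[[0-9]+\])?: (.*)$'
    if [[ "$line" =~ $re ]]; then
        # timestamp, host, component, message (the pid group is discarded)
        printf '%s\t%s\t%s\t%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
            "${BASH_REMATCH[3]}" "${BASH_REMATCH[5]}"
    else
        return 1   # unparseable line: caller logs it and moves on
    fi
}

parse_syslog_entry "Oct 12 14:03:27 web01 sshd[4242]: Failed password for root from 10.0.0.5"
```

Emitting tab-separated fields keeps downstream processing with `awk -F'\t'` trivial.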
15.2.1.3. Part 3: Alert Engine#
The alert engine evaluates metrics and events against rules, then triggers notifications.
Alert Rules:
CPU: Alert if > 85% for 5+ minutes
Memory: Alert if > 90% for any sample
Disk: Alert if > 95% on any filesystem
Failed Logins: Alert if 5+ failures in 10 minutes
Errors: Alert if pattern appears 10+ times per hour
Services: Alert if critical service is down
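These rules translate naturally into the `config/thresholds.conf` file that the alert engine sources later in this section. `CPU_THRESHOLD` and `MEMORY_THRESHOLD` are the names the engine actually reads; the remaining variables are illustrative placeholders for rules the sketch does not yet evaluate:

```shell
# config/thresholds.conf - thresholds sourced by alert-engine.sh
CPU_THRESHOLD=85          # percent, sustained for CPU_DURATION seconds
CPU_DURATION=300          # 5 minutes
MEMORY_THRESHOLD=90       # percent, any single sample
DISK_THRESHOLD=95         # percent, any mounted filesystem
FAILED_LOGIN_COUNT=5      # failures within FAILED_LOGIN_WINDOW seconds
FAILED_LOGIN_WINDOW=600   # 10 minutes
ERROR_PATTERN_RATE=10     # pattern occurrences per hour
```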
Notification Channels:
send_critical_alert() {
local subject="$1"
local message="$2"
# Send email to admin
mail -s "$subject" admin@example.com <<< "$message"
# Post to Slack
curl -X POST -d @- webhook.slack.com/... <<< "$json_payload"
# Trigger PagerDuty incident
pagerduty_api_call "$subject"
# Log to audit trail
log_alert "CRITICAL" "$subject" "$message"
}
send_warning_alert() {
local subject="$1"
local message="$2"
# Email only
mail -s "$subject" admin@example.com <<< "$message"
log_alert "WARNING" "$subject" "$message"
}
Alert State Management:
Track alert status (new, acknowledged, resolved)
Prevent alert spam (suppress repeated notifications)
Implement escalation (re-alert after N hours if unresolved)
Store in database for history
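Escalation can reuse the `alerts` table the alert engine creates in section 15.2.4: any unresolved critical alert older than the escalation window is a candidate for re-notification. The 4-hour default and query shape here are assumptions:

```shell
#!/bin/bash
# Sketch: find unresolved CRITICAL alerts older than ESCALATE_AFTER seconds,
# using the alerts table from the alert-engine schema.
ALERT_HISTORY="${ALERT_HISTORY:-/var/lib/myapp/alerts.db}"
ESCALATE_AFTER="${ESCALATE_AFTER:-14400}"   # 4 hours, in seconds

find_stale_alerts() {
    local cutoff=$(( $(date +%s) - ESCALATE_AFTER ))
    # Each row returned is a candidate for a repeat notification
    sqlite3 "$ALERT_HISTORY" \
        "SELECT id, component, message FROM alerts
         WHERE severity='CRITICAL' AND acknowledged=0 AND timestamp < $cutoff;"
}
```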
15.2.1.4. Part 4: Report Generator#
The report generator creates dashboards and exports data.
Report Types:
Real-time Dashboard - Current metrics and alerts
Daily Summary - High-level metrics over 24 hours
Weekly Trend - Performance trends and anomalies
Monthly Capacity - Growth and planning data
Alert Report - Summary of incidents and patterns
Output Formats:
HTML (for web dashboard)
CSV (for spreadsheet analysis)
JSON (for API consumption)
Plain text (for email/syslog)
Report Generation:
generate_html_report() {
# Query database for metric summaries
# Create HTML with embedded charts (using ASCII or gnuplot)
# Add alert history and trends
# Write to /var/www/monitoring/dashboard.html
}
generate_csv_export() {
# Query database for raw metrics
# Format as CSV with headers
# Compress if > 100MB
# Store with timestamp in filename
}
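A fleshed-out `generate_csv_export` can lean on sqlite3's built-in CSV mode; `EXPORT_DIR` and the exact query are assumptions for this sketch:

```shell
#!/bin/bash
# Sketch: CSV export via sqlite3's CSV mode. METRICS_DB matches the path
# used elsewhere in this chapter; EXPORT_DIR is an assumption.
METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
EXPORT_DIR="${EXPORT_DIR:-/var/lib/myapp/exports}"

generate_csv_export() {
    local outfile="$EXPORT_DIR/metrics-$(date +%Y%m%d-%H%M%S).csv"
    mkdir -p "$EXPORT_DIR"
    # -header emits column names; -csv quotes fields as needed
    sqlite3 -header -csv "$METRICS_DB" \
        "SELECT timestamp, usage_percent FROM cpu_metrics ORDER BY timestamp;" \
        > "$outfile"
    # Compress large exports (the 100MB cutoff comes from the plan above)
    if [[ $(stat -c%s "$outfile") -gt $((100 * 1024 * 1024)) ]]; then
        gzip "$outfile"
        outfile="$outfile.gz"
    fi
    echo "$outfile"
}
```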
15.2.2. Integration Points#
15.2.2.1. Data Flow#
metrics-collector.sh runs every 1 minute
Gathers metrics → Stores in SQLite
log-aggregator.sh runs every 5 minutes
Parses logs → Extracts events → Updates database
alert-engine.sh runs every 2 minutes
Queries metrics and events
Evaluates rules
Sends notifications if triggered
report-generator.sh runs at scheduled times
6am: Daily report
Every 4 hours: Real-time dashboard refresh
health-check.sh runs every 15 minutes
Validates all components
Restarts failed processes
Alerts if system degraded
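The schedule above maps directly onto cron. A sketch of `/etc/cron.d/monitoring-system`, assuming a dedicated `monitor` user and the install path used earlier in the chapter (the `--daily` and `--dashboard` flags are hypothetical):

```
# /etc/cron.d/monitoring-system - schedule from the data flow above
*    *   * * *   monitor  /opt/monitoring-system/src/metrics-collector.sh
*/5  *   * * *   monitor  /opt/monitoring-system/src/log-aggregator.sh
*/2  *   * * *   monitor  /opt/monitoring-system/src/alert-engine.sh
0    6   * * *   monitor  /opt/monitoring-system/src/report-generator.sh --daily
0    */4 * * *   monitor  /opt/monitoring-system/src/report-generator.sh --dashboard
*/15 *   * * *   monitor  /opt/monitoring-system/src/health-check.sh
```

systemd timers would work equally well; cron is shown here because it needs no unit files.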
15.2.2.2. Error Handling and Resilience#
Component Failures:
Each script runs independently
Graceful degradation if dependencies missing
Fallback to local-only monitoring if central DB down
Retry logic for external services (email, webhooks)
Database Resilience:
Write-ahead logging enabled
Automatic backup before schema changes
Corruption detection and repair
Transaction support for consistency
Log Processing Resilience:
Handle missing or unreadable log files
Process partial entries gracefully
Catch and log parsing errors
Maintain state file for resume-on-restart
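The state-file idea can be as small as one saved byte offset per log file. A sketch, with `STATE_DIR` as an assumed location:

```shell
#!/bin/bash
# Sketch: resume log processing from a saved byte offset, one state file
# per monitored log. STATE_DIR is an assumed location.
STATE_DIR="${STATE_DIR:-/var/lib/myapp/state}"

process_log_incrementally() {
    local logfile="$1"
    local state_file="$STATE_DIR/$(basename "$logfile").offset"
    mkdir -p "$STATE_DIR"
    local offset=0
    [[ -f "$state_file" ]] && offset=$(<"$state_file")
    local size
    size=$(stat -c%s "$logfile")
    # If the file shrank, it was rotated: start over from byte 0
    (( size < offset )) && offset=0
    # Emit only the bytes appended since the last run
    tail -c "+$((offset + 1))" "$logfile"
    echo "$size" > "$state_file"
}
```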
15.2.3. Production Considerations#
Performance:
Use efficient SQL queries with proper indexes
Limit historical data retention
Parallel processing where possible
Monitor the monitors (recursive monitoring)
Security:
Restrict file permissions on config files (600)
Sanitize user input in alert messages
Use temporary files securely (mktemp)
Audit trail for all configuration changes
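The mktemp point is worth a concrete pattern, since predictable temp paths are a classic symlink-attack vector in shell scripts:

```shell
#!/bin/bash
# Sketch: secure temp-file handling. mktemp creates the file atomically
# with 0600 permissions, and the trap guarantees cleanup on any exit.
tmpfile=$(mktemp /tmp/monitoring.XXXXXX) || exit 1
trap 'rm -f "$tmpfile"' EXIT

# Stage intermediate data in the private temp file instead of a
# predictable path like /tmp/monitoring.tmp
df -P > "$tmpfile"
wc -l < "$tmpfile"
```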
Scalability:
Design for 50+ monitored hosts
Use remote syslog collection
Implement log rotation and archival
Consider clustering for HA (future)
15.2.4. Implementing the Alert Engine#
The alert engine is the intelligence of your system:
#!/bin/bash
# alert-engine.sh - Evaluate metrics and trigger alerts
source lib/logging.sh
METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
ALERT_HISTORY="${ALERT_HISTORY:-/var/lib/myapp/alerts.db}"
# Load configuration with thresholds
source config/thresholds.conf
# Initialize alert database
init_alert_db() {
sqlite3 "$ALERT_HISTORY" << 'EOF'
CREATE TABLE IF NOT EXISTS alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp INTEGER,
severity TEXT,
component TEXT,
message TEXT,
acknowledged BOOLEAN DEFAULT 0,
acknowledged_at INTEGER
);
CREATE TABLE IF NOT EXISTS alert_suppression (
component TEXT PRIMARY KEY,
suppressed_until INTEGER
);
EOF
}
# Check if alert should be suppressed (prevent spam)
is_alert_suppressed() {
local component="$1"
local now=$(date +%s)
local suppressed_until=$(sqlite3 "$ALERT_HISTORY" \
"SELECT suppressed_until FROM alert_suppression WHERE component='$component';")
if [[ -n "$suppressed_until" ]] && [[ $suppressed_until -gt $now ]]; then
return 0 # Suppressed
fi
return 1 # Not suppressed
}
# Send alert notification
send_alert() {
local severity="$1"
local component="$2"
local message="$3"
local timestamp=$(date +%s)
    # Store in alert database (values are interpolated into the SQL, so
    # sanitize them per the security considerations in 15.2.3)
sqlite3 "$ALERT_HISTORY" \
"INSERT INTO alerts (timestamp, severity, component, message)
VALUES ($timestamp, '$severity', '$component', '$message');"
log_warning "Alert [$severity] $component: $message"
# Send notifications based on severity
case "$severity" in
CRITICAL)
send_email_alert "$component" "$message"
send_slack_alert "danger" "$component" "$message"
;;
WARNING)
send_email_alert "$component" "$message"
;;
INFO)
log_info "$message"
;;
esac
# Suppress duplicate alerts for 1 hour
local suppress_until=$((timestamp + 3600))
sqlite3 "$ALERT_HISTORY" \
"INSERT OR REPLACE INTO alert_suppression (component, suppressed_until)
VALUES ('$component', $suppress_until);"
}
# Evaluate metrics against thresholds
evaluate_alerts() {
log_info "Evaluating alert conditions"
# Get latest metrics
local latest_cpu=$(sqlite3 "$METRICS_DB" \
"SELECT usage_percent FROM cpu_metrics ORDER BY timestamp DESC LIMIT 1;")
local latest_memory=$(sqlite3 "$METRICS_DB" \
"SELECT usage_percent FROM memory_metrics ORDER BY timestamp DESC LIMIT 1;")
    # Check CPU threshold (skip the comparison if the query returned nothing)
    if [[ -n "$latest_cpu" ]] && (( $(echo "$latest_cpu > $CPU_THRESHOLD" | bc -l) )); then
        if ! is_alert_suppressed "cpu_high"; then
            send_alert "WARNING" "cpu_high" "CPU usage is ${latest_cpu}% (threshold: ${CPU_THRESHOLD}%)"
        fi
    fi
    # Check memory threshold
    if [[ -n "$latest_memory" ]] && (( $(echo "$latest_memory > $MEMORY_THRESHOLD" | bc -l) )); then
        if ! is_alert_suppressed "memory_high"; then
            send_alert "WARNING" "memory_high" "Memory usage is ${latest_memory}% (threshold: ${MEMORY_THRESHOLD}%)"
        fi
    fi
}
main() {
init_alert_db
evaluate_alerts
}
main "$@"
15.2.5. Testing Integration Points#
#!/bin/bash
# integration-test.sh - Test component interactions
# Create test database
TEST_DB="/tmp/test-metrics-$$.db"
TEST_ALERTS="/tmp/test-alerts-$$.db"
setup_test_env() {
    # The heredoc must be unquoted so $now expands; a quoted 'EOF' would
    # hand SQLite the literal text $(date +%s) and fail
    local now=$(date +%s)
    sqlite3 "$TEST_DB" << EOF
CREATE TABLE cpu_metrics (
    timestamp INTEGER PRIMARY KEY,
    usage_percent REAL
);
CREATE TABLE memory_metrics (
    timestamp INTEGER PRIMARY KEY,
    usage_percent REAL
);
INSERT INTO cpu_metrics VALUES ($now, 88.5);
INSERT INTO memory_metrics VALUES ($now, 40.0);
EOF
}
test_alert_on_high_cpu() {
    # Run the alert engine against the test data; the alert history must
    # also point at a test database, because alerts are written to
    # ALERT_HISTORY, not to the metrics database
    METRICS_DB="$TEST_DB" ALERT_HISTORY="$TEST_ALERTS" CPU_THRESHOLD=85 ./src/alert-engine.sh
    # Verify the alert was created
    local alert_count=$(sqlite3 "$TEST_ALERTS" \
        "SELECT COUNT(*) FROM alerts WHERE severity='WARNING' AND component='cpu_high';")
    if [[ "$alert_count" -gt 0 ]]; then
        echo "✓ Test: Alert triggered for high CPU"
    else
        echo "✗ Test: Alert NOT triggered for high CPU"
        return 1
    fi
}
cleanup() {
    rm -f "$TEST_DB" "$TEST_ALERTS"
}
setup_test_env
test_alert_on_high_cpu
cleanup
#!/bin/bash
# Example: Implementing a simple metrics collector
source lib/logging.sh
source lib/validation.sh
# Configuration
METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
COLLECTION_INTERVAL="${COLLECTION_INTERVAL:-60}"
# Ensure database exists
init_database() {
if [[ ! -f "$METRICS_DB" ]]; then
log_info "Initializing metrics database"
sqlite3 "$METRICS_DB" << 'EOF'
CREATE TABLE IF NOT EXISTS cpu_metrics (
timestamp INTEGER PRIMARY KEY,
usage_percent REAL,
user_time INTEGER,
system_time INTEGER
);
CREATE TABLE IF NOT EXISTS memory_metrics (
timestamp INTEGER PRIMARY KEY,
total_mb INTEGER,
used_mb INTEGER,
free_mb INTEGER,
usage_percent REAL
);
-- timestamp is an INTEGER PRIMARY KEY (the rowid alias), so both tables
-- are already indexed on it; no extra index is needed
EOF
fi
}
# Collect CPU metrics
collect_cpu_metrics() {
local timestamp=$(date +%s)
    # Read /proc/stat (simplified: a single read gives averages since boot;
    # a production collector would diff two samples taken an interval apart)
local cpu_info=$(grep "^cpu " /proc/stat)
# Parse values (user, nice, system, idle)
local user=$(echo "$cpu_info" | awk '{print $2}')
local system=$(echo "$cpu_info" | awk '{print $4}')
local idle=$(echo "$cpu_info" | awk '{print $5}')
local total=$((user + system + idle))
local usage=$(awk "BEGIN {printf \"%.1f\", (($user + $system) / $total) * 100}")
# Store in database
sqlite3 "$METRICS_DB" \
"INSERT INTO cpu_metrics VALUES ($timestamp, $usage, $user, $system);"
log_info "CPU metric recorded: ${usage}%"
}
# Collect memory metrics
collect_memory_metrics() {
local timestamp=$(date +%s)
    # Read /proc/meminfo; MemAvailable accounts for reclaimable cache and
    # buffers, so it reflects real memory pressure better than MemFree
    local meminfo=$(cat /proc/meminfo)
    local total=$(echo "$meminfo" | awk '/^MemTotal/ {print int($2/1024)}')
    local free=$(echo "$meminfo" | awk '/^MemAvailable/ {print int($2/1024)}')
local used=$((total - free))
local usage=$(awk "BEGIN {printf \"%.1f\", ($used / $total) * 100}")
# Store in database
sqlite3 "$METRICS_DB" \
"INSERT INTO memory_metrics VALUES ($timestamp, $total, $used, $free, $usage);"
log_info "Memory metric recorded: ${usage}% ($used/${total}MB)"
}
# Main collection function
main() {
log_info "Starting metrics collection"
init_database
collect_cpu_metrics
collect_memory_metrics
log_info "Metrics collection complete"
}
main "$@"
15.2.6. Modular Library Design#
Extract reusable functions into libraries to avoid code duplication:
15.2.6.1. lib/logging.sh#
#!/bin/bash
# Logging utility functions
LOG_FILE="${LOG_FILE:-/var/log/myapp/app.log}"
LOG_LEVEL="${LOG_LEVEL:-INFO}"
DEBUG_MODE="${DEBUG_MODE:-0}"
# Log levels: DEBUG=0, INFO=1, WARNING=2, ERROR=3
declare -A LOG_LEVELS=([DEBUG]=0 [INFO]=1 [WARNING]=2 [ERROR]=3)
log() {
local level="$1"
shift
local message="$@"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
# Check if message should be logged based on level
if [[ "${LOG_LEVELS[$level]}" -ge "${LOG_LEVELS[$LOG_LEVEL]}" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
# Also print to stderr if DEBUG mode
if [[ "$DEBUG_MODE" == "1" ]] && [[ "$level" != "DEBUG" ]]; then
echo "[$timestamp] [$level] $message" >&2
fi
fi
}
# log_debug returns 0 even when DEBUG_MODE is off, so callers running
# under `set -e` are not tripped by the failed [[ ]] test
log_debug() { [[ "$DEBUG_MODE" == "1" ]] && log "DEBUG" "$@"; return 0; }
log_info() { log "INFO" "$@"; }
log_warning() { log "WARNING" "$@"; }
log_error() { log "ERROR" "$@"; }
log_function_entry() {
    local func="$1"
    # ${*:2} skips the function name itself when logging the arguments
    [[ "$DEBUG_MODE" == "1" ]] && log_debug "→ Entering $func with args: ${*:2}"
    return 0
}
log_function_exit() {
    local func="$1"
    local result="$2"
    [[ "$DEBUG_MODE" == "1" ]] && log_debug "← Exiting $func (result: $result)"
    return 0
}
15.2.6.2. lib/validation.sh#
#!/bin/bash
# Input validation functions
validate_not_empty() {
local var_name="$1"
local var_value="$2"
if [[ -z "$var_value" ]]; then
log_error "$var_name cannot be empty"
return 1
fi
}
validate_directory() {
local dir="$1"
if [[ ! -d "$dir" ]]; then
log_error "Directory not found: $dir"
return 1
fi
if [[ ! -w "$dir" ]]; then
log_error "Directory not writable: $dir"
return 1
fi
}
validate_file_readable() {
local file="$1"
if [[ ! -r "$file" ]]; then
log_error "File not readable: $file"
return 1
fi
}
validate_command_exists() {
local cmd="$1"
if ! command -v "$cmd" &> /dev/null; then
log_error "Required command not found: $cmd"
return 1
fi
}
validate_integer() {
local var_name="$1"
local var_value="$2"
if ! [[ "$var_value" =~ ^[0-9]+$ ]]; then
log_error "$var_name must be an integer, got: $var_value"
return 1
fi
}
15.2.6.3. lib/arrays.sh#
#!/bin/bash
# Array utility functions
# Check if array contains element
array_contains() {
    local -n arr=$1
    local value="$2"
    local element   # declared local so the loop variable does not leak
    for element in "${arr[@]}"; do
        [[ "$element" == "$value" ]] && return 0
    done
    return 1
}
# Get array length
array_length() {
local -n arr=$1
echo "${#arr[@]}"
}
# Filter array based on condition
array_filter() {
local -n arr=$1
local pattern="$2"
local -a result=()
for element in "${arr[@]}"; do
if [[ "$element" =~ $pattern ]]; then
result+=("$element")
fi
done
printf '%s\n' "${result[@]}"
}
# Join array elements
array_join() {
local separator="$1"
shift
local -a arr=("$@")
local result=""
for ((i=0; i<${#arr[@]}; i++)); do
result+="${arr[$i]}"
[[ $i -lt $((${#arr[@]}-1)) ]] && result+="$separator"
done
echo "$result"
}
# Sort array
array_sort() {
local -n arr=$1
mapfile -t arr < <(printf '%s\n' "${arr[@]}" | sort)
}
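A short usage pass over these helpers; `array_contains` and `array_join` are repeated (lightly adapted) from the library above so the snippet runs standalone:

```shell
#!/bin/bash
# Usage demo for the array helpers from lib/arrays.sh.
array_contains() {
    local -n arr=$1
    local value="$2"
    local element
    for element in "${arr[@]}"; do
        [[ "$element" == "$value" ]] && return 0
    done
    return 1
}
array_join() {
    local separator="$1"
    shift
    local -a arr=("$@")
    local result=""
    local i
    for ((i=0; i<${#arr[@]}; i++)); do
        result+="${arr[$i]}"
        [[ $i -lt $((${#arr[@]}-1)) ]] && result+="$separator"
    done
    echo "$result"
}

hosts=(web01 db01 cache01)
array_contains hosts "db01" && echo "db01 is monitored"
array_join ", " "${hosts[@]}"   # prints: web01, db01, cache01
```

Note that `array_contains` takes the array *name* (a nameref), while `array_join` takes the expanded elements; mixing the two calling conventions up is a common slip.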
15.2.7. Data Flow and Integration#
Your main script orchestrates the flow:
#!/bin/bash
# main.sh - Orchestrates the monitoring system
source lib/logging.sh
source lib/validation.sh
source lib/arrays.sh
# Load configuration
source config/app.conf
# Initialize
initialize_system() {
log_info "Starting system initialization"
validate_directory "$LOG_PATH"
validate_directory "$DATA_PATH"
validate_command_exists "sqlite3"
log_info "System initialized successfully"
}
# Main execution
main() {
initialize_system
log_info "Collecting metrics..."
./src/metrics-collector.sh
log_info "Aggregating logs..."
./src/log-aggregator.sh
log_info "Evaluating alerts..."
./src/alert-engine.sh
log_info "Generating reports..."
./src/report-generator.sh
log_info "Cycle complete"
}
main "$@"