15.2. Scripts, Logs & Alerts#

15.2.1. Building the Core Components#

In this section, we implement the heart of our monitoring system: the scripts that collect data, process it, and trigger alerts.

15.2.1.1. Part 1: Metrics Collector#

The metrics collector runs periodically (via cron or systemd timer) to gather system statistics.

Key Features:

  • Collects CPU, memory, disk, network metrics

  • Stores in SQLite database

  • Handles multi-core systems

  • Captures context (running processes, open connections)

  • Graceful error handling if database unavailable

Implementation Strategy:

#!/bin/bash
# metrics-collector.sh - Gather system metrics

source /opt/monitoring-system/lib/logging.sh
source /opt/monitoring-system/lib/db-functions.sh
source /opt/monitoring-system/config/monitoring-system.conf

collect_cpu_metrics() {
  # Parse /proc/stat for CPU usage
  # Calculate percentage, core count
  # Store in database
}

collect_memory_metrics() {
  # Read /proc/meminfo
  # Calculate available, used, cached
  # Detect memory pressure
}

collect_disk_metrics() {
  # Use df for all mounted filesystems
  # Calculate usage percentage
  # Flag critical filesystems
}

collect_network_metrics() {
  # Parse /proc/net/dev for interface stats
  # Calculate throughput and errors
  # Detect failed interfaces
}

main() {
  initialize_database
  collect_cpu_metrics
  collect_memory_metrics
  collect_disk_metrics
  collect_network_metrics
  cleanup_old_metrics
}

main "$@"
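The skeleton's cleanup_old_metrics step is not shown above; a minimal sketch, assuming each metrics table keys rows by a Unix-epoch timestamp column and a RETENTION_DAYS setting from the configuration, could look like:

```shell
#!/bin/bash
# Hypothetical sketch of cleanup_old_metrics. Assumes the metrics tables
# used later in this section (cpu_metrics, memory_metrics), each with a
# Unix-epoch "timestamp" column, and a RETENTION_DAYS config value.

METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

cleanup_old_metrics() {
  local cutoff table
  # Anything older than the cutoff is dropped
  cutoff=$(( $(date +%s) - RETENTION_DAYS * 86400 ))

  for table in cpu_metrics memory_metrics; do
    sqlite3 "$METRICS_DB" "DELETE FROM $table WHERE timestamp < $cutoff;"
  done
}
```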

15.2.1.2. Part 2: Log Aggregator#

The log aggregator parses system logs and extracts structured events.

Log Sources:

  • /var/log/auth.log (authentication)

  • /var/log/syslog (system messages)

  • /var/log/kern.log (kernel messages)

  • Application logs (configurable paths)

  • Remote syslog servers (optional)

Parsing Strategy:

  • Use regex patterns to extract severity, timestamp, component

  • Normalize timestamps to UTC

  • Group related messages (e.g., failed login attempts)

  • Deduplicate repeated messages

  • Store in searchable index

Implementation:

parse_syslog_entry() {
  local line="$1"
  # Extract: timestamp, host, component, severity, message
  # Use awk or sed for parsing
  # Validate and normalize
}

detect_patterns() {
  local message="$1"
  # Check against pattern library
  # Return severity level
  # Return pattern name for alerting
}

aggregate_events() {
  # Group similar events by pattern
  # Calculate frequency and trend
  # Return summary for alerting
}
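As a concrete illustration of the parsing strategy, here is one sketch of parse_syslog_entry for the classic "Mon DD HH:MM:SS host tag[pid]: message" layout. UTC normalization and validation are left out, the "|" field separator is an arbitrary choice, and the message extraction assumes the tag[pid]: form:

```shell
#!/bin/bash
# Sketch of parse_syslog_entry for traditional (RFC 3164-style) lines,
# e.g. "Jan 12 03:04:05 web01 sshd[4321]: Failed password for root".

parse_syslog_entry() {
  local line="$1"
  local ts host tag message

  ts=$(awk '{print $1, $2, $3}' <<< "$line")    # "Jan 12 03:04:05"
  host=$(awk '{print $4}' <<< "$line")          # "web01"
  tag=$(awk '{print $5}' <<< "$line")           # "sshd[4321]:"
  tag=${tag%%:*}                                # drop trailing colon
  tag=${tag%%\[*}                               # drop "[pid]" suffix
  message=${line#*]: }                          # text after "tag[pid]: "

  printf '%s|%s|%s|%s\n' "$ts" "$host" "$tag" "$message"
}
```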

15.2.1.3. Part 3: Alert Engine#

The alert engine evaluates metrics and events against rules, then triggers notifications.

Alert Rules:

  • CPU: Alert if > 85% for 5+ minutes

  • Memory: Alert if > 90% for any sample

  • Disk: Alert if > 95% on any filesystem

  • Failed Logins: Alert if 5+ failures in 10 minutes

  • Errors: Alert if pattern appears 10+ times per hour

  • Services: Alert if critical service is down
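These rules can live in a sourceable config file; the alert engine later in this section assumes CPU_THRESHOLD and MEMORY_THRESHOLD, while the remaining names below are illustrative:

```shell
# config/thresholds.conf - alert rule settings (names beyond the first
# two are illustrative; adjust to your own rule set)

CPU_THRESHOLD=85            # percent, sustained
CPU_SUSTAINED_MINUTES=5
MEMORY_THRESHOLD=90         # percent, any sample
DISK_THRESHOLD=95           # percent, any filesystem
FAILED_LOGIN_COUNT=5        # failures...
FAILED_LOGIN_WINDOW=600     # ...within this many seconds
ERROR_PATTERN_PER_HOUR=10   # repeated error-pattern threshold
CRITICAL_SERVICES="sshd cron"
```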

Notification Channels:

send_critical_alert() {
  local subject="$1"
  local message="$2"
  
  # Send email to admin
  mail -s "$subject" admin@example.com <<< "$message"
  
  # Post to Slack (assumes $json_payload is built from subject/message)
  curl -X POST -d @- webhook.slack.com/... <<< "$json_payload"
  
  # Trigger PagerDuty incident
  pagerduty_api_call "$subject"
  
  # Log to audit trail
  log_alert "CRITICAL" "$subject" "$message"
}

send_warning_alert() {
  local subject="$1"
  local message="$2"
  
  # Email only
  mail -s "$subject" admin@example.com <<< "$message"
  log_alert "WARNING" "$subject" "$message"
}

Alert State Management:

  • Track alert status (new, acknowledged, resolved)

  • Prevent alert spam (suppress repeated notifications)

  • Implement escalation (re-alert after N hours if unresolved)

  • Store in database for history
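The escalation item might be implemented as a periodic query over the alerts table (the schema matches the alert engine later in this section; the helper name and the re-alert window are illustrative):

```shell
#!/bin/bash
# Sketch of an escalation pass: re-surface alerts that are still
# unacknowledged after ESCALATE_AFTER_HOURS. Assumes the "alerts" table
# (timestamp, severity, component, message, acknowledged) used by the
# alert engine in this section.

ALERT_HISTORY="${ALERT_HISTORY:-/var/lib/myapp/alerts.db}"
ESCALATE_AFTER_HOURS="${ESCALATE_AFTER_HOURS:-4}"

escalate_stale_alerts() {
  local cutoff
  cutoff=$(( $(date +%s) - ESCALATE_AFTER_HOURS * 3600 ))

  # One "component|message" line per stale, unacknowledged alert
  sqlite3 "$ALERT_HISTORY" \
    "SELECT component, message FROM alerts
     WHERE acknowledged = 0 AND timestamp < $cutoff;" |
  while IFS='|' read -r component message; do
    echo "ESCALATE: $component still unresolved: $message"
  done
}
```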

15.2.1.4. Part 4: Report Generator#

The report generator creates dashboards and exports data.

Report Types:

  1. Real-time Dashboard - Current metrics and alerts

  2. Daily Summary - High-level metrics over 24 hours

  3. Weekly Trend - Performance trends and anomalies

  4. Monthly Capacity - Growth and planning data

  5. Alert Report - Summary of incidents and patterns

Output Formats:

  • HTML (for web dashboard)

  • CSV (for spreadsheet analysis)

  • JSON (for API consumption)

  • Plain text (for email/syslog)

Report Generation:

generate_html_report() {
  # Query database for metric summaries
  # Create HTML with embedded charts (using ASCII or gnuplot)
  # Add alert history and trends
  # Write to /var/www/monitoring/dashboard.html
}

generate_csv_export() {
  # Query database for raw metrics
  # Format as CSV with headers
  # Compress if > 100MB
  # Store with timestamp in filename
}
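A sketch of generate_csv_export using sqlite3's built-in CSV mode against the cpu_metrics schema from the collector. The export directory is illustrative, and the 100MB compression rule from the stub is kept:

```shell
#!/bin/bash
# Sketch of generate_csv_export. Assumes the cpu_metrics table from the
# collector; EXPORT_DIR is an illustrative location.

METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
EXPORT_DIR="${EXPORT_DIR:-/var/lib/myapp/exports}"

generate_csv_export() {
  local outfile="$EXPORT_DIR/cpu-metrics-$(date +%Y%m%d-%H%M%S).csv"
  mkdir -p "$EXPORT_DIR"

  # -header -csv gives a header row plus comma-separated values
  sqlite3 -header -csv "$METRICS_DB" \
    "SELECT timestamp, usage_percent FROM cpu_metrics ORDER BY timestamp;" \
    > "$outfile"

  # Compress if the export exceeds 100MB (GNU stat, with a BSD fallback)
  local size
  size=$(stat -c %s "$outfile" 2>/dev/null || stat -f %z "$outfile")
  if [[ "$size" -gt 104857600 ]]; then
    gzip "$outfile"
    outfile="$outfile.gz"
  fi

  echo "$outfile"
}
```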

15.2.2. Integration Points#

15.2.2.1. Data Flow#

  1. metrics-collector.sh runs every 1 minute

    • Gathers metrics → Stores in SQLite

  2. log-aggregator.sh runs every 5 minutes

    • Parses logs → Extracts events → Updates database

  3. alert-engine.sh runs every 2 minutes

    • Queries metrics and events

    • Evaluates rules

    • Sends notifications if triggered

  4. report-generator.sh runs at scheduled times

    • 6am: Daily report

    • Every 4 hours: Real-time dashboard refresh

  5. health-check.sh runs every 15 minutes

    • Validates all components

    • Restarts failed processes

    • Alerts if system degraded
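One way to wire this schedule up with cron (paths assume the /opt/monitoring-system layout used earlier; the report-generator arguments are illustrative, and systemd timers work equally well):

```
# /etc/cron.d/monitoring - one possible mapping of the schedule above
* * * * *     root /opt/monitoring-system/src/metrics-collector.sh
*/5 * * * *   root /opt/monitoring-system/src/log-aggregator.sh
*/2 * * * *   root /opt/monitoring-system/src/alert-engine.sh
0 6 * * *     root /opt/monitoring-system/src/report-generator.sh daily
0 */4 * * *   root /opt/monitoring-system/src/report-generator.sh dashboard
*/15 * * * *  root /opt/monitoring-system/src/health-check.sh
```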

15.2.2.2. Error Handling and Resilience#

Component Failures:

  • Each script runs independently

  • Graceful degradation if dependencies missing

  • Fallback to local-only monitoring if central DB down

  • Retry logic for external services (email, webhooks)

Database Resilience:

  • Write-ahead logging enabled

  • Automatic backup before schema changes

  • Corruption detection and repair

  • Transaction support for consistency
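Write-ahead logging is a persistent property of the database file, so enabling it once is enough; a small sketch:

```shell
#!/bin/bash
# Enable write-ahead logging on the metrics database. WAL persists in
# the database header, so this only needs to succeed once per file.

METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"

enable_wal() {
  local mode
  # sqlite3 echoes the resulting journal mode ("wal" on success)
  mode=$(sqlite3 "$METRICS_DB" "PRAGMA journal_mode=WAL;")
  if [[ "$mode" != "wal" ]]; then
    echo "WARNING: could not enable WAL (got: $mode)" >&2
    return 1
  fi
}
```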

Log Processing Resilience:

  • Handle missing or unreadable log files

  • Process partial entries gracefully

  • Catch and log parsing errors

  • Maintain state file for resume-on-restart
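The resume-on-restart state file can be as simple as a recorded byte offset per log. This sketch (the state directory and helper name are illustrative) also handles rotation by starting over when the file shrinks:

```shell
#!/bin/bash
# Sketch of incremental log processing: remember how many bytes of each
# log were already handled and emit only the unseen portion next run.

STATE_DIR="${STATE_DIR:-/var/lib/myapp/state}"

process_log_incrementally() {
  local logfile="$1"
  local state_file="$STATE_DIR/$(basename "$logfile").offset"
  mkdir -p "$STATE_DIR"

  local offset=0
  [[ -f "$state_file" ]] && offset=$(<"$state_file")

  local size
  size=$(wc -c < "$logfile")
  # Rotation: the file shrank, so start over from the beginning
  (( size < offset )) && offset=0

  # Emit the unseen bytes, then persist the new offset for the next run
  tail -c +"$((offset + 1))" "$logfile"
  echo "$size" > "$state_file"
}
```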

15.2.3. Production Considerations#

Performance:

  • Use efficient SQL queries with proper indexes

  • Limit historical data retention

  • Parallel processing where possible

  • Monitor the monitors (recursive monitoring)

Security:

  • Restrict file permissions on config files (600)

  • Sanitize user input in alert messages

  • Use temporary files securely (mktemp)

  • Audit trail for all configuration changes
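The mktemp point in practice: the file is created with mode 0600 and an unpredictable name, and a trap guarantees cleanup even if the script exits early:

```shell
#!/bin/bash
# Secure temporary-file handling: mktemp creates the file atomically
# with mode 0600; the EXIT trap removes it on any exit path.

tmpfile=$(mktemp) || exit 1
trap 'rm -f "$tmpfile"' EXIT

# Example use: stage an alert message before sending
printf 'disk usage at 96%% on /var\n' > "$tmpfile"
```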

Scalability:

  • Design for 50+ monitored hosts

  • Use remote syslog collection

  • Implement log rotation and archival

  • Consider clustering for HA (future)

15.2.4. Implementing the Alert Engine#

The alert engine is the intelligence of your system:

#!/bin/bash
# alert-engine.sh - Evaluate metrics and trigger alerts

source lib/logging.sh

METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
ALERT_HISTORY="${ALERT_HISTORY:-/var/lib/myapp/alerts.db}"

# Load configuration with thresholds
source config/thresholds.conf

# Initialize alert database
init_alert_db() {
  sqlite3 "$ALERT_HISTORY" << 'EOF'
CREATE TABLE IF NOT EXISTS alerts (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp INTEGER,
  severity TEXT,
  component TEXT,
  message TEXT,
  acknowledged BOOLEAN DEFAULT 0,
  acknowledged_at INTEGER
);

CREATE TABLE IF NOT EXISTS alert_suppression (
  component TEXT PRIMARY KEY,
  suppressed_until INTEGER
);
EOF
}

# Check if alert should be suppressed (prevent spam)
is_alert_suppressed() {
  local component="$1"
  local now=$(date +%s)
  
  local suppressed_until=$(sqlite3 "$ALERT_HISTORY" \
    "SELECT suppressed_until FROM alert_suppression WHERE component='$component';")
  
  if [[ -n "$suppressed_until" ]] && [[ $suppressed_until -gt $now ]]; then
    return 0  # Suppressed
  fi
  return 1  # Not suppressed
}

# Send alert notification
send_alert() {
  local severity="$1"
  local component="$2"
  local message="$3"
  local timestamp=$(date +%s)
  
  # Store in alert database
  sqlite3 "$ALERT_HISTORY" \
    "INSERT INTO alerts (timestamp, severity, component, message) 
     VALUES ($timestamp, '$severity', '$component', '$message');"
  
  log_warning "Alert [$severity] $component: $message"
  
  # Send notifications based on severity
  case "$severity" in
    CRITICAL)
      send_email_alert "$component" "$message"
      send_slack_alert "danger" "$component" "$message"
      ;;
    WARNING)
      send_email_alert "$component" "$message"
      ;;
    INFO)
      log_info "$message"
      ;;
  esac
  
  # Suppress duplicate alerts for 1 hour
  local suppress_until=$((timestamp + 3600))
  sqlite3 "$ALERT_HISTORY" \
    "INSERT OR REPLACE INTO alert_suppression (component, suppressed_until) 
     VALUES ('$component', $suppress_until);"
}

# Evaluate metrics against thresholds
evaluate_alerts() {
  log_info "Evaluating alert conditions"
  
  # Get latest metrics
  local latest_cpu=$(sqlite3 "$METRICS_DB" \
    "SELECT usage_percent FROM cpu_metrics ORDER BY timestamp DESC LIMIT 1;")
  
  local latest_memory=$(sqlite3 "$METRICS_DB" \
    "SELECT usage_percent FROM memory_metrics ORDER BY timestamp DESC LIMIT 1;")
  
  # Check CPU threshold (guard against an empty result if no metrics exist)
  if [[ -n "$latest_cpu" ]] && (( $(echo "$latest_cpu > $CPU_THRESHOLD" | bc -l) )); then
    if ! is_alert_suppressed "cpu_high"; then
      send_alert "WARNING" "cpu_high" "CPU usage is ${latest_cpu}% (threshold: ${CPU_THRESHOLD}%)"
    fi
  fi
  
  # Check memory threshold (same guard)
  if [[ -n "$latest_memory" ]] && (( $(echo "$latest_memory > $MEMORY_THRESHOLD" | bc -l) )); then
    if ! is_alert_suppressed "memory_high"; then
      send_alert "WARNING" "memory_high" "Memory usage is ${latest_memory}% (threshold: ${MEMORY_THRESHOLD}%)"
    fi
  fi
}

main() {
  init_alert_db
  evaluate_alerts
}

main "$@"

15.2.5. Testing Integration Points#

#!/bin/bash
# integration-test.sh - Test component interactions

# Create test database
TEST_DB="/tmp/test-metrics-$$.db"

setup_test_env() {
  # The heredoc delimiter must be unquoted so $(date +%s) expands;
  # a quoted 'EOF' would pass the literal text through to sqlite3
  sqlite3 "$TEST_DB" << EOF
CREATE TABLE cpu_metrics (
  timestamp INTEGER PRIMARY KEY,
  usage_percent REAL
);
CREATE TABLE memory_metrics (
  timestamp INTEGER PRIMARY KEY,
  usage_percent REAL
);
INSERT INTO cpu_metrics VALUES ($(date +%s), 88.5);
INSERT INTO memory_metrics VALUES ($(date +%s), 40.0);
EOF
}

test_alert_on_high_cpu() {
  # Run alert engine with test data; point the alert history at the
  # test database too, so the result can be verified in one place
  METRICS_DB="$TEST_DB" ALERT_HISTORY="$TEST_DB" \
    CPU_THRESHOLD=85 MEMORY_THRESHOLD=90 ./src/alert-engine.sh

  # Verify alert was created
  local alert_count=$(sqlite3 "$TEST_DB" \
    "SELECT COUNT(*) FROM alerts WHERE severity='WARNING' AND component='cpu_high';")
  
  if [[ "$alert_count" -gt 0 ]]; then
    echo "✓ Test: Alert triggered for high CPU"
  else
    echo "✗ Test: Alert NOT triggered for high CPU"
    return 1
  fi
}

cleanup() {
  rm -f "$TEST_DB"
}

setup_test_env
test_alert_on_high_cpu
cleanup

The metrics collector from Part 1, with its stub functions filled in:

#!/bin/bash
# Example: Implementing a simple metrics collector

source lib/logging.sh
source lib/validation.sh

# Configuration
METRICS_DB="${METRICS_DB:-/var/lib/myapp/metrics.db}"
COLLECTION_INTERVAL="${COLLECTION_INTERVAL:-60}"

# Ensure database exists
init_database() {
  if [[ ! -f "$METRICS_DB" ]]; then
    log_info "Initializing metrics database"
    sqlite3 "$METRICS_DB" << 'EOF'
CREATE TABLE IF NOT EXISTS cpu_metrics (
  timestamp INTEGER PRIMARY KEY,
  usage_percent REAL,
  user_time INTEGER,
  system_time INTEGER
);

CREATE TABLE IF NOT EXISTS memory_metrics (
  timestamp INTEGER PRIMARY KEY,
  total_mb INTEGER,
  used_mb INTEGER,
  free_mb INTEGER,
  usage_percent REAL
);

CREATE INDEX idx_cpu_timestamp ON cpu_metrics(timestamp);
CREATE INDEX idx_memory_timestamp ON memory_metrics(timestamp);
EOF
  fi
}

# Collect CPU metrics
collect_cpu_metrics() {
  local timestamp=$(date +%s)
  
  # Read /proc/stat (simplified: these counters are cumulative since
  # boot, so this yields average usage since boot, not an instant reading)
  local cpu_info=$(grep "^cpu " /proc/stat)
  
  # Parse values (user, nice, system, idle)
  local user=$(echo "$cpu_info" | awk '{print $2}')
  local system=$(echo "$cpu_info" | awk '{print $4}')
  local idle=$(echo "$cpu_info" | awk '{print $5}')
  
  local total=$((user + system + idle))
  local usage=$(awk "BEGIN {printf \"%.1f\", (($user + $system) / $total) * 100}")
  
  # Store in database
  sqlite3 "$METRICS_DB" \
    "INSERT INTO cpu_metrics VALUES ($timestamp, $usage, $user, $system);"
  
  log_info "CPU metric recorded: ${usage}%"
}

# Collect memory metrics
collect_memory_metrics() {
  local timestamp=$(date +%s)
  
  # Read /proc/meminfo
  local meminfo=$(cat /proc/meminfo)
  
  local total=$(echo "$meminfo" | grep "MemTotal" | awk '{print int($2/1024)}')
  local free=$(echo "$meminfo" | grep "MemFree" | awk '{print int($2/1024)}')
  local used=$((total - free))
  local usage=$(awk "BEGIN {printf \"%.1f\", ($used / $total) * 100}")
  
  # Store in database
  sqlite3 "$METRICS_DB" \
    "INSERT INTO memory_metrics VALUES ($timestamp, $total, $used, $free, $usage);"
  
  log_info "Memory metric recorded: ${usage}% ($used/${total}MB)"
}

# Main collection function
main() {
  log_info "Starting metrics collection"
  
  init_database
  collect_cpu_metrics
  collect_memory_metrics
  
  log_info "Metrics collection complete"
}

main "$@"

15.2.6. Modular Library Design#

Extract reusable functions into libraries to avoid code duplication:

15.2.6.1. lib/logging.sh#

#!/bin/bash
# Logging utility functions

LOG_FILE="${LOG_FILE:-/var/log/myapp/app.log}"
LOG_LEVEL="${LOG_LEVEL:-INFO}"
DEBUG_MODE="${DEBUG_MODE:-0}"

# Log levels: DEBUG=0, INFO=1, WARNING=2, ERROR=3
declare -A LOG_LEVELS=([DEBUG]=0 [INFO]=1 [WARNING]=2 [ERROR]=3)

log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  
  # Check if message should be logged based on level
  if [[ "${LOG_LEVELS[$level]}" -ge "${LOG_LEVELS[$LOG_LEVEL]}" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
    
    # Also print to stderr if DEBUG mode
    if [[ "$DEBUG_MODE" == "1" ]] && [[ "$level" != "DEBUG" ]]; then
      echo "[$timestamp] [$level] $message" >&2
    fi
  fi
}

log_debug() { [[ "$DEBUG_MODE" == "1" ]] && log "DEBUG" "$@"; }
log_info() { log "INFO" "$@"; }
log_warning() { log "WARNING" "$@"; }
log_error() { log "ERROR" "$@"; }

log_function_entry() {
  local func="$1"
  shift
  [[ "$DEBUG_MODE" == "1" ]] && log_debug "→ Entering $func with args: $*"
}

log_function_exit() {
  local func="$1"
  local result="$2"
  [[ "$DEBUG_MODE" == "1" ]] && log_debug "← Exiting $func (result: $result)"
}

15.2.6.2. lib/validation.sh#

#!/bin/bash
# Input validation functions

validate_not_empty() {
  local var_name="$1"
  local var_value="$2"
  
  if [[ -z "$var_value" ]]; then
    log_error "$var_name cannot be empty"
    return 1
  fi
}

validate_directory() {
  local dir="$1"
  
  if [[ ! -d "$dir" ]]; then
    log_error "Directory not found: $dir"
    return 1
  fi
  
  if [[ ! -w "$dir" ]]; then
    log_error "Directory not writable: $dir"
    return 1
  fi
}

validate_file_readable() {
  local file="$1"
  
  if [[ ! -r "$file" ]]; then
    log_error "File not readable: $file"
    return 1
  fi
}

validate_command_exists() {
  local cmd="$1"
  
  if ! command -v "$cmd" &> /dev/null; then
    log_error "Required command not found: $cmd"
    return 1
  fi
}

validate_integer() {
  local var_name="$1"
  local var_value="$2"
  
  if ! [[ "$var_value" =~ ^[0-9]+$ ]]; then
    log_error "$var_name must be an integer, got: $var_value"
    return 1
  fi
}

15.2.6.3. lib/arrays.sh#

#!/bin/bash
# Array utility functions

# Check if array contains element
array_contains() {
  local -n arr=$1
  local value="$2"
  
  for element in "${arr[@]}"; do
    [[ "$element" == "$value" ]] && return 0
  done
  return 1
}

# Get array length
array_length() {
  local -n arr=$1
  echo "${#arr[@]}"
}

# Filter array based on condition
array_filter() {
  local -n arr=$1
  local pattern="$2"
  local -a result=()
  
  for element in "${arr[@]}"; do
    if [[ "$element" =~ $pattern ]]; then
      result+=("$element")
    fi
  done
  
  printf '%s\n' "${result[@]}"
}

# Join array elements
array_join() {
  local separator="$1"
  shift
  local -a arr=("$@")
  
  local result=""
  for ((i=0; i<${#arr[@]}; i++)); do
    result+="${arr[$i]}"
    [[ $i -lt $((${#arr[@]}-1)) ]] && result+="$separator"
  done
  
  echo "$result"
}

# Sort array
array_sort() {
  local -n arr=$1
  mapfile -t arr < <(printf '%s\n' "${arr[@]}" | sort)
}
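A quick demonstration of the array helpers (the two functions are repeated from lib/arrays.sh above so the snippet runs standalone; namerefs require bash 4.3+):

```shell
#!/bin/bash
# Demo of array_contains and array_join from lib/arrays.sh.

array_contains() {
  local -n arr=$1
  local value="$2"
  for element in "${arr[@]}"; do
    [[ "$element" == "$value" ]] && return 0
  done
  return 1
}

array_join() {
  local separator="$1"
  shift
  local -a arr=("$@")
  local result=""
  for ((i=0; i<${#arr[@]}; i++)); do
    result+="${arr[$i]}"
    [[ $i -lt $((${#arr[@]}-1)) ]] && result+="$separator"
  done
  echo "$result"
}

hosts=(web01 web02 db01)

# Membership test: pass the array NAME, not its expansion
if array_contains hosts "db01"; then
  echo "db01 is monitored"
fi

# Join takes a separator followed by the elements themselves
csv=$(array_join "," "${hosts[@]}")
echo "$csv"
```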

15.2.7. Data Flow and Integration#

Your main script orchestrates the flow:

#!/bin/bash
# main.sh - Orchestrates the monitoring system

source lib/logging.sh
source lib/validation.sh
source lib/arrays.sh

# Load configuration
source config/app.conf

# Initialize
initialize_system() {
  log_info "Starting system initialization"
  
  validate_directory "$LOG_PATH"
  validate_directory "$DATA_PATH"
  validate_command_exists "sqlite3"
  
  log_info "System initialized successfully"
}

# Main execution
main() {
  initialize_system
  
  log_info "Collecting metrics..."
  ./src/metrics-collector.sh
  
  log_info "Aggregating logs..."
  ./src/log-aggregator.sh
  
  log_info "Evaluating alerts..."
  ./src/alert-engine.sh
  
  log_info "Generating reports..."
  ./src/report-generator.sh
  
  log_info "Cycle complete"
}

main "$@"