10.5. Lab: System Monitoring

10.5.1. Lab Exercise 6: Resource Usage Reporter

Create a detailed reporting tool for system administrators.

10.5.1.1. Requirements

Write resource_report.sh that:

Accepts optional time range (today/week/month/custom)
Collects and aggregates resource usage data
Generates comprehensive multi-section report:
  - System overview (uptime, kernel, cores)
  - Peak resource times
  - Average resource usage
  - Top processes (by CPU and memory)
  - Disk usage breakdown
  - Network statistics (if available)
  - Process lifecycle events (crashed services, restarts)
Exports to multiple formats: text, CSV, HTML, JSON
Includes trend analysis with recommendations
Can email report to admin
Can archive reports for audit trail

10.5.1.2. Report Sections

System Resource Report
Date Range: 2024-01-01 to 2024-01-15
Generated: 2024-01-15 14:32:45

1. SYSTEM OVERVIEW
   Uptime: 45 days
   Kernel: Linux 5.15.0
   CPUs: 4 cores
   Memory: 16GB
   Root: 100GB

2. RESOURCE USAGE SUMMARY
   Avg CPU load: 1.8 (45% util)
   Peak CPU load: 3.9 (98% util) - Jan 13, 14:30
   Avg Memory: 11.2GB (70%)
   Peak Memory: 15.3GB (95%) - Jan 14, 10:15
   Avg Disk: 450GB / 500GB (90%)

3. TOP PROCESSES (by peak CPU)
   chrome (18% avg, 45% peak)
   java (12% avg, 38% peak)
   
4. RECOMMENDATIONS
   - Monitor java process (erratic CPU)
   - Clean logs (/var is at 92%)
   - Consider memory upgrade (frequent >85% states)

10.5.1.3. Hints

Read from system logs or collect data periodically
Parse /proc for accurate process info
Use arrays to track historical data
Format output with awk, sed, printf
Implement multiple export formats
Add email capability with mail command
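
One possible way to handle the multiple export formats is to keep every metric as a name/value pair and let a single function choose the layout. The sketch below is only an illustration: the function name `emit_metric` and its three-argument form are assumptions, not part of the lab specification, and HTML output is left out for brevity.

```shell
#!/usr/bin/env bash
# emit_metric FORMAT NAME VALUE
# Hypothetical helper: prints one metric in the requested format.
# Values are not escaped, so commas or quotes in VALUE would need extra care.
emit_metric() {
    local format=$1 name=$2 value=$3
    case "$format" in
        text) printf '%-20s %s\n' "$name:" "$value" ;;
        csv)  printf '%s,%s\n' "$name" "$value" ;;
        json) printf '{"metric": "%s", "value": "%s"}\n' "$name" "$value" ;;
        *)    printf 'unknown format: %s\n' "$format" >&2; return 1 ;;
    esac
}

emit_metric text avg_cpu_load 1.8
emit_metric csv  avg_cpu_load 1.8   # prints: avg_cpu_load,1.8
emit_metric json avg_cpu_load 1.8   # prints: {"metric": "avg_cpu_load", "value": "1.8"}
```

Keeping the format decision in one place means the data collection code does not need to know how the report will be rendered.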

10.5.1.4. Testing

bash resource_report.sh --period week --format html --output report.html
bash resource_report.sh --period today --email admin@example.com
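
The flags in the test commands above could be parsed with a plain `while`/`case` loop. This is a minimal sketch under stated assumptions: the function name `parse_args`, the variable names, and the defaults (`today`, `text`, standard output) are invented for illustration and are not prescribed by the exercise.

```shell
#!/usr/bin/env bash
# parse_args: hypothetical option parser for resource_report.sh.
# Every option here takes exactly one value.
parse_args() {
    period=today format=text output=/dev/stdout email=
    while [ $# -gt 0 ]; do
        if [ $# -lt 2 ]; then
            printf 'missing value for %s\n' "$1" >&2
            return 1
        fi
        case "$1" in
            --period) period=$2 ;;
            --format) format=$2 ;;
            --output) output=$2 ;;
            --email)  email=$2 ;;
            *) printf 'unknown option: %s\n' "$1" >&2; return 1 ;;
        esac
        shift 2
    done
    # Reject periods outside the set named in the requirements.
    case "$period" in
        today|week|month|custom) ;;
        *) printf 'invalid period: %s\n' "$period" >&2; return 1 ;;
    esac
}

parse_args --period week --format html --output report.html
printf 'period=%s format=%s output=%s\n' "$period" "$format" "$output"
```

Checking the argument count before reading `$2` avoids a loop that never ends when the last option is given without a value.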

10.5.2. Lab Exercise 5: Performance Analyzer

Create a tool that identifies system bottlenecks.

10.5.2.1. Requirements

Write perf_analyzer.sh that:

Takes a sample period in seconds (e.g., 60)
Collects CPU, memory, I/O, and process data
Identifies the bottleneck type:
  - CPU-bound: High load, low idle CPU
  - Memory-bound: High memory usage, swap usage
  - I/O-bound: High disk utilization
  - Healthy: All resources normal
Shows which processes contribute most to bottleneck
Provides optimization suggestions
Can compare before/after performance
Outputs detailed report with graphs (ASCII)
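
The bottleneck types listed above can be told apart with a chain of threshold checks. The sketch below assumes integer percentages and uses illustrative cut-offs (80% CPU, 85% memory with any swap in use, 70% disk utilization); both the function name `classify_bottleneck` and these numbers are assumptions to be tuned, not values given by the lab.

```shell
#!/usr/bin/env bash
# classify_bottleneck CPU_PCT MEM_PCT SWAP_USED_MB IO_PCT
# Hypothetical classifier; all arguments are integers.
# The order of the checks decides which type wins when several apply.
classify_bottleneck() {
    local cpu=$1 mem=$2 swap=$3 io=$4
    if   [ "$cpu" -ge 80 ]; then echo "CPU-BOUND"
    elif [ "$mem" -ge 85 ] && [ "$swap" -gt 0 ]; then echo "MEMORY-BOUND"
    elif [ "$io" -ge 70 ]; then echo "IO-BOUND"
    else echo "HEALTHY"
    fi
}

classify_bottleneck 80 82 0 45   # prints: CPU-BOUND
```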

10.5.2.2. Example Output

PERFORMANCE ANALYSIS (60-second sample)
Started: 2024-01-15 14:30:00
Ended:   2024-01-15 14:31:00

System Resources:
  CPU Load: 3.2 (4 cores = 80% utilized) ↗ Increasing
  Memory:   13.2GB / 16GB (82.5%) ↗ High
  Disk I/O: 45% utilization (moderate)

🔴 BOTTLENECK: CPU-BOUND

Top CPU consumers:
  1. gcc      (38%) - Compiling source code
  2. ffmpeg   (28%) - Video encoding
  3. chrome   (18%) - Browser

Recommendations:
  - Reduce concurrent builds
  - Move transcoding to off-peak hours
  - Close unnecessary browser tabs
  - Consider caching build results or distributing compilation to other machines

Load trend:
  14:30 ▁▂▃▄▄▅ 14:31  (steady increase)

10.5.2.3. Hints

Collect samples at regular intervals
Use uptime, free, iostat, ps
Analyze bottleneck using comparative thresholds
Create ASCII graphs with awk
Save data for trend analysis
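
As a rough illustration of the awk hint, the following sketch draws a horizontal bar per sample instead of the block-character sparkline shown in the example output. The function name `ascii_bars`, the "LABEL VALUE" input layout, and the scale of five `#` characters per load unit are all assumptions.

```shell
#!/usr/bin/env bash
# ascii_bars: reads "LABEL VALUE" lines on stdin, draws one bar per line.
# Scale (assumed): 5 '#' characters per unit of load, rounded to nearest.
ascii_bars() {
    awk '{
        n = int($2 * 5 + 0.5)
        bar = ""
        for (i = 0; i < n; i++) bar = bar "#"
        printf "%-6s %-20s %s\n", $1, bar, $2
    }'
}

printf '%s\n' "14:30 1.0" "14:31 2.4" "14:32 3.2" | ascii_bars
```

With the sample input, the three bars are 5, 12, and 16 characters long, which makes the rising trend visible at a glance.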

10.5.2.4. Testing

bash perf_analyzer.sh 60

10.5.3. Lab Exercise 4: Memory Alert System

Create a memory monitoring daemon with adaptive thresholds.

10.5.3.1. Requirements

Write memory_alert.sh that:

Monitors system memory usage continuously
Identifies memory-hungry processes
Sends alerts when memory pressure builds
Suggests processes to stop or kill (worst offenders)
Tracks memory trends (is it getting worse?)
Generates hourly reports
Has tunable warning thresholds (defaults: 70%, 80%, 90%, 95%)
Can log to syslog or email alerts to admin

10.5.3.2. Alert Levels

70-80%:   Yellow warning (memory creeping up)
80-90%:   Orange caution (memory very high)
90-95%:   Red critical (system struggling)
95%+:     Emergency (immediate action needed)
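
The levels above map naturally onto a descending chain of comparisons. In this sketch the function name `alert_level` and the `OK` label for usage below 70% are assumptions; it also assumes that a value sitting exactly on a shared boundary (80 or 90) belongs to the higher level.

```shell
#!/usr/bin/env bash
# alert_level PCT
# Hypothetical mapping from an integer usage percentage to an alert label.
alert_level() {
    local pct=$1
    if   [ "$pct" -ge 95 ]; then echo "EMERGENCY"
    elif [ "$pct" -ge 90 ]; then echo "CRITICAL"
    elif [ "$pct" -ge 80 ]; then echo "CAUTION"
    elif [ "$pct" -ge 70 ]; then echo "WARNING"
    else echo "OK"
    fi
}

alert_level 88   # prints: CAUTION
```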

10.5.3.3. Example Report

MEMORY ALERT - 2024-01-15 14:32:45
Current usage: 14.2GB / 16GB (88.8%) 🟠 CAUTION

Top 5 memory consumers:
  1. chrome     (2.5GB, 15.6%) - Browser can be restarted
  2. java       (2.1GB, 13.1%) - Application restart needed
  3. postgres   (1.8GB, 11.3%) - Database (critical!)
  4. apache2    (1.3GB, 8.1%)  - Web server
  5. nodejs     (0.9GB, 5.6%)  - Service process

Trend: +200MB in last hour (↗ increasing)

Recommendation: Restart non-critical services or add RAM

10.5.3.4. Hints

Use free for total memory
Use ps for process memory
Calculate trends from /proc/meminfo over time
Implement configurable thresholds
Log to file for trend analysis
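
A usage percentage can be derived from `/proc/meminfo` with a few lines of awk. The sketch below treats "used" as `MemTotal` minus `MemAvailable`, which is one common convention and an assumption here; the function name `mem_used_pct` is also hypothetical. It reads from standard input so it can be tried with sample data on any system.

```shell
#!/usr/bin/env bash
# mem_used_pct: reads /proc/meminfo-style text on stdin and prints the
# used percentage, taking used = MemTotal - MemAvailable (assumed convention).
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }'
}

# On Linux the real data would come from: mem_used_pct < /proc/meminfo
printf 'MemTotal: 16000000 kB\nMemAvailable: 4000000 kB\n' | mem_used_pct   # prints: 75.0
```

Appending each reading with a timestamp to a log file gives the history needed for the trend calculation.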

10.5.3.5. Testing

bash memory_alert.sh --threshold 70

10.5.4. Lab Exercise 3: Disk Space Monitor

Create a monitoring service that tracks disk usage and alerts.

10.5.4.1. Requirements

Write disk_monitor.sh that:

Monitors all mounted filesystems
Tracks usage percentage over time (stores in log file)
Alerts when usage exceeds thresholds:
  - Yellow warning at 75%
  - Red critical at 90%
Shows usage trends (increasing/stable/decreasing)
Identifies largest directories on full filesystems
Generates daily summary report
Can run continuously (background) or on-demand
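
The trend requirement can be reduced to comparing two stored readings. In the sketch below, the function name `usage_trend` and the one-point tolerance that counts as "stable" are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# usage_trend PREV_PCT CUR_PCT [TOLERANCE]
# Hypothetical helper: compares two integer usage percentages.
# Changes within +/- TOLERANCE points (default 1) are reported as stable.
usage_trend() {
    local prev=$1 cur=$2 tolerance=${3:-1}
    local delta=$(( cur - prev ))
    if   [ "$delta" -gt "$tolerance" ]; then echo "increasing"
    elif [ "$delta" -lt "$(( -tolerance ))" ]; then echo "decreasing"
    else echo "stable"
    fi
}

usage_trend 71 78   # prints: increasing
```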

10.5.4.2. Example Alert Output

DISK MONITOR ALERT
Generated: 2024-01-15 14:32:45

🟡 WARNING: /home at 78% (was 71% yesterday - trending up)
  Largest directories:
  1. alice/      - 450GB
  2. backup/     - 280GB
  3. shared/     - 200GB
  
  Recommendation: Clean old files or add storage

🔴 CRITICAL: /var at 92% (was 89% - rapidly increasing)
  Largest directories:
  1. log/        - 145GB
  
  Recommendation: Immediately archive or delete old logs

10.5.4.3. Hints

Use df -h for filesystem info
Store daily readings in a log file (/tmp works for testing, but it is typically cleared on reboot)
Calculate trends from historical data
Find large dirs with du -sh
Use conditional logic for thresholds
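
For the threshold logic, parsing is easier with `df -P` than with `df -h`, because the POSIX format keeps each filesystem on one line. The following sketch assumes a hypothetical function name `check_disk` and mount points without spaces; the default thresholds are the 75% and 90% values from the requirements.

```shell
#!/usr/bin/env bash
# check_disk [WARN_PCT] [CRIT_PCT]
# Reads `df -P` output on stdin and prints one line per filesystem that is
# at or above the warning threshold. Assumes mount points contain no spaces.
check_disk() {
    local warn=${1:-75} crit=${2:-90}
    awk -v warn="$warn" -v crit="$crit" 'NR > 1 {
        pct = $5; sub(/%/, "", pct); pct += 0
        if      (pct >= crit) printf "CRITICAL: %s at %d%%\n", $6, pct
        else if (pct >= warn) printf "WARNING: %s at %d%%\n", $6, pct
    }'
}

# Real use would be: df -P | check_disk
check_disk <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 100000 45000 55000 45% /
/dev/sda2 500000 390000 110000 78% /home
/dev/sda3 50000 46000 4000 92% /var
EOF
```

With the sample input this reports a warning for `/home` at 78% and a critical alert for `/var` at 92%.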

10.5.4.4. Testing

# Test on-demand
bash disk_monitor.sh

# Test background monitoring
bash disk_monitor.sh &
sleep 10 && kill "$!"   # $! is the PID of the background job; killall would not match a script started via bash

10.5.5. Lab Exercise 2: Process Killer Utility

Create an interactive process selection and termination tool.

10.5.5.1. Requirements

Write kill_menu.sh that:

Lists all user processes (exclude system processes)
Shows PID, user, CPU%, memory%, and command name
Lets user select processes by number
Offers graceful terminate (SIGTERM) or force kill (SIGKILL)
Confirms before killing
Reports success or failure
Repeats until user chooses to exit

10.5.5.2. Example Session

Select processes to manage:
[1] firefox    (PID: 2345, CPU: 12%, Mem: 1.5GB)
[2] node       (PID: 3456, CPU: 8%, Mem: 512MB)
[3] python     (PID: 4567, CPU: 2%, Mem: 256MB)

Enter process numbers (space-separated) or 'q' to quit: 1 3

You selected:
  - firefox (PID 2345)
  - python (PID 4567)

Kill type: (g)raceful, (f)orce, (c)ancel: g

Sending SIGTERM to:
  ✓ firefox (2345) - terminating
  ✓ python (4567) - terminating

Continue? (y/n): n

10.5.5.3. Hints

Use ps aux --sort=-%mem for listings
Prompt user for selection
Use read for interactive input
Check if process exists before killing
Handle errors gracefully
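
The "check before killing" hint can be implemented with `kill -0`, which tests whether a signal could be delivered without sending one. The helper below is a sketch: the name `terminate_pid` and its messages are assumptions, and note that `kill -0` also fails when the process exists but belongs to another user.

```shell
#!/usr/bin/env bash
# terminate_pid PID [SIGNAL]
# Hypothetical helper: verifies the process is reachable, then signals it.
# SIGNAL is a name without the SIG prefix (default TERM; use KILL to force).
terminate_pid() {
    local pid=$1 sig=${2:-TERM}
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "PID $pid: no such process (or no permission)"
        return 1
    fi
    if kill -s "$sig" "$pid" 2>/dev/null; then
        echo "PID $pid: sent SIG$sig"
    else
        echo "PID $pid: failed to send SIG$sig"
        return 1
    fi
}

sleep 60 &                 # throwaway process to practise on
terminate_pid "$!" TERM
```

Returning a non-zero status on failure lets the menu loop report success or failure for each selected process.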

10.5.5.4. Testing

bash kill_menu.sh
# Test various inputs, verify kills work

10.5.6. Lab Exercise 1: System Health Check Script

Create a script that generates a comprehensive system health report.

10.5.6.1. Requirements

Write system_health.sh that outputs:

Timestamp of report
System uptime and load average
CPU core count and current load percentage
Memory usage (total, used, free, percentage)
Top 3 processes by CPU usage
Top 3 processes by memory usage
Disk usage for all mounted filesystems (> 1GB)
Number of running, sleeping, stopped, and zombie processes
Alert if any resource exceeds 80% threshold
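
Several of the items above need a percentage and a comparison against the 80% threshold. Shell arithmetic is integer-only, so this sketch delegates both steps to awk. The helper names `pct_of` and `exceeds` are assumptions, and the sample figures are made up.

```shell
#!/usr/bin/env bash
# pct_of USED TOTAL: prints USED as a percentage of TOTAL, one decimal.
pct_of() {
    awk -v u="$1" -v t="$2" 'BEGIN { printf "%.1f\n", u * 100 / t }'
}

# exceeds PCT [THRESHOLD]: succeeds (exit 0) if PCT is above THRESHOLD
# (default 80). Works with decimal values, unlike [ ... -gt ... ].
exceeds() {
    awk -v p="$1" -v t="${2:-80}" 'BEGIN { exit !(p > t) }'
}

mem=$(pct_of 12.3 16)
if exceeds "$mem"; then
    echo "ALERT: memory at ${mem}%"
else
    echo "memory OK at ${mem}%"
fi
```

The same two helpers can be reused for CPU load (load average against core count) and for each filesystem.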

10.5.6.2. Example Output

=== System Health Report ===
Generated: 2024-01-15 14:32:45
Uptime: 45 days, 3:12

CPU: 4 cores, Load average: 1.23 (30.8% utilized)

Memory: 16GB total | 13.1GB used (81.9%) | 2.9GB free
⚠ Memory usage > 80%

Top CPU consumers:
  1. firefox (12.3%) - 2048MB
  2. chrome (8.5%) - 1536MB
  3. node (5.2%) - 512MB

Disk Usage:
  /     : 45.2GB / 100GB (45.2%)
  /var  : 23.1GB / 50GB (46.2%)
  /home : 89.5GB / 500GB (17.9%)

Processes: Running: 185 | Sleeping: 342 | Stopped: 0 | Zombie: 0

10.5.6.3. Hints

Use uptime, uname, df, ps commands
Format output nicely with printf or awk
Use /proc filesystem for detailed stats
Add color to alerts with ANSI escape codes (via printf or awk)
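
For the process-state counts, one approach is to feed the state column of `ps` into awk and tally by first letter. In this sketch the function name `count_states` is hypothetical, and counting `D` (uninterruptible sleep) and `I` (idle) together with `S` as "sleeping" is an assumption about how the report should group them.

```shell
#!/usr/bin/env bash
# count_states: reads one ps state code per line on stdin (e.g. "Ss", "R+")
# and tallies by the first letter. D and I are grouped with S (assumption).
count_states() {
    awk '{ s = substr($1, 1, 1); n[s]++ }
         END { printf "Running: %d | Sleeping: %d | Stopped: %d | Zombie: %d\n",
               n["R"], n["S"] + n["D"] + n["I"], n["T"], n["Z"] }'
}

# Real use would be: ps -eo stat= | count_states
printf '%s\n' R S Ss S+ D T Z | count_states
# prints: Running: 1 | Sleeping: 4 | Stopped: 1 | Zombie: 1
```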

10.5.6.4. Testing

bash system_health.sh

Verify that:

All values update correctly
Alerts appear when thresholds exceeded
Output is well-formatted