10.5. Lab: System Monitoring#
10.5.1. Lab Exercise 6: Resource Usage Reporter#
Create a detailed reporting tool for system administrators.
10.5.1.1. Requirements#
Write resource_report.sh that:
Accepts optional time range (today/week/month/custom)
Collects and aggregates resource usage data
Generates comprehensive multi-section report:
System overview (uptime, kernel, cores)
Peak resource times
Average resource usage
Top processes (by CPU and memory)
Disk usage breakdown
Network statistics (if available)
Process lifecycle events (crashed services, restarts)
Exports to multiple formats: text, CSV, HTML, JSON
Includes trend analysis with recommendations
Can email report to admin
Can archive reports for audit trail
10.5.1.2. Report Sections#
System Resource Report
Date Range: 2024-01-01 to 2024-01-15
Generated: 2024-01-15 14:32:45
1. SYSTEM OVERVIEW
Uptime: 45 days
Kernel: Linux 5.15.0
CPUs: 4 cores
Memory: 16GB
Root: 100GB
2. RESOURCE USAGE SUMMARY
Avg CPU load: 1.8 (45% util)
Peak CPU load: 3.9 (98% util) - Jan 13, 14:30
Avg Memory: 11.2GB (70%)
Peak Memory: 15.3GB (95%) - Jan 14, 10:15
Avg Disk: 450GB / 500GB (90%)
3. TOP PROCESSES (by peak CPU)
chrome (18% avg, 45% peak)
java (12% avg, 38% peak)
4. RECOMMENDATIONS
- Monitor java process (erratic CPU)
- Clean logs (/var is at 92%)
- Consider memory upgrade (frequent >85% states)
10.5.1.3. Hints#
Read from system logs or collect data periodically
Parse /proc for accurate process info
Use arrays to track historical data
Format output with awk, sed, printf
Implement multiple export formats
Add email capability with the mail command
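One way to approach the export hint is to keep the collected metrics in a single intermediate form and write one small emitter per output format. The sketch below assumes records arrive on stdin as key=value lines; that record layout and the function names emit_text, emit_csv, and emit_json are illustrative choices, not part of the lab specification. The JSON emitter also assumes keys and values contain no double quotes or backslashes.

```bash
# Sketch: render the same "key=value" records in several formats.
# Record layout and function names are illustrative assumptions.

# Text: aligned "key: value" lines via printf.
emit_text() {
    while IFS='=' read -r key value; do
        printf '%-12s %s\n' "$key:" "$value"
    done
}

# CSV: a header line, then one row per record.
emit_csv() {
    printf 'metric,value\n'
    while IFS='=' read -r key value; do
        printf '%s,%s\n' "$key" "$value"
    done
}

# JSON: a flat object of strings.
# Assumes keys and values contain no double quotes or backslashes.
emit_json() {
    local sep='{'
    while IFS='=' read -r key value; do
        printf '%s"%s": "%s"' "$sep" "$key" "$value"
        sep=', '
    done
    # Empty input still yields a valid (empty) object.
    if [ "$sep" = '{' ]; then printf '{}\n'; else printf '}\n'; fi
}
```

For example, printf 'cpu_avg=45\nmem_avg=70\n' | emit_csv prints a two-row CSV table. An HTML emitter could follow the same pattern.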
10.5.1.4. Testing#
bash resource_report.sh --period week --format html --output report.html
bash resource_report.sh --period today --email admin@example.com
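The test commands above pass long options to the script. A minimal sketch of parsing them with a while/case loop follows; the default values (today, text, stdout) and the function name parse_args are assumptions made for illustration.

```bash
# Sketch: parse the options used in the test commands.
# Defaults are assumptions, not part of the lab specification.
parse_args() {
    PERIOD=today FORMAT=text OUTPUT=- EMAIL=
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --period|--format|--output|--email)
                if [ "$#" -lt 2 ]; then
                    echo "error: $1 needs a value" >&2
                    return 2
                fi
                case "$1" in
                    --period) PERIOD=$2 ;;
                    --format) FORMAT=$2 ;;
                    --output) OUTPUT=$2 ;;
                    --email)  EMAIL=$2 ;;
                esac
                shift 2 ;;
            *)
                echo "error: unknown option: $1" >&2
                return 2 ;;
        esac
    done
    # Reject values outside the sets named in the requirements.
    case "$PERIOD" in
        today|week|month|custom) ;;
        *) echo "error: invalid period: $PERIOD" >&2; return 2 ;;
    esac
    case "$FORMAT" in
        text|csv|html|json) ;;
        *) echo "error: invalid format: $FORMAT" >&2; return 2 ;;
    esac
}
```

Calling parse_args "$@" near the top of the script leaves the results in PERIOD, FORMAT, OUTPUT, and EMAIL. Validating early means a typo such as --format pdf fails before any data is collected.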
10.5.2. Lab Exercise 5: Performance Analyzer#
Create a tool that identifies system bottlenecks.
10.5.2.1. Requirements#
Write perf_analyzer.sh that:
Takes a sample period in seconds (e.g., 60)
Collects CPU, memory, I/O, and process data
Identifies the bottleneck type:
CPU-bound: High load, low idle CPU
Memory-bound: High memory usage, swap usage
I/O-bound: High disk utilization
Healthy: All resources normal
Shows which processes contribute most to the bottleneck
Provides optimization suggestions
Can compare before/after performance
Outputs detailed report with graphs (ASCII)
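The bottleneck types listed above can be turned into a small decision function. The sketch below is one possible reading: every threshold in it (75% load per core, 20% idle, 90% memory, 80% I/O utilization) is an assumed starting point, and the name classify_bottleneck is hypothetical. awk does the comparisons because the inputs may be fractional.

```bash
# Sketch: map sampled metrics to one of the four bottleneck types.
# All thresholds are assumed starting points; tune them per system.
# Usage: classify_bottleneck LOAD CORES IDLE_PCT MEM_PCT SWAP_KB IO_UTIL_PCT
classify_bottleneck() {
    awk -v load="$1" -v cores="$2" -v idle="$3" \
        -v mem="$4" -v swap="$5" -v io="$6" 'BEGIN {
        # The first matching rule wins, so CPU is checked before memory and I/O.
        if (load / cores >= 0.75 && idle < 20) print "CPU-BOUND"
        else if (mem >= 90 && swap > 0)        print "MEMORY-BOUND"
        else if (io >= 80)                     print "IO-BOUND"
        else                                   print "HEALTHY"
    }'
}
```

With a load of 3.2 on 4 cores and 10% idle CPU, classify_bottleneck 3.2 4 10 82.5 0 45 reports CPU-BOUND. A fuller tool might report every rule that matches rather than only the first.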
10.5.2.2. Example Output#
PERFORMANCE ANALYSIS (60-second sample)
Started: 2024-01-15 14:30:00
Ended: 2024-01-15 14:31:00
System Resources:
CPU Load: 3.2 (4 cores = 80% utilized) ↗ Increasing
Memory: 13.2GB / 16GB (82.5%) ↗ High
Disk I/O: 45% utilization (moderate)
🔴 BOTTLENECK: CPU-BOUND
Top CPU consumers:
1. gcc (38%) - Compiling source code
2. ffmpeg (28%) - Video encoding
3. chrome (18%) - Browser
Recommendations:
- Reduce concurrent builds
- Move transcoding to off-peak hours
- Close unnecessary browser tabs
- Consider offloading compilation to another machine or caching build results
Load trend:
14:30 ▁▂▃▄▄▅ 14:31 (steady increase)
10.5.2.3. Hints#
Collect samples at regular intervals
Use uptime, free, iostat, ps
Analyze bottleneck using comparative thresholds
Create ASCII graphs with awk
Save data for trend analysis
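As a sketch of the awk graph hint, the function below scales a series of samples onto eight block characters, similar to the load trend line in the example output. The name spark is hypothetical, and the input is assumed to be one numeric sample per line.

```bash
# Sketch: scale a series of samples to eight block characters.
# Reads one numeric sample per line on stdin; prints a single-line graph.
spark() {
    awk '
        BEGIN { n = split("▁ ▂ ▃ ▄ ▅ ▆ ▇ █", bar, " ") }
        {
            v[NR] = $1
            if (NR == 1 || $1 < min) min = $1
            if (NR == 1 || $1 > max) max = $1
        }
        END {
            for (i = 1; i <= NR; i++) {
                # A flat series maps to the lowest bar to avoid dividing by zero.
                idx = (max > min) ? int((v[i] - min) * (n - 1) / (max - min)) + 1 : 1
                printf "%s", bar[idx]
            }
            if (NR) print ""
        }
    '
}
```

Feeding it a saved column of load samples, for example cut -d' ' -f1 on a log of /proc/loadavg readings, gives a one-line trend. The graph is relative: the smallest sample always maps to the lowest bar and the largest to the highest.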
10.5.2.4. Testing#
bash perf_analyzer.sh 60
10.5.3. Lab Exercise 4: Memory Alert System#
Create a memory monitoring daemon with adaptive thresholds.
10.5.3.1. Requirements#
Write memory_alert.sh that:
Monitors system memory usage continuously
Identifies memory-hungry processes
Sends alerts when memory pressure builds
Suggests processes to stop or kill (worst offenders)
Tracks memory trends (is it getting worse?)
Generates hourly reports
Has tunable warning thresholds (default: 70%, 80%, 90%, 95%)
Can log to syslog or email alerts to admin
10.5.3.2. Alert Levels#
70-80%: Yellow warning (memory creeping up)
80-90%: Orange caution (memory very high)
90-95%: Red critical (system struggling)
95%+: Emergency (immediate action needed)
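The alert levels above map naturally onto a chain of comparisons, checked from the highest level down. In the sketch below the function name alert_level and the OK label for values under 70% are assumptions. The listed ranges share endpoints, so the sketch assigns a boundary value to the higher level.

```bash
# Sketch: map an integer usage percentage to the alert levels.
# Boundary values go to the higher level (80 is ORANGE, not YELLOW);
# that tie-break is an assumption, since the listed ranges share endpoints.
alert_level() {
    local pct=$1
    if   [ "$pct" -ge 95 ]; then echo EMERGENCY
    elif [ "$pct" -ge 90 ]; then echo RED
    elif [ "$pct" -ge 80 ]; then echo ORANGE
    elif [ "$pct" -ge 70 ]; then echo YELLOW
    else                         echo OK
    fi
}
```

For the example report's 88.8% usage, truncated to 88, alert_level 88 prints ORANGE. To make the thresholds tunable, replace the literals with variables set from the command line.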
10.5.3.3. Example Report#
MEMORY ALERT - 2024-01-15 14:32:45
Current usage: 14.2GB / 16GB (88.8%) 🟠 CAUTION
Top 5 memory consumers:
1. chrome (2.5GB, 15.6%) - Browser can be restarted
2. java (2.1GB, 13.1%) - Application restart needed
3. postgres (1.8GB, 11.3%) - Database (critical!)
4. apache2 (1.3GB, 8.1%) - Web server
5. nodejs (0.9GB, 5.6%) - Service process
Trend: +200MB in last hour (↗ increasing)
Recommendation: Restart non-critical services or add RAM
10.5.3.4. Hints#
Use free for total memory
Use ps for process memory
Calculate trends from /proc/meminfo over time
Implement configurable thresholds
Log to file for trend analysis
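A single reading for the trend log can be taken directly from /proc/meminfo. The sketch below treats used memory as MemTotal minus MemAvailable, which is one common convention and an assumption here; MemAvailable is only present on reasonably recent Linux kernels. The function name mem_used_pct is hypothetical, and the optional file argument exists so the function can be tried against a saved copy.

```bash
# Sketch: compute used-memory percentage from a meminfo-format file.
# "Used" is taken as MemTotal - MemAvailable (an assumed convention).
# The file argument defaults to /proc/meminfo.
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", (total - avail) * 100 / total
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}
```

Appending a timestamp and the output of mem_used_pct to a log file at a fixed interval gives the history needed to judge whether usage is climbing.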
10.5.3.5. Testing#
bash memory_alert.sh --threshold 70
10.5.4. Lab Exercise 3: Disk Space Monitor#
Create a monitoring service that tracks disk usage and alerts.
10.5.4.1. Requirements#
Write disk_monitor.sh that:
Monitors all mounted filesystems
Tracks usage percentage over time (stores in log file)
Alerts when usage exceeds thresholds:
Yellow warning at 75%
Red critical at 90%
Shows usage trends (increasing/stable/decreasing)
Identifies largest directories on full filesystems
Generates daily summary report
Can run continuously (background) or on-demand
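The trend requirement above reduces to comparing the latest reading with an earlier one. The sketch below labels the change between two integer percentages; the function name usage_trend and the one-point tolerance that counts as stable are assumed defaults.

```bash
# Sketch: label the change between two usage readings (integer percentages).
# The 1-point tolerance that counts as "stable" is an assumed default.
# Usage: usage_trend PREVIOUS CURRENT [TOLERANCE]
usage_trend() {
    local previous=$1 current=$2 tolerance=${3:-1}
    local delta=$((current - previous))
    if   [ "$delta" -gt "$tolerance" ];  then echo increasing
    elif [ "$delta" -lt "-$tolerance" ]; then echo decreasing
    else                                      echo stable
    fi
}
```

For example, usage_trend 71 78 prints increasing. The previous value would typically come from the last line the monitor wrote to its log for that filesystem.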
10.5.4.2. Example Alert Output#
DISK MONITOR ALERT
Generated: 2024-01-15 14:32:45
🟡 WARNING: /home at 78% (was 71% yesterday - trending up)
Largest directories:
1. alice/ - 450GB
2. backup/ - 280GB
3. shared/ - 200GB
Recommendation: Clean old files or add storage
🔴 CRITICAL: /var at 92% (was 89% - rapidly increasing)
Largest directories:
1. log/ - 145GB
Recommendation: Immediately archive or delete old logs
10.5.4.3. Hints#
Use df -h for filesystem info
Store daily readings in /tmp or log file
Calculate trends from historical data
Find large dirs with du -sh
Use conditional logic for thresholds
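The threshold logic can be sketched as an awk filter over df output. The sketch below assumes the POSIX column layout produced by df -P (capacity in field 5, mount point in field 6) rather than df -h, because the fixed layout is easier to parse. It also assumes mount points contain no spaces. The function name check_disks is hypothetical.

```bash
# Sketch: classify each filesystem from `df -P` style input on stdin.
# Thresholds default to the 75% / 90% levels in the requirements.
# Assumes mount points contain no spaces.
# Usage: df -P | check_disks [WARN_PCT] [CRIT_PCT]
check_disks() {
    awk -v warn="${1:-75}" -v crit="${2:-90}" '
        NR == 1 { next }                 # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)           # "78%" -> "78"
            pct += 0
            if (pct >= crit)      level = "CRITICAL"
            else if (pct >= warn) level = "WARNING"
            else                  level = "OK"
            printf "%s %s %d%%\n", level, $6, pct
        }
    '
}
```

Running df -P | check_disks prints one line per filesystem, such as WARNING /home 78%. For any mount point that is not OK, du -sh on its top-level directories can then list the largest ones.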
10.5.4.4. Testing#
# Test on-demand
bash disk_monitor.sh
# Test background monitoring
bash disk_monitor.sh &
sleep 10 && kill "$!"
10.5.5. Lab Exercise 2: Process Killer Utility#
Create an interactive process selection and termination tool.
10.5.5.1. Requirements#
Write kill_menu.sh that:
Lists all user processes (exclude system processes)
Shows PID, user, CPU%, memory%, and command name
Lets user select processes by number
Offers graceful terminate (SIGTERM) or force kill (SIGKILL)
Confirms before killing
Reports success or failure
Repeats until user chooses to exit
10.5.5.2. Example Session#
Select processes to manage:
[1] firefox (PID: 2345, CPU: 12%, Mem: 1.5GB)
[2] node (PID: 3456, CPU: 8%, Mem: 512MB)
[3] python (PID: 4567, CPU: 2%, Mem: 256MB)
Enter process numbers (space-separated) or 'q' to quit: 1 3
You selected:
- firefox (PID 2345)
- python (PID 4567)
Kill type: (g)raceful, (f)orce, (c)ancel: g
Sending SIGTERM to:
✓ firefox (2345) - terminating
✓ python (4567) - terminating
Continue? (y/n): n
10.5.5.3. Hints#
Use ps aux --sort=-%mem for listings
Prompt user for selection
Use read for interactive input
Check if process exists before killing
Handle errors gracefully
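The last two hints can be combined in one helper that validates the PID, checks that the process exists, sends the signal, and reports the result. This is a minimal sketch; the function name terminate_process and its messages are illustrative.

```bash
# Sketch: send a signal only if the target exists, and report the outcome.
# Usage: terminate_process PID [SIGNAL]   (SIGNAL defaults to TERM)
terminate_process() {
    local pid=$1 signal=${2:-TERM}
    # Reject empty or non-numeric input before calling kill.
    case "$pid" in
        ''|*[!0-9]*) echo "invalid PID: $pid" >&2; return 2 ;;
    esac
    # kill -0 sends no signal; it only tests that the PID exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid not found or not permitted" >&2
        return 1
    fi
    if kill -s "$signal" "$pid" 2>/dev/null; then
        echo "sent SIG$signal to $pid"
    else
        echo "failed to signal $pid" >&2
        return 1
    fi
}
```

The graceful option in the menu would call terminate_process "$pid" TERM and the force option terminate_process "$pid" KILL. A successful kill only means the signal was delivered, so a follow-up kill -0 check after a short delay is needed to confirm that the process actually exited.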
10.5.5.4. Testing#
bash kill_menu.sh
# Test various inputs, verify kills work
10.5.6. Lab Exercise 1: System Health Check Script#
Create a script that generates a comprehensive system health report.
10.5.6.1. Requirements#
Write system_health.sh that outputs:
Timestamp of report
System uptime and load average
CPU core count and current load percentage
Memory usage (total, used, free, percentage)
Top 3 processes by CPU usage
Top 3 processes by memory usage
Disk usage for all mounted filesystems (> 1GB)
Number of running, sleeping, stopped, and zombie processes
Alert if any resource exceeds a 75% threshold
10.5.6.2. Example Output#
=== System Health Report ===
Generated: 2024-01-15 14:32:45
Uptime: 45 days, 3:12
CPU: 4 cores, Load average: 1.23 (30.8% utilized)
Memory: 16GB total | 12.3GB used (76.9%) | 3.7GB free
⚠ Memory usage > 75%
Top CPU consumers:
1. firefox (12.3%) - 2048MB
2. chrome (8.5%) - 1536MB
3. node (5.2%) - 512MB
Disk Usage:
/ : 45.2GB / 100GB (45.2%)
/var : 23.1GB / 50GB (46.2%)
/home : 89.5GB / 500GB (17.9%)
Processes: Running: 185 | Sleeping: 342 | Stopped: 0 | Zombie: 0
10.5.6.3. Hints#
Use uptime, uname, df, ps commands
Format output nicely with printf or awk
Use /proc filesystem for detailed stats
Add color codes for alerts (awk with ANSI colors)
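Two of the report values need a small calculation rather than a direct command: the load percentage and the per-state process counts. The helpers below are a sketch with hypothetical names (load_percent, count_states). The process counter assumes input in the form produced by ps -eo stat=, one state string per line, and counts only the four states named in the requirements.

```bash
# Sketch: two helpers for the health report.

# Load as a percentage of available cores, e.g. load 2.0 on 4 cores -> 50.0.
load_percent() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load * 100 / cores }'
}

# Count process states from `ps -eo stat=` style input (one state per line).
# Only the first letter matters: R running, S sleeping, T stopped, Z zombie.
# Other states (for example D or I) are ignored in this sketch.
count_states() {
    awk '
        { n[substr($1, 1, 1)]++ }
        END {
            printf "Running: %d | Sleeping: %d | Stopped: %d | Zombie: %d\n",
                n["R"], n["S"], n["T"], n["Z"]
        }
    '
}
```

On Linux these could be called as load_percent "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" and ps -eo stat= | count_states. The load average measures queued work rather than CPU time, so the percentage can exceed 100 on a busy system.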
10.5.6.4. Testing#
bash system_health.sh
Verify that:
All values update correctly
Alerts appear when thresholds exceeded
Output is well-formatted