13.5. Lab: Remote Sync & Networking#

13.5.1. Lab 6: Complete Disaster Recovery Exercise#

Design and implement a complete disaster recovery procedure with testing.

Requirements:

  • Document complete backup and recovery procedures

  • Create backup of all critical data (configs, databases, application)

  • Test restoration to alternative environment

  • Create recovery runbook with step-by-step instructions

  • Establish RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

  • Practice full recovery procedure from scratch

  • Document recovery time and validate completeness

Recovery Targets:

  • Application data: RPO = 24 hours, RTO = 1 hour

  • Database: RPO = 1 hour, RTO = 30 minutes

  • Configuration: RPO = immediate, RTO = 15 minutes

Backup Components:

Critical Systems:
1. Application code and config
2. Database (daily snapshots)
3. User data (incremental backups)
4. System configuration (/etc)
5. SSL certificates
6. SSH keys and secrets
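The configuration items above (components 4–6) can be captured with a tar-plus-checksum helper. This is a sketch, not the lab's prescribed method; the destination path and helper name are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive a set of config paths and record a checksum
# so the backup can be verified during recovery. Names are illustrative.
backup_configs() {
  local dest="$1"; shift          # e.g. /backups/config-2024-01-15.tar.gz
  tar -czpf "$dest" "$@" || return 1
  sha256sum "$dest" > "${dest}.sha256"
}

# Example (run as root for /etc, SSL certs, etc.):
# backup_configs "/backups/config-$(date +%Y-%m-%d).tar.gz" /etc /etc/ssl
```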

Recovery Procedure:

  1. Verify backup integrity

  2. Provision recovery infrastructure

  3. Restore application code

  4. Restore database to specific point-in-time

  5. Restore user data

  6. Verify all services functional

  7. Run health checks

  8. Document recovery metrics
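Step 1 can be scripted. This sketch assumes each backup directory carries a SHA256SUMS manifest written at backup time (a convention chosen for the example, not part of the lab spec):

```shell
#!/usr/bin/env bash
# Verify backup integrity against a SHA256SUMS manifest created at backup
# time. The manifest convention is an assumption for this sketch.
verify_backup() {
  local backup_dir="$1"
  [ -f "${backup_dir}/SHA256SUMS" ] || { echo "FAIL: no manifest" >&2; return 1; }
  if (cd "$backup_dir" && sha256sum --check --quiet SHA256SUMS); then
    echo "OK: ${backup_dir} verified"
  else
    echo "FAIL: checksum mismatch in ${backup_dir}" >&2
    return 1
  fi
}
```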

Recovery Documentation:

DISASTER RECOVERY RUNBOOK
Application: MyApp
Last Tested: 2024-01-15
Status: PASSED

Recovery Time: 32 minutes (target: 60 min) ✓
Data Loss: 15 minutes (target: 1440 min) ✓

Step-by-step recovery instructions...
Contact: dba@example.com

Validation:

  • Backups are restorable (tested regularly)

  • Recovery completes within RTO

  • No data loss exceeds RPO

  • All services operational after recovery

  • Runbook is current and accurate

Bonus:

  • Implement automated recovery testing

  • Geographic disaster recovery (to different region)

  • Ransomware recovery (air-gap backup testing)

  • Database point-in-time recovery to specific timestamp

  • Application failover testing

13.5.2. Lab 5: SSH Key Management and Access Control#

Create an SSH key management system that securely distributes and rotates SSH keys across servers.

Requirements:

  • Generate ED25519 SSH key pairs for deployment user

  • Securely copy public keys to multiple remote servers

  • Verify key-based authentication works on all servers

  • Implement key rotation (replace old keys with new ones)

  • Audit SSH access (log all connections)

  • Remove/disable old keys safely

  • Document who has access to which servers

Key Management Operations:

  1. Generate new deployment key pair

  2. Copy public key to authorized_keys on all servers

  3. Test authentication on each server

  4. Archive old keys with deletion dates

  5. Disable old keys by removing from authorized_keys

  6. Audit: List all users and their keys per server
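Operation 6 (the per-server key audit) reduces to parsing each server's authorized_keys file; fetching the file over scp first is assumed. A minimal local parser:

```shell
#!/usr/bin/env bash
# List key type and comment for each entry in an authorized_keys file.
# (Entries with option prefixes are not handled in this minimal sketch.)
audit_keys() {
  awk '!/^(#|$)/ { print $1, (NF >= 3 ? $3 : "(no comment)") }' "$1"
}

# Typical use after fetching the file from a server:
# scp "user@${server}:.ssh/authorized_keys" "audit/${server}.keys"
# audit_keys "audit/${server}.keys"
```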

Key Rotation Workflow:

# Step 1: Generate new key (set a passphrase when prompted;
# the validation below expects passphrase-protected keys)
ssh-keygen -t ed25519 -f ~/.ssh/deploy_key_new

# Step 2: Add new public key to all servers
for server in "${SERVERS[@]}"; do
  ssh -i ~/.ssh/deploy_key "user@${server}" \
    'cat >> ~/.ssh/authorized_keys' < ~/.ssh/deploy_key_new.pub
done

# Step 3: Test that the new key works on every server
for server in "${SERVERS[@]}"; do
  ssh -i ~/.ssh/deploy_key_new "user@${server}" true || echo "FAILED: ${server}"
done

# Step 4: Remove the old public key from all servers
for server in "${SERVERS[@]}"; do
  ssh -i ~/.ssh/deploy_key_new "user@${server}" \
    "grep -vF '$(cat ~/.ssh/deploy_key.pub)' ~/.ssh/authorized_keys > /tmp/ak && mv /tmp/ak ~/.ssh/authorized_keys"
done

# Step 5: Archive the old key with its retirement date
mkdir -p ~/.ssh/key-archive && mv ~/.ssh/deploy_key{,.pub} ~/.ssh/key-archive/
date +%Y-%m-%d > ~/.ssh/key-archive/deploy_key.retired

Audit Report:

Access Control Report - 2024-01-15

Server: web1
  deploy key (ed25519): active since 2024-01-01
  backup key (rsa): active since 2023-10-15 ⚠ OLD
  root login: disabled ✓

Server: db1
  deploy key (ed25519): active since 2024-01-01
  backup key (rsa): disabled 2023-12-31
  root login: disabled ✓

Validation:

  • Keys are securely generated (passphrase protected)

  • Authentication works on all servers

  • Rotation completes without downtime

  • Old keys are properly archived

  • Audit trail shows all access

Bonus:

  • Implement certificate-based SSH auth

  • Integration with secret management (HashiCorp Vault)

  • Automated audit reports with expiry warnings

  • Enforce key age limits

13.5.3. Lab 4: Automated Server Health Monitoring#

Build a health monitoring system that periodically checks multiple servers and generates alerts.

Requirements:

  • Monitor CPU, memory, disk usage on 3+ servers

  • Check service status (web server, database, etc.)

  • Monitor log files for errors

  • Generate hourly/daily health reports

  • Alert on threshold violations

  • Track metrics over time for trend analysis

  • Provide dashboard showing current status

Monitored Metrics:

Per Server:
- CPU usage (alert if > 80%)
- Memory usage (alert if > 85%)
- Disk usage (alert if > 90%)
- Load average
- Process count
- Open file descriptors

Services:
- HTTP/HTTPS responsiveness
- SSH accessibility
- Application-specific health checks
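The threshold logic above can be sketched as a small comparison helper; gathering the numbers over ssh is left to the lab, and the local df example is for illustration only:

```shell
#!/usr/bin/env bash
# Compare a metric (integer percent) against its alert threshold.
check_metric() {
  local name="$1" value="$2" threshold="$3"
  if [ "$value" -gt "$threshold" ]; then
    echo "ALERT: ${name} at ${value}% (threshold ${threshold}%)"
    return 1
  fi
  echo "OK: ${name} at ${value}%"
}

# Example: check local root-filesystem usage against the 90% threshold
disk_used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
check_metric "disk" "$disk_used" 90 || true
```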

Alert Actions:

  • Email to admin

  • Log to monitoring database

  • Dashboard indicator

  • Metrics recorded for trend analysis

Health Report Example:

=== Daily Health Report ===
Date: 2024-01-15

Server: web1
  CPU: 42% ✓
  Memory: 65% ✓
  Disk: 72% ✓
  Services: All healthy ✓

Server: db1
  CPU: 78% ✓
  Memory: 88% ⚠ (alert)
  Disk: 45% ✓
  Services: Database OK, Backup failed ✗

Alerts: 1 memory warning, 1 service failure

Validation:

  • Monitoring captures expected metrics

  • Alerts trigger at correct thresholds

  • Reports are generated on schedule

  • Historical data accumulates for trending

Bonus:

  • Implement trend analysis (predict capacity issues)

  • Add predictive alerting (alert before threshold)

  • Create web-based dashboard

  • Integration with external monitoring (Datadog, New Relic)

13.5.4. Lab 3: Data Synchronization with Rsync#

Implement an automated data synchronization system using rsync with verification and conflict resolution.

Requirements:

  • Create daily incremental backup from remote server to local

  • Implement full weekly backups

  • Use checksums to verify file integrity

  • Handle partial transfers gracefully

  • Maintain backup history with date-based directories

  • Implement retention policy (keep 7 full backups, 30 days of daily)

  • Generate sync report with statistics

Backup Strategy:

/backups/
  weekly/
    2024-W03/
    2024-W02/
  daily/
    2024-01-15/
    2024-01-14/
  latest -> symlink to most recent backup
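The retention policy over this layout can be sketched as a prune step (directory names as in the tree above; GNU find and ls are assumed):

```shell
#!/usr/bin/env bash
# Apply the retention policy: 30 days of daily backups, 7 weekly backups.
prune_backups() {
  local root="$1"
  # Daily: delete date-named directories older than 30 days
  find "${root}/daily" -mindepth 1 -maxdepth 1 -type d -mtime +30 \
    -exec rm -rf {} +
  # Weekly: names sort chronologically, so keep only the last 7
  ls -1d "${root}"/weekly/*/ 2>/dev/null | sort | head -n -7 | xargs -r rm -rf
}
```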

Sync Operations:

# Daily incremental (hard-link unchanged files against the previous run)
rsync -avz --checksum --link-dest=/backups/latest \
  remote:/data "/backups/daily/$(date +%Y-%m-%d)"

# Weekly full with verification: preview with --dry-run, then run for real
rsync -avz --checksum --dry-run remote:/data "/backups/weekly/$(date +%G-W%V)"
rsync -avz --checksum remote:/data "/backups/weekly/$(date +%G-W%V)"

Report Output:

Daily Sync Report - 2024-01-15
Source: remote:/data (125.3 MB)
Destination: /backups/daily/2024-01-15

Transfer Stats:
- Files synced: 342
- New files: 23
- Updated files: 45
- Deleted: 2
- Total transferred: 12.5 MB
- Time: 2m 15s
- Status: SUCCESS

Validation:

  • All files transferred correctly (checksum verified)

  • Partial transfers can be resumed

  • Old backups are cleaned up per policy

  • Space usage is reasonable

Bonus:

  • Implement differential backups (vs incremental)

  • Add compression for archived backups

  • Email detailed sync report

  • Implement restore testing (verify backups can be restored)

13.5.5. Lab 2: Multi-Server Configuration Deployment#

Create a deployment script that safely distributes configuration files to multiple servers with rollback capability.

Requirements:

  • Deploy a configuration file (e.g., nginx.conf or app.config) to 3+ servers

  • Validate configuration on remote servers before activation

  • Create automatic backups of existing configs

  • Support rollback to previous configuration

  • Verify successful deployment with service health checks

  • Log all deployment actions

Deployment Process:

  1. Pre-deployment: Check all servers reachable, disk space sufficient

  2. Backup: Archive current configuration on each server

  3. Deploy: Copy new configuration to all servers

  4. Validate: Test configuration syntax on remote servers

  5. Activate: Reload/restart service

  6. Health Check: Verify service is operational

  7. Rollback: On failure, restore previous configuration
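Steps 2 and 7 can be sketched locally as a pair of helpers (in the lab they run on each server over ssh; the .bak naming convention is an assumption):

```shell
#!/usr/bin/env bash
# Step 2: timestamped backup of the current config before deploying
backup_config() {
  local cfg="$1"
  cp -p "$cfg" "${cfg}.bak.$(date +%Y%m%d%H%M%S)"
}

# Step 7: restore the most recent backup on failure
rollback_config() {
  local cfg="$1" last
  last=$(ls -1 "${cfg}".bak.* 2>/dev/null | sort | tail -1)
  [ -n "$last" ] && cp -p "$last" "$cfg"
}
```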

Test Configuration:

# Create simple test config
cat > nginx.test.conf << 'EOF'
server {
  listen 80;
  server_name _;
  location / {
    return 200 "OK";
  }
}
EOF

Validation:

  • Deployment succeeds on all servers

  • Service is accessible after deployment

  • Rollback restores previous config and service works

  • Backup files are created and dated

Bonus:

  • Implement canary deployment (deploy to 1 server first, monitor, then deploy to rest)

  • Send notification email on successful/failed deployment

  • Compare configurations to detect drift

13.5.6. Lab 1: Network Connectivity Verification#

Create a comprehensive network diagnostics script that checks connectivity to multiple hosts and services.

Requirements:

  • Test connectivity to 5+ different hosts using ping

  • Test HTTP/HTTPS connectivity using curl

  • Test DNS resolution

  • Test TCP connectivity to specific ports (SSH, HTTP, HTTPS, DNS)

  • Generate a report showing which services are reachable

  • Handle timeouts gracefully

  • Log results with timestamps

Test Targets:

declare -a HOSTS=("8.8.8.8" "1.1.1.1" "github.com")
declare -A SERVICES=(
  ["google.com:443"]="HTTPS"
  ["github.com:22"]="GitHub SSH"
  ["dns.google:53"]="DNS"
)
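The SERVICES map above can drive a TCP check. This sketch uses bash's /dev/tcp plus the coreutils timeout command (the 5-second timeout is an arbitrary choice, and a trimmed copy of the map is repeated to keep the example self-contained):

```shell
#!/usr/bin/env bash
# TCP reachability check: succeed if a connection opens within 5 seconds.
check_port() {
  local host="$1" port="$2"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

declare -A SERVICES=(["google.com:443"]="HTTPS" ["github.com:22"]="GitHub SSH")
for target in "${!SERVICES[@]}"; do
  host="${target%%:*}" port="${target##*:}"
  if check_port "$host" "$port"; then
    echo "✓ ${target} - ${SERVICES[$target]} accessible"
  else
    echo "✗ ${target} - timeout"
  fi
done
```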

Report Output:

=== Network Diagnostics Report ===
Timestamp: 2024-01-15 10:30:00

Host Connectivity:
✓ 8.8.8.8 - 25.3ms
✓ google.com - 12.1ms
✗ unreachable.host - timeout

Service Accessibility:
✓ google.com:443 - HTTPS accessible
✓ github.com:22 - SSH accessible
✗ dns.google:53 - timeout

Summary: 4/6 targets reachable

Bonus:

  • Add traceroute output for unreachable hosts

  • Perform DNS lookups and show all returned IPs

  • Check latency trends over time