4.6. Lab: Text Pipeline Analysis#
In this lab, you’ll apply everything from Chapter 4: viewing files, searching with grep, transforming data with sed/awk, and building powerful pipelines.
4.6.1. Part 1: Data Viewing and Exploration#
4.6.1.1. Exercise 1.1: Understanding Your Dataset#
Create sample datasets to work with:
# Create a sample web server log
$ cat > access.log << 'EOF'
192.168.1.1 - - [15/Jan/2025:10:23:45 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [15/Jan/2025:10:24:12 +0000] "GET /about.html HTTP/1.1" 200 2456
192.168.1.1 - - [15/Jan/2025:10:25:33 +0000] "POST /api/user HTTP/1.1" 201 567
10.0.0.8 - - [15/Jan/2025:10:26:01 +0000] "GET /index.html HTTP/1.1" 200 1234
192.168.1.1 - - [15/Jan/2025:10:27:15 +0000] "GET /styles.css HTTP/1.1" 404 0
10.0.0.5 - - [15/Jan/2025:10:28:42 +0000] "GET /api/posts HTTP/1.1" 200 5432
192.168.1.2 - - [15/Jan/2025:10:29:10 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.8 - - [15/Jan/2025:10:30:22 +0000] "GET /about.html HTTP/1.1" 200 2456
EOF
# Create sample user data
$ cat > users.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
2,Bob Smith,35,Sales,65000
3,Charlie Brown,42,Engineering,95000
4,Diana Prince,31,Marketing,72000
5,Eve Wilson,29,Engineering,88000
6,Frank Thomas,38,Sales,68000
7,Grace Lee,45,Engineering,105000
8,Henry Davis,33,HR,62000
EOF
# Create application log file
$ cat > app.log << 'EOF'
[2025-01-15 10:15:23] INFO Starting application
[2025-01-15 10:15:24] INFO Loading configuration
[2025-01-15 10:15:25] ERROR Database connection failed: timeout
[2025-01-15 10:15:26] ERROR Retrying connection...
[2025-01-15 10:15:27] INFO Database connected
[2025-01-15 10:15:28] DEBUG Loading user cache
[2025-01-15 10:15:29] INFO Cache ready
[2025-01-15 10:16:15] WARNING High memory usage: 85%
[2025-01-15 10:17:32] ERROR API request failed: 500
[2025-01-15 10:17:33] INFO Restarting failed service
EOF
4.6.1.2. Exercise 1.2: Explore the Data#
# 1. View the web server log
$ cat access.log
# Questions:
# - What's the format?
# - How many columns?
# - What information is in each field?
# 2. Determine file types
$ file access.log users.csv app.log
# 3. Count lines
$ wc -l access.log users.csv app.log
# 4. View first few lines of users.csv
$ head -3 users.csv
# 5. View last few lines of app.log
$ tail -4 app.log
# 6. Get column count
$ head -1 users.csv | tr ',' '\n' | nl
# How many columns does users.csv have?
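The column-count question can also be answered programmatically: awk's built-in NF variable holds the number of fields on the current line. A small standalone sketch (users_header.csv is a scratch file recreated here so the snippet runs on its own):

```shell
# Recreate just the header line from Exercise 1.1
printf 'id,name,age,department,salary\n' > users_header.csv

# NF = number of comma-separated fields; print it for the first line and stop
awk -F, '{print NF; exit}' users_header.csv
```

This prints 5, matching the count you get from the tr/nl pipeline above.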
4.6.2. Part 2: Searching and Filtering#
4.6.2.1. Exercise 2.1: Find Patterns in Logs#
# 1. Find all HTTP errors (status code 40x or 50x)
$ grep -E '40[0-9]|50[0-9]' access.log
# 2. Count ERROR lines in app.log
$ grep -c "ERROR" app.log
# 3. Find lines with ERROR or WARNING
$ grep -E "ERROR|WARNING" app.log
# 4. Get all GET requests
$ grep "GET" access.log
# 5. Show line numbers for errors
$ grep -n "ERROR" app.log
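A caveat on step 1: the pattern 40[0-9]|50[0-9] matches those digit runs anywhere on the line, so a byte count such as 5042 would also match. Anchoring the test to the status field is stricter. A sketch against a two-line scratch log (mini.log is an illustrative file name):

```shell
cat > mini.log << 'EOF'
192.168.1.1 - - [15/Jan/2025:10:27:15 +0000] "GET /styles.css HTTP/1.1" 404 0
10.0.0.5 - - [15/Jan/2025:10:28:42 +0000] "GET /api/posts HTTP/1.1" 200 5042
EOF

# grep -E '40[0-9]|50[0-9]' would match BOTH lines (5042 contains "504");
# comparing field 9 matches only genuine 4xx/5xx status codes
awk '$9 ~ /^[45][0-9][0-9]$/' mini.log
```

Only the 404 request survives the field-anchored version.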
4.6.2.2. Exercise 2.2: Extract Specific Information#
# Using field extraction techniques from earlier in Chapter 4
# 1. Get all IP addresses from access.log
$ awk '{print $1}' access.log | sort | uniq
# 2. Extract response codes
$ awk '{print $9}' access.log | sort | uniq -c
# 3. Get all requested paths (HTTP request part)
$ awk '{print $7}' access.log | sort | uniq -c
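awk is not the only field extractor: for a fixed single-character delimiter, cut does the same job with less machinery. A standalone sketch on one scratch line in the access.log format (one.log is an illustrative file name):

```shell
echo '192.168.1.1 - - [15/Jan/2025:10:23:45 +0000] "GET /index.html HTTP/1.1" 200 1234' > one.log

# Split on single spaces; field numbering matches awk's for this format
cut -d' ' -f1 one.log   # IP address, same as awk '{print $1}'
cut -d' ' -f7 one.log   # requested path, same as awk '{print $7}'
```

cut is lighter but less flexible: unlike awk it cannot collapse runs of whitespace or compute on the fields.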
4.6.3. Part 3: Data Transformation#
4.6.3.1. Exercise 3.1: Transform CSV Data#
# 1. Extract just names and salaries from users.csv
$ awk -F, 'NR>1 {print $2, $5}' users.csv
# 2. Find high earners (salary > $85,000)
$ awk -F, '$5 > 85000 {print $2 ": $" $5}' users.csv
# 3. Count employees by department
$ awk -F, 'NR>1 {print $4}' users.csv | sort | uniq -c
# 4. Calculate average salary (engineering department)
$ awk -F, '$4 == "Engineering" {sum += $5; count++} END {print "Avg: $" sum/count}' users.csv
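Step 4 averages one hard-coded department; awk's associative arrays can average every department in a single pass. A sketch against a reduced scratch copy of the data (mini_users.csv is an illustrative file name):

```shell
cat > mini_users.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
2,Bob Smith,35,Sales,65000
3,Charlie Brown,42,Engineering,95000
EOF

# sum[] and n[] are arrays keyed by department name;
# the trailing sort makes the output order deterministic
awk -F, 'NR>1 {sum[$4] += $5; n[$4]++}
         END {for (d in sum) printf "%s: $%.0f\n", d, sum[d]/n[d]}' mini_users.csv | sort
```

For this scratch data it prints Engineering: $90000 and Sales: $65000.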
4.6.3.2. Exercise 3.2: Fix and Reformat Data#
# 1. Bracket the ERROR level for emphasis
$ sed 's/ERROR/[ERROR]/g' app.log
# 2. Strip the date from the timestamps, keeping just the time
$ sed 's/\[2025-01-15 //' app.log
# 3. Extract just the message part from app.log
$ sed -E 's/.*\] [A-Z]+ //' app.log
# Removes the bracketed timestamp and the log level
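sed can also rearrange text rather than delete it, using capture groups. A standalone sketch on one scratch line in the app.log format (one_app.log is an illustrative file name); groups \1 to \3 capture the time, level, and message:

```shell
echo '[2025-01-15 10:15:23] INFO Starting application' > one_app.log

# \1 = time, \2 = level, \3 = message; the replacement reorders them
sed -E 's/^\[[0-9-]+ ([0-9:]+)\] ([A-Z]+) (.*)/\2 at \1: \3/' one_app.log
```

This prints "INFO at 10:15:23: Starting application", a reformat that plain deletion cannot do.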
4.6.4. Part 4: Building Pipelines#
4.6.4.1. Exercise 4.1: Analyze Web Server Logs#
# 1. Find most requested pages
$ awk '{print $7}' access.log | sort | uniq -c | sort -rn
# 2. Find most active IP addresses
$ awk '{print $1}' access.log | sort | uniq -c | sort -rn
# 3. Calculate total bandwidth used
$ awk '{sum += $10} END {print "Total bytes: " sum}' access.log
# 4. Find all failed requests and show which paths
$ awk '$9 >= 400 {print $7, $9}' access.log | sort -k2 -n
# 5. Generate a report: IP, request count, total bytes
$ awk '{ip[$1]++; bytes[$1]+=$10} END {for (i in ip) print i, ip[i], bytes[i]}' access.log | sort
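The step-5 report comes out sorted by IP; to rank IPs by bandwidth instead, sort the report numerically on its third column. A sketch against a three-line scratch log (mini2.log is an illustrative file name):

```shell
cat > mini2.log << 'EOF'
10.0.0.5 - - [15/Jan/2025:10:24:12 +0000] "GET /a HTTP/1.1" 200 2456
192.168.1.1 - - [15/Jan/2025:10:23:45 +0000] "GET /b HTTP/1.1" 200 1234
10.0.0.5 - - [15/Jan/2025:10:28:42 +0000] "GET /c HTTP/1.1" 200 5432
EOF

# Same per-IP report as step 5, then rank by total bytes (column 3), descending
awk '{ip[$1]++; bytes[$1]+=$10} END {for (i in ip) print i, ip[i], bytes[i]}' mini2.log |
    sort -k3 -rn
```

The heaviest consumer (10.0.0.5, two requests, 7888 bytes) now comes first.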
4.6.4.2. Exercise 4.2: Analyze Application Logs#
# 1. Count log levels
$ grep -oE '\] [A-Z]+' app.log | sed 's/] //' | sort | uniq -c
# 2. Find all error messages
$ grep "ERROR" app.log | sed 's/.*ERROR //' | sort -u
# 3. Timeline: count events per minute
$ awk '{print substr($2, 1, 5)}' app.log | sort | uniq -c
# 4. Find warnings and their context
$ grep -B1 "WARNING" app.log
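The per-minute bucketing idea can be combined with a level filter, for example counting ERROR events per minute; substr($2, 1, 5) keeps just the HH:MM part of the timestamp. A standalone sketch (mini_app.log is an illustrative scratch file):

```shell
cat > mini_app.log << 'EOF'
[2025-01-15 10:15:25] ERROR Database connection failed: timeout
[2025-01-15 10:15:26] ERROR Retrying connection...
[2025-01-15 10:17:32] ERROR API request failed: 500
EOF

# Filter to ERROR lines, bucket by HH:MM, count per bucket
grep ERROR mini_app.log | awk '{print substr($2, 1, 5)}' | sort | uniq -c
```

For this scratch data: two errors in minute 10:15, one in 10:17.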
4.6.5. Part 5: Complex Real-World Scenarios#
4.6.5.1. Scenario A: Log Analysis Report#
# Generate a complete web server analysis
$ cat > analyze_web.sh << 'EOF'
#!/bin/bash
echo "=== Web Server Analysis ==="
echo
echo "Total requests: $(wc -l < access.log)"
echo
echo "Top 5 requested paths:"
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -5
echo
echo "Error distribution:"
awk '$9 >= 400 {print $9}' access.log | sort | uniq -c | sort -k2
echo
echo "Top IPs by request count:"
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5
EOF
$ bash analyze_web.sh
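A slightly hardened variant of the script is worth sketching: fail fast on errors and take the log file as an argument instead of hard-coding it. This is one possible hardening, not the only design; analyze_web2.sh is an illustrative name and it assumes the same log format:

```shell
cat > analyze_web2.sh << 'EOF'
#!/bin/bash
set -euo pipefail               # exit on errors, unset variables, pipe failures
log="${1:-access.log}"          # log file from the first argument, with a default

echo "=== Web Server Analysis: $log ==="
echo "Total requests: $(wc -l < "$log")"
echo "Top 5 requested paths:"
awk '{print $7}' "$log" | sort | uniq -c | sort -rn | head -5
EOF
chmod +x analyze_web2.sh
```

With set -euo pipefail, a missing log file stops the script immediately instead of producing a half-empty report.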
4.6.5.2. Scenario B: Data Quality Check#
# Validate users.csv data
$ cat > validate_users.sh << 'EOF'
#!/bin/bash
echo "=== Data Quality Report ==="
echo
echo "Total records: $(tail -n +2 users.csv | wc -l)"
echo
echo "Records with missing fields:"
awk -F, 'NF != 5 {print NR": " $0}' users.csv
echo
echo "Invalid salary (non-numeric):"
awk -F, 'NR>1 && $5 !~ /^[0-9]+$/ {print NR": " $2 " - " $5}' users.csv
echo
echo "Salary range:"
echo "Lowest: $(awk -F, 'NR>1 {print $5}' users.csv | sort -n | head -1)"
echo "Highest: $(awk -F, 'NR>1 {print $5}' users.csv | sort -n | tail -1)"
EOF
$ bash validate_users.sh
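One further check that could be slotted into validate_users.sh: duplicate IDs. uniq -d prints only repeated lines, so sorted IDs piped through it surface any duplicates. A sketch with a deliberately broken scratch CSV (dup_users.csv is an illustrative name):

```shell
cat > dup_users.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
1,Bob Smith,35,Sales,65000
EOF

# Extract the ID column (skipping the header), then report repeats
awk -F, 'NR>1 {print $1}' dup_users.csv | sort | uniq -d
```

Empty output would mean every ID is unique; here it flags the duplicated ID 1.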
4.6.6. Reflection Questions#
Pipeline Design

- In Exercise 4.1, why do we sort before using uniq?
- How would you modify the pipeline to show only results with > 5 occurrences?

Performance

- Which is faster: grep | grep or grep -E "pattern1|pattern2"?
- When would you use tee in a pipeline?

Data Quality

- What validation checks would you add for CSV data?
- How would you handle CSV fields containing commas?

Real-World Application

- What log analysis would be useful for your own projects?
- How would you monitor for anomalies in these logs?

Tool Choice

- When would you use sed vs awk for the same task?
- What can awk do that grep cannot?
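For the commas-in-fields question, it helps to first see the failure mode: naive -F, splitting miscounts fields as soon as a quoted value contains a comma. A standalone demonstration (quoted.csv is a scratch file):

```shell
echo '9,"Smith, Jane",30,HR,60000' > quoted.csv

# The comma inside the quoted name is treated as a delimiter,
# so awk sees 6 fields where the record logically has 5
awk -F, '{print NF}' quoted.csv
```

Tools with a real CSV parser (for example Python's csv module, or GNU awk's FPAT feature) are the usual fix once data like this appears.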
4.6.7. Challenge Exercises#
4.6.7.1. Challenge 1: Multi-file Analysis#
# Create multiple log files and analyze across them
# Find users appearing in both access.log and app.log
# Generate a report showing correlation
4.6.7.2. Challenge 2: ETL Pipeline#
# Build a complete pipeline that:
# 1. Reads users.csv
# 2. Filters to specific department
# 3. Validates salary data
# 4. Reformats output
# 5. Saves to new file
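One possible starting point for this challenge (one design among many; the file names are illustrative): filter, validate, and reformat in a single awk pass, then save the result.

```shell
cat > mini_etl.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
2,Bob Smith,35,Sales,65000
3,Charlie Brown,42,Engineering,95000
EOF

# Keep Engineering rows with a purely numeric salary, reformat, save
awk -F, 'NR>1 && $4 == "Engineering" && $5 ~ /^[0-9]+$/ {
    printf "%s (%s): $%d\n", $2, $4, $5
}' mini_etl.csv > engineering.txt
cat engineering.txt
```

From here the challenge is extending each stage: more departments, more validation rules, richer output formats.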
4.6.7.3. Challenge 3: Log Anomaly Detection#
# Detect unusual patterns in app.log:
# - Sudden spike in errors
# - Missing expected INFO messages
# - Response time degradation
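A possible starting sketch for the error-spike case: bucket ERROR events per minute, then flag any minute whose count exceeds a threshold (max=2 here is an arbitrary choice; mini_anom.log is a scratch file):

```shell
cat > mini_anom.log << 'EOF'
[2025-01-15 10:15:25] ERROR Database connection failed: timeout
[2025-01-15 10:15:26] ERROR Retrying connection...
[2025-01-15 10:15:27] ERROR Still failing
[2025-01-15 10:17:32] ERROR API request failed: 500
EOF

# Count ERROR events per HH:MM bucket, then flag buckets above the threshold
grep ERROR mini_anom.log | awk '{print substr($2, 1, 5)}' | sort | uniq -c |
    awk -v max=2 '$1 > max {print "Spike at " $2 ": " $1 " errors"}'
```

Here minute 10:15 is flagged with 3 errors. The other two anomaly types (missing INFO messages, latency degradation) need state, which awk's arrays can also carry.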