4.6. Lab: Text Pipeline Analysis#

In this lab, you’ll apply everything from Chapter 4: viewing files, searching with grep, transforming data with sed/awk, and building powerful pipelines.

4.6.1. Part 1: Data Viewing and Exploration#

4.6.1.1. Exercise 1.1: Understanding Your Dataset#

Create sample datasets to work with:

# Create a sample web server log
$ cat > access.log << 'EOF'
192.168.1.1 - - [15/Jan/2025:10:23:45 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [15/Jan/2025:10:24:12 +0000] "GET /about.html HTTP/1.1" 200 2456
192.168.1.1 - - [15/Jan/2025:10:25:33 +0000] "POST /api/user HTTP/1.1" 201 567
10.0.0.8 - - [15/Jan/2025:10:26:01 +0000] "GET /index.html HTTP/1.1" 200 1234
192.168.1.1 - - [15/Jan/2025:10:27:15 +0000] "GET /styles.css HTTP/1.1" 404 0
10.0.0.5 - - [15/Jan/2025:10:28:42 +0000] "GET /api/posts HTTP/1.1" 200 5432
192.168.1.2 - - [15/Jan/2025:10:29:10 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.8 - - [15/Jan/2025:10:30:22 +0000] "GET /about.html HTTP/1.1" 200 2456
EOF

# Create sample user data
$ cat > users.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
2,Bob Smith,35,Sales,65000
3,Charlie Brown,42,Engineering,95000
4,Diana Prince,31,Marketing,72000
5,Eve Wilson,29,Engineering,88000
6,Frank Thomas,38,Sales,68000
7,Grace Lee,45,Engineering,105000
8,Henry Davis,33,HR,62000
EOF

# Create application log file
$ cat > app.log << 'EOF'
[2025-01-15 10:15:23] INFO Starting application
[2025-01-15 10:15:24] INFO Loading configuration
[2025-01-15 10:15:25] ERROR Database connection failed: timeout
[2025-01-15 10:15:26] ERROR Retrying connection...
[2025-01-15 10:15:27] INFO Database connected
[2025-01-15 10:15:28] DEBUG Loading user cache
[2025-01-15 10:15:29] INFO Cache ready
[2025-01-15 10:16:15] WARNING High memory usage: 85%
[2025-01-15 10:17:32] ERROR API request failed: 500
[2025-01-15 10:17:33] INFO Restarting failed service
EOF

4.6.1.2. Exercise 1.2: Explore the Data#

# 1. View the web server log
$ cat access.log
# Questions:
# - What's the format?
# - How many columns?
# - What information is in each field?

# 2. Determine file types
$ file access.log users.csv app.log

# 3. Count lines
$ wc -l access.log users.csv app.log

# 4. View first few lines of users.csv
$ head -3 users.csv

# 5. View last few lines of app.log
$ tail -4 app.log

# 6. Get column count
$ head -1 users.csv | tr ',' '\n' | nl
# How many columns does users.csv have?

4.6.2. Part 2: Searching and Filtering#

4.6.2.1. Exercise 2.1: Find Patterns in Logs#

# 1. Find all HTTP errors (status code 40x or 50x)
$ grep -E '40[0-9]|50[0-9]' access.log

# 2. Count ERROR lines in app.log
$ grep -c "ERROR" app.log

# 3. Find lines with ERROR or WARNING
$ grep -E "ERROR|WARNING" app.log

# 4. Get all GET requests
$ grep "GET" access.log

# 5. Show line numbers for errors
$ grep -n "ERROR" app.log

4.6.2.2. Exercise 2.2: Extract Specific Information#

# Using the field extraction techniques from earlier in this chapter

# 1. Get all IP addresses from access.log
$ awk '{print $1}' access.log | sort | uniq

# 2. Extract response codes
$ awk '{print $9}' access.log | sort | uniq -c

# 3. Get all requested paths (HTTP request part)
$ awk '{print $7}' access.log | sort | uniq -c

4.6.3. Part 3: Data Transformation#

4.6.3.1. Exercise 3.1: Transform CSV Data#

# 1. Extract just names and salaries from users.csv
$ awk -F, 'NR>1 {print $2, $5}' users.csv

# 2. Find high earners (salary > $85,000)
$ awk -F, '$5 > 85000 {print $2 ": $" $5}' users.csv

# 3. Count employees by department
$ awk -F, 'NR>1 {print $4}' users.csv | sort | uniq -c

# 4. Calculate average salary (engineering department)
$ awk -F, '$4 == "Engineering" {sum += $5; count++} END {print "Avg: $" sum/count}' users.csv

4.6.3.2. Exercise 3.2: Fix and Reformat Data#

# 1. Bracket the log level: change ERROR to [ERROR]
$ sed 's/ERROR/[ERROR]/g' app.log

# 2. Simplify the timestamps: strip the date, keeping only the time
$ sed 's/\[2025-01-15 //' app.log

# 3. Extract just the message part from app.log
$ sed -E 's/.*\] [A-Z]+ //' app.log
# Removes the timestamp and the log level
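The substitutions above only delete text; sed can also rearrange what it matched using capture groups. A minimal sketch (the input line is hard-coded here so it stands on its own):

```shell
# Reformat the app.log timestamp as HH:MM, dropping the date and seconds.
# \1 refers back to the part matched by the parentheses.
printf '[2025-01-15 10:15:23] INFO Starting application\n' \
  | sed -E 's/^\[[0-9-]+ ([0-9]+:[0-9]+):[0-9]+\] /\1 /'
# prints: 10:15 INFO Starting application
```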

4.6.4. Part 4: Building Pipelines#

4.6.4.1. Exercise 4.1: Analyze Web Server Logs#

# 1. Find most requested pages
$ awk '{print $7}' access.log | sort | uniq -c | sort -rn

# 2. Find most active IP addresses
$ awk '{print $1}' access.log | sort | uniq -c | sort -rn

# 3. Calculate total bandwidth used
$ awk '{sum += $10} END {print "Total bytes: " sum}' access.log

# 4. Find all failed requests and show which paths
$ awk '$9 >= 400 {print $7, $9}' access.log | sort -k2 -n

# 5. Generate a report: IP, request count, total bytes
$ awk '{ip[$1]++; bytes[$1]+=$10} END {for (i in ip) print i, ip[i], bytes[i]}' access.log | sort

4.6.4.2. Exercise 4.2: Analyze Application Logs#

# 1. Count log levels
$ grep -oE '\] [A-Z]+' app.log | sed 's/] //' | sort | uniq -c

# 2. Find all error messages
$ grep "ERROR" app.log | sed 's/.*ERROR //' | sort -u

# 3. Timeline: count events per minute
$ sed -E 's/^\[[0-9-]+ ([0-9]+:[0-9]+):.*/\1/' app.log | sort | uniq -c

# 4. Find warnings and their context
$ grep -B1 "WARNING" app.log

4.6.5. Part 5: Complex Real-World Scenarios#

4.6.5.1. Scenario A: Log Analysis Report#

# Generate a complete web server analysis
$ cat > analyze_web.sh << 'EOF'
#!/bin/bash
echo "=== Web Server Analysis ==="
echo
echo "Total requests: $(wc -l < access.log)"
echo
echo "Top 5 requested paths:"
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -5
echo
echo "Error distribution:"
awk '$9 >= 400 {print $9}' access.log | sort | uniq -c | sort -k2
echo
echo "Top IPs by request count:"
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5
EOF

$ bash analyze_web.sh

4.6.5.2. Scenario B: Data Quality Check#

# Validate users.csv data
$ cat > validate_users.sh << 'EOF'
#!/bin/bash
echo "=== Data Quality Report ==="
echo
echo "Total records: $(tail -n +2 users.csv | wc -l)"
echo
echo "Records with missing fields:"
awk -F, 'NF != 5 {print NR": " $0}' users.csv
echo
echo "Invalid salary (non-numeric):"
awk -F, 'NR>1 && $5 !~ /^[0-9]+$/ {print NR": " $2 " - " $5}' users.csv
echo
echo "Salary range:"
echo "Lowest: $(awk -F, 'NR>1 {print $5}' users.csv | sort -n | head -1)"
echo "Highest: $(awk -F, 'NR>1 {print $5}' users.csv | sort -n | tail -1)"
EOF

$ bash validate_users.sh

4.6.6. Reflection Questions#

  1. Pipeline Design

    • In Exercise 4.1, why do we sort before using uniq?

    • How would you modify the pipeline to show only results > 5 occurrences?

  2. Performance

    • Which is faster: grep | grep or grep -E "pattern1|pattern2"?

    • When would you use tee in a pipeline?

  3. Data Quality

    • What validation checks would you add for CSV data?

    • How would you handle CSV fields containing commas?

  4. Real-World Application

    • What log analysis would be useful for your own projects?

    • How would you monitor for anomalies in these logs?

  5. Tool Choice

    • When would you use sed vs awk for the same task?

    • What can awk do that grep cannot?
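If you want to experiment while thinking through questions 1 and 2, this sketch shows the two patterns involved, using a small throwaway dataset (items.txt, sorted.txt, and counts.txt are names invented for the sketch):

```shell
# Throwaway dataset: "a" appears 7 times, "b" twice, "c" once.
printf '%s\n' a a a b a a a c a b > items.txt

# Question 1: keep only values occurring more than 5 times.
sort items.txt | uniq -c | awk '$1 > 5 {print $2}'   # prints: a

# Question 2: tee saves an intermediate result while the pipeline continues.
sort items.txt | tee sorted.txt | uniq -c > counts.txt
```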

4.6.7. Challenge Exercises#

4.6.7.1. Challenge 1: Multi-file Analysis#

# Create multiple log files and analyze across them
# Find users appearing in both access.log and app.log
# Generate a report showing correlation
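app.log has no user field, so one way to start is to correlate IP addresses across two access logs instead. A bash sketch (process substitution requires bash), assuming a hypothetical second log named access2.log; trimmed copies of both files are created inline so it runs standalone:

```shell
# Trimmed sample logs (stand-ins for the lab files).
cat > access.log << 'EOF'
192.168.1.1 - - [15/Jan/2025:10:23:45 +0000] "GET /index.html HTTP/1.1" 200 1234
10.0.0.5 - - [15/Jan/2025:10:24:12 +0000] "GET /about.html HTTP/1.1" 200 2456
EOF
cat > access2.log << 'EOF'
10.0.0.5 - - [16/Jan/2025:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1234
172.16.0.9 - - [16/Jan/2025:09:01:12 +0000] "GET /about.html HTTP/1.1" 200 2456
EOF

# comm -12 keeps only the lines common to both sorted inputs.
common_ips=$(comm -12 <(awk '{print $1}' access.log  | sort -u) \
                      <(awk '{print $1}' access2.log | sort -u))
echo "$common_ips"
```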

4.6.7.2. Challenge 2: ETL Pipeline#

# Build a complete pipeline that:
# 1. Reads users.csv
# 2. Filters to specific department
# 3. Validates salary data
# 4. Reformats output
# 5. Saves to new file
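One possible shape for this pipeline, as a starting point rather than the answer (engineering.txt is an assumed output name, and a trimmed users.csv is created inline so the sketch runs standalone):

```shell
# Trimmed copy of the lab data.
cat > users.csv << 'EOF'
id,name,age,department,salary
1,Alice Johnson,28,Engineering,85000
2,Bob Smith,35,Sales,65000
3,Charlie Brown,42,Engineering,95000
EOF

# Filter to one department, keep only numeric salaries, reformat, save.
awk -F, 'NR>1 && $4 == "Engineering" && $5 ~ /^[0-9]+$/ {
    printf "%-20s $%s\n", $2, $5
}' users.csv > engineering.txt
cat engineering.txt
```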

4.6.7.3. Challenge 3: Log Anomaly Detection#

# Detect unusual patterns in app.log:
# - Sudden spike in errors
# - Missing expected INFO messages
# - Response time degradation
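A minimal sketch of the first idea, error spikes: bucket ERROR lines by minute and flag any bucket above a threshold. The threshold of 1 is arbitrary, chosen so it fires on the inline sample data:

```shell
# Trimmed sample: two errors in the 10:15 minute, one in 10:17.
cat > app.log << 'EOF'
[2025-01-15 10:15:25] ERROR Database connection failed: timeout
[2025-01-15 10:15:26] ERROR Retrying connection...
[2025-01-15 10:17:32] ERROR API request failed: 500
EOF

# Count ERROR lines per HH:MM bucket, then flag buckets over the threshold.
spikes=$(grep ERROR app.log \
  | sed -E 's/^\[[0-9-]+ ([0-9]+:[0-9]+):.*/\1/' \
  | sort | uniq -c \
  | awk '$1 > 1 {print "Spike at " $2 ": " $1 " errors"}')
echo "$spikes"
```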