4. Core Utilities and the Unix Data Model
At the heart of Unix philosophy lies a simple but powerful idea: everything is text. And if everything is text, you need powerful text processing tools. This chapter introduces you to the command-line utilities that make Unix systems so effective for data manipulation, analysis, and transformation.
How to Use This Chapter
- Read the concepts — Understand why each tool exists
- Run the examples — Type them yourself, don't copy-paste
- Modify the examples — Change patterns, try different data
- Do the lab — Apply everything to real problems
- Experiment — Build your own pipelines
This chapter is hands-on. You learn by doing.
A Quick Test
Before starting, you should be comfortable with:
# Navigate and list files
$ cd ~/projects
$ ls -la
# View file contents
$ cat file.txt
$ head file.txt
# Change permissions
$ chmod 755 script.sh
# Redirect output
$ command > output.txt
If any of these are unfamiliar, review Chapters 1-3 first.
Learning Outcomes
By the end of this chapter, you will be able to:
✓ Explain the Unix philosophy of “everything is text”
✓ Understand and use standard input/output/error streams
✓ View and inspect file contents with appropriate tools
✓ Search files using grep and regular expressions
✓ Transform and manipulate text data with sed and awk
✓ Build data processing pipelines using pipes and redirection
✓ Combine multiple tools to solve real data analysis tasks
✓ Debug command pipelines and troubleshoot data processing
✓ Analyze real-world logs and structured data
✓ Write efficient one-liners for complex tasks
Progression Strategy
0402 establishes the philosophy: everything flows as text through pipes.
0403 gives you tools to see what’s in files.
0404 teaches you to find the data you need.
0405 shows you how to shape and transform it.
0406 brings it all together: building efficient pipelines.
0407 applies everything to real scenarios.
Each section builds on the previous. Don’t skip around.
Chapter Map
| Section | Topic | Key Skills |
|---|---|---|
| 0402 | Everything Is A File | stdin/stdout/stderr, text philosophy, pipes |
| 0403 | Viewing & Inspecting Data | cat, head, tail, less, file, wc, hexdump |
| 0404 | Searching, Filtering, Parsing | grep, regular expressions, find |
| 0405 | Data Transformation Tools | sed, awk, cut, tr, sort, uniq |
| 0406 | Pipes & Redirection | >, >>, <, \|, combining, debugging |
| 0407 | Lab: Text Pipeline Analysis | Hands-on exercises, real data analysis |
What You'll Need
Prerequisites:
- Familiarity with basic shell commands (Chapter 1)
- Understanding of file system navigation (Chapter 2)
- Knowledge of permissions and user contexts (Chapter 3)

Mindset:

- Think in data flows, not files
- Embrace small tools that do one thing well
- Learn to read a complex pipeline from left to right
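Reading a pipeline left to right is worth practicing early. Here is a minimal sketch (the sample words are invented) showing how each stage transforms the stream produced by the one before it:

```shell
# Read left to right: generate, group, count, rank, truncate.
printf 'apple\nbanana\napple\ncherry\napple\n' |
    sort |      # group identical lines together
    uniq -c |   # collapse each group, prefixing its count
    sort -rn |  # order by count, largest first
    head -1     # keep only the top line
# Prints the most frequent word with its count: 3 apple
```

A pipe at the end of a line continues the pipeline onto the next line, which is how long chains stay readable.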
In This Chapter
You’ll learn:
- Everything is a file: Understanding stdin, stdout, stderr, and pipes
- Viewing data: Tools like cat, head, tail, less, and file
- Searching and filtering: grep, regular expressions, and find
- Transforming data: sed, awk, cut, tr, sort, and uniq
- Building pipelines: Combining tools with pipes and redirection
- Real-world analysis: Processing logs, data files, and structured text
Historical Context
These utilities (grep, sed, awk, etc.) were created in the 1970s and 1980s and haven't been fundamentally replaced, because they're elegant solutions to fundamental problems. They represent the Unix philosophy:

- Do one thing well — Each tool has a focused purpose
- Work with text — Text is universal; binary is not
- Compose together — Pipe output to input seamlessly
- Be predictable — Consistent behavior across versions
This philosophy has proven so effective that modern tools (Python, jq, etc.) are built on the same principles.
Why These Tools Matter
In Chapters 1-3, you learned the shell basics: navigation, file management, and permissions. Those are the foundations. This chapter is where Unix becomes powerful.
Modern systems generate enormous amounts of data:
- Web servers: Millions of requests per day
- Applications: Continuous logs of operations
- Data science: Gigabytes of raw data to process
- System administration: Monitoring hundreds of machines
Without these tools, you’d need to write programs for every task. With them, you can solve complex problems with one-liners.
The Power of Text Processing
Consider these real-world scenarios:
Scenario 1: Log Analysis
# Find all errors in a 10GB log file, count them by type, generate report
$ grep "ERROR" app.log | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c | sort -rn
With a single pipeline, you can process terabytes of data without loading it all into memory.
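To see what each stage contributes, here is the same error-counting idea traced on a few fabricated log lines (the `ERROR:` message format is an assumption made to match the pipeline above):

```shell
# Three fabricated log lines in the format the pipeline expects
log='2024-01-15 ERROR: Timeout on upstream
2024-01-15 ERROR: Timeout on upstream
2024-01-15 ERROR: Disk full on /var'

echo "$log" | sed 's/.*ERROR: //'                  # strip everything up to the message
echo "$log" | sed 's/.*ERROR: //' | cut -d' ' -f1  # keep the first word: the error type
echo "$log" | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c | sort -rn
# Final stage counts each type, most frequent first: Timeout (2), then Disk (1)
```

Running the stages one at a time like this is also the standard way to debug a pipeline that misbehaves.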
Scenario 2: Data Transformation
# Convert messy data to clean CSV
$ grep -v "^#" raw_data.txt | sed 's/;/,/g' | awk -F, '{print $2, $1, $5}' > clean_data.csv
Unix tools can transform any text format into any other—no programming needed.
Scenario 3: System Monitoring
# Find processes consuming most memory
$ ps aux | sort -k4 -rn | head -10
# Find disk space by directory
$ du -sh */ | sort -h | tail -10
These utilities aren’t just powerful—they’re composable. You chain them together to build increasingly sophisticated solutions.
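A practical way to compose them is incrementally: run one stage, inspect its output, then append the next. A sketch using invented process data in place of live `ps aux` output, so the steps are reproducible:

```shell
# Columns: user pid cpu mem command (a stand-in for `ps aux` output)
data='root 1 0.0 0.1 init
alice 200 1.2 4.5 firefox
bob 300 0.5 2.1 sshd
alice 400 3.0 9.9 chrome'

echo "$data"                           # stage 1: inspect the raw stream
echo "$data" | sort -k4 -rn            # stage 2: sort numerically by column 4 (memory)
echo "$data" | sort -k4 -rn | head -2  # stage 3: keep the top two consumers
# The top line is the chrome process (9.9 in the memory column)
```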
Real-World Scenario: Log Analysis and Reporting
Imagine you run a web application server and need to analyze logs to understand user behavior and performance issues. Your application generates a log file like this:
[2024-01-15 10:23:45] INFO: Request from 192.168.1.100 - GET /api/users (200) 145ms
[2024-01-15 10:23:46] ERROR: Database connection failed for 192.168.1.101 - POST /api/data (500) 5000ms
[2024-01-15 10:23:47] WARN: Cache miss on /products/list (202) 892ms
[2024-01-15 10:23:48] INFO: Request from 192.168.1.102 - GET /api/products (200) 234ms
[2024-01-15 10:23:49] ERROR: Timeout on external API call from 192.168.1.100 (503) 30000ms
You need to:
- Count errors by type
- Find the slowest requests
- Identify problematic IP addresses
- Generate a summary report
Without these tools, you’d write a Python script. With them:
#!/bin/bash
log_file="app.log"
echo "=== LOG ANALYSIS REPORT ==="
echo
echo "Error Summary:"
grep "ERROR" "$log_file" | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c
echo
echo "Slowest Requests (top 5):"
grep -E "INFO|ERROR|WARN" "$log_file" | \
    sed 's/.* \([0-9][0-9]*\)ms$/\1/' | \
    sort -rn | head -5
echo
echo "Request Count by IP:"
grep "from" "$log_file" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -rn
This solution demonstrates:
- grep: Find specific lines (errors, slow requests)
- sed: Extract relevant data (error types, response times)
- cut: Isolate columns
- sort & uniq: Count and aggregate
- pipes: Chain operations together
These techniques scale from kilobytes to terabytes.
Sample Script: Text Processing Utility Library
A complete utility library for common text processing tasks:
#!/bin/bash
# lib/textutils.sh - Reusable text processing utilities

# Extract CSV column by name
csv_column() {
    local file="$1"
    local column_name="$2"
    # Find column number from header
    local col_num=$(head -1 "$file" | tr ',' '\n' | grep -n "^$column_name$" | cut -d: -f1)
    if [ -z "$col_num" ]; then
        echo "Column not found: $column_name" >&2
        return 1
    fi
    # Extract that column from data rows
    tail -n +2 "$file" | cut -d, -f"$col_num"
}

# Count occurrences of a pattern
count_pattern() {
    local file="$1"
    local pattern="$2"
    grep -c "$pattern" "$file"
}

# Extract lines between two patterns
extract_section() {
    local file="$1"
    local start="$2"
    local end="$3"
    sed -n "/$start/,/$end/p" "$file"
}

# Convert tab-separated to comma-separated
tsv_to_csv() {
    local file="$1"
    tr '\t' ',' < "$file"
}

# Extract specific whitespace-separated fields by position
extract_fields() {
    local file="$1"
    shift
    local -a fields=("$@")
    awk -v fields="${fields[*]}" '
        BEGIN { n = split(fields, f, " ") }
        {
            line = ""
            for (i = 1; i <= n; i++)
                line = line (i > 1 ? " " : "") $f[i]
            print line
        }' "$file"
}

# Remove empty lines and comments
clean_text() {
    local file="$1"
    grep -v '^\s*$' "$file" | grep -v '^\s*#'
}

# Sort by a specific column in CSV (header row stays first)
sort_csv_by_column() {
    local file="$1"
    local col="$2"
    (head -1 "$file"; tail -n +2 "$file" | sort -t, -k"$col" -n)
}

# Calculate column statistics
column_stats() {
    local file="$1"
    local col="$2"
    # Assumes an integer-valued column
    local data=$(tail -n +2 "$file" | cut -d, -f"$col")
    local sum=$(echo "$data" | paste -sd+ | bc)
    local count=$(echo "$data" | wc -l)
    # Use bc so the average keeps two decimal places
    local avg=$(echo "scale=2; $sum / $count" | bc)
    local max=$(echo "$data" | sort -n | tail -1)
    local min=$(echo "$data" | sort -n | head -1)
    echo "Sum: $sum"
    echo "Count: $count"
    echo "Average: $avg"
    echo "Max: $max"
    echo "Min: $min"
}

# Find lines with duplicate values in a field
find_duplicates() {
    local file="$1"
    local col="$2"
    cut -d, -f"$col" "$file" | sort | uniq -d
}

# Usage example
if [ "${BASH_SOURCE[0]}" == "${0}" ]; then
    # Demonstrate the utilities
    echo "Text Processing Utilities Loaded"
    echo "Usage: source textutils.sh"
    echo "Available functions:"
    declare -F | awk '{print "  - " $3}'
fi
What this library demonstrates:
- Using pipes and text tools for data extraction
- Pattern matching with grep and sed
- Field extraction with cut and awk
- Sorting and counting with sort/uniq
- Combining tools for complex operations
- Writing reusable utility functions
These are the patterns you’ll build throughout this chapter.
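As a quick sanity check, `csv_column` from the library can be exercised on a throwaway file. The CSV content below is invented, and the function is repeated inline so the sketch runs standalone:

```shell
#!/bin/bash
# csv_column as defined in lib/textutils.sh above
csv_column() {
    local file="$1"
    local column_name="$2"
    local col_num=$(head -1 "$file" | tr ',' '\n' | grep -n "^$column_name$" | cut -d: -f1)
    if [ -z "$col_num" ]; then
        echo "Column not found: $column_name" >&2
        return 1
    fi
    tail -n +2 "$file" | cut -d, -f"$col_num"
}

# Build a small sample CSV and pull one column out of it
tmpfile=$(mktemp)
printf 'name,age,city\nalice,30,oslo\nbob,25,bergen\n' > "$tmpfile"
csv_column "$tmpfile" age   # prints 30 and 25, one per line
csv_column "$tmpfile" nope  # reports "Column not found: nope" on stderr
rm -f "$tmpfile"
```

Probing each function this way, on data small enough to verify by eye, is how you build confidence before pointing these tools at real files.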