4. Core Utilities and the Unix Data Model

At the heart of Unix philosophy lies a simple but powerful idea: everything is text. And if everything is text, you need powerful text processing tools. This chapter introduces you to the command-line utilities that make Unix systems so effective for data manipulation, analysis, and transformation.

How to Use This Chapter

  1. Read the concepts — Understand why each tool exists

  2. Run the examples — Type them yourself, don’t copy-paste

  3. Modify the examples — Change patterns, try different data

  4. Do the lab — Apply everything to real problems

  5. Experiment — Build your own pipelines

This chapter is hands-on. You learn by doing.

A Quick Test

Before starting, you should be comfortable with:

# Navigate and list files
$ cd ~/projects
$ ls -la

# View file contents
$ cat file.txt
$ head file.txt

# Change permissions
$ chmod 755 script.sh

# Redirect output
$ command > output.txt

If any of these are unfamiliar, review Chapters 1-3 first.

Learning Outcomes

By the end of this chapter, you will be able to:

  • ✓ Explain the Unix philosophy of “everything is text”

  • ✓ Understand and use standard input/output/error streams

  • ✓ View and inspect file contents with appropriate tools

  • ✓ Search files using grep and regular expressions

  • ✓ Transform and manipulate text data with sed and awk

  • ✓ Build data processing pipelines using pipes and redirection

  • ✓ Combine multiple tools to solve real data analysis tasks

  • ✓ Debug command pipelines and troubleshoot data processing

  • ✓ Analyze real-world logs and structured data

  • ✓ Write efficient one-liners for complex tasks

Progression Strategy

0402 establishes the philosophy: everything flows as text through pipes.

0403 gives you tools to see what’s in files.

0404 teaches you to find the data you need.

0405 shows you how to shape and transform it.

0406 brings it all together: building efficient pipelines.

0407 applies everything to real scenarios.

Each section builds on the previous. Don’t skip around.

Chapter Map

Section  Topic                          Key Skills
0402     Everything Is a File           stdin/stdout/stderr, text philosophy, pipes
0403     Viewing & Inspecting Data      cat, head, tail, less, file, wc, hexdump
0404     Searching, Filtering, Parsing  grep, regular expressions, find
0405     Data Transformation Tools      sed, awk, cut, tr, sort, uniq
0406     Pipes & Redirection            >, >>, <, |, combining, debugging
0407     Lab: Text Pipeline Analysis    Hands-on exercises, real data analysis
What You'll Need

Prerequisites:

  • Familiarity with basic shell commands (Chapter 1)

  • Understanding of file system navigation (Chapter 2)

  • Knowledge of permissions and user contexts (Chapter 3)

Mindset:

  • Think in data flows, not files

  • Embrace small tools that do one thing well

  • Learn to read a complex pipeline from left to right

In This Chapter

You’ll learn:

  • Everything is a file: Understanding stdin, stdout, stderr, and pipes

  • Viewing data: Tools like cat, head, tail, less, and file

  • Searching and filtering: grep, regular expressions, and find

  • Transforming data: sed, awk, cut, tr, sort, and uniq

  • Building pipelines: Combining tools with pipes and redirection

  • Real-world analysis: Processing logs, data files, and structured text
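The stream model in that first bullet can be previewed with a quick experiment. Here `sh -c` stands in for any noisy program (the file names `out.txt` and `err.txt` are arbitrary):

```shell
# A command that writes one line to stdout and one to stderr
# (sh -c is used purely to fake a noisy program)
sh -c 'echo "result line"; echo "error line" >&2' > out.txt 2> err.txt

cat out.txt   # stdout landed here
cat err.txt   # stderr landed here

# 2>&1 merges stderr into stdout, so the pipe sees both lines
sh -c 'echo "result line"; echo "error line" >&2' 2>&1 | wc -l
```

Keeping the two streams separate is what lets you pipe a program's real output onward while errors still reach your terminal or a log file.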

Historical Context

These utilities (grep, sed, awk, etc.) were created in the 1970s and '80s and haven't been replaced because they're elegant solutions to fundamental problems. They represent the Unix philosophy:

  1. Do one thing well — Each tool has a focused purpose

  2. Work with text — Text is universal; binary is not

  3. Compose together — Pipe output to input seamlessly

  4. Be predictable — Consistent behavior across versions

This philosophy has proven so effective that modern tools (Python, jq, etc.) are built on the same principles.
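The "compose together" principle is best seen in the classic word-frequency pipeline, where five single-purpose tools combine into one analysis (the sample sentence here is made up for the demo):

```shell
# Sample input, just for illustration
printf 'the quick fox jumps over the lazy dog the end\n' > sample.txt

tr ' ' '\n' < sample.txt |  # one word per line
  sort |                    # group identical words together
  uniq -c |                 # count each group
  sort -rn |                # most frequent first
  head -3                   # top three
```

No single tool knows anything about word frequencies; the capability emerges entirely from composition.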

Why These Tools Matter

In Chapters 1-3, you learned the shell basics: navigation, file management, and permissions. Those are the foundations. This chapter is where Unix becomes powerful.

Modern systems generate enormous amounts of data:

  • Web servers: Millions of requests per day

  • Applications: Continuous logs of operations

  • Data science: Gigabytes of raw data to process

  • System administration: Monitoring hundreds of machines

Without these tools, you’d need to write programs for every task. With them, you can solve complex problems with one-liners.

The Power of Text Processing

Consider these real-world scenarios:

Scenario 1: Log Analysis

# Find all errors in a 10GB log file, count them by type, generate report
$ grep "ERROR" app.log | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c | sort -rn

With a single pipeline, you can process terabytes of data without loading it all into memory.

Scenario 2: Data Transformation

# Convert messy data to clean CSV
$ grep -v "^#" raw_data.txt | sed 's/;/,/g' | awk -F, -v OFS=, '{print $2, $1, $5}' > clean_data.csv

Unix tools can transform one text format into another without writing a standalone program.

Scenario 3: System Monitoring

# Find processes consuming most memory
$ ps aux | sort -k4 -rn | head -10
# Find disk space by directory
$ du -sh */ | sort -h | tail -10

These utilities aren’t just powerful—they’re composable. You chain them together to build increasingly sophisticated solutions.

Real-World Scenario: Log Analysis and Reporting

Imagine you run a web application server and need to analyze logs to understand user behavior and performance issues. Your application generates a log file like this:

[2024-01-15 10:23:45] INFO: Request from 192.168.1.100 - GET /api/users (200) 145ms
[2024-01-15 10:23:46] ERROR: Database connection failed for 192.168.1.101 - POST /api/data (500) 5000ms
[2024-01-15 10:23:47] WARN: Cache miss on /products/list (202) 892ms
[2024-01-15 10:23:48] INFO: Request from 192.168.1.102 - GET /api/products (200) 234ms
[2024-01-15 10:23:49] ERROR: Timeout on external API call from 192.168.1.100 (503) 30000ms

You need to:

  1. Count errors by type

  2. Find slowest requests

  3. Identify problematic IP addresses

  4. Generate a summary report

Without these tools, you’d write a Python script. With them:

#!/bin/bash

log_file="app.log"

echo "=== LOG ANALYSIS REPORT ==="
echo

echo "Error Summary:"
grep "ERROR" "$log_file" | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c

echo
echo "Slowest Requests (top 5):"
grep -E "INFO|ERROR|WARN" "$log_file" | \
  sed 's/.* \([0-9]*\)ms$/\1/' | \
  sort -rn | head -5

echo
echo "Request Count by IP:"
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$log_file" | sort | uniq -c | sort -rn

This solution demonstrates:

  • grep: Find specific lines (errors, slow requests)

  • sed: Extract relevant data (error types, response times)

  • cut: Isolate columns

  • sort & uniq: Count and aggregate

  • pipes: Chain operations together

These techniques scale from kilobytes to terabytes.

Sample Script: Text Processing Utility Library

A complete utility library for common text processing tasks:

#!/bin/bash

# lib/textutils.sh - Reusable text processing utilities

# Extract CSV column by name
csv_column() {
    local file="$1"
    local column_name="$2"
    
    # Find column number from header
    local col_num=$(head -1 "$file" | tr ',' '\n' | grep -nxF "$column_name" | cut -d: -f1)
    
    if [ -z "$col_num" ]; then
        echo "Column not found: $column_name" >&2
        return 1
    fi
    
    # Extract that column from data rows
    tail -n +2 "$file" | cut -d, -f"$col_num"
}

# Count occurrences of a pattern
count_pattern() {
    local file="$1"
    local pattern="$2"
    
    grep -c "$pattern" "$file"
}

# Extract lines between two patterns
extract_section() {
    local file="$1"
    local start="$2"
    local end="$3"
    
    sed -n "/$start/,/$end/p" "$file"
}

# Convert tab-separated to comma-separated
tsv_to_csv() {
    local file="$1"
    tr '\t' ',' < "$file"
}

# Extract specific fields (given by number) from whitespace-separated text
extract_fields() {
    local file="$1"
    shift
    local -a fields=("$@")

    awk -v fields="${fields[*]}" '
    BEGIN { n = split(fields, f) }
    {
        # Print the requested field numbers for every record
        for (i = 1; i <= n; i++)
            printf "%s%s", $(f[i]), (i < n ? OFS : ORS)
    }' "$file"
}

# Remove empty lines and comments
clean_text() {
    local file="$1"
    grep -v '^[[:space:]]*$' "$file" | grep -v '^[[:space:]]*#'
}

# Sort a CSV numerically by a specific column, keeping the header first
sort_csv_by_column() {
    local file="$1"
    local col="$2"
    
    (head -1 "$file"; tail -n +2 "$file" | sort -t, -k"$col" -n)
}

# Calculate column statistics
column_stats() {
    local file="$1"
    local col="$2"

    # Assumes a numeric column
    local data=$(tail -n +2 "$file" | cut -d, -f"$col")

    local sum=$(echo "$data" | paste -sd+ - | bc)
    local count=$(echo "$data" | wc -l)
    local avg=$(echo "scale=2; $sum / $count" | bc)
    local max=$(echo "$data" | sort -n | tail -1)
    local min=$(echo "$data" | sort -n | head -1)

    echo "Sum: $sum"
    echo "Count: $count"
    echo "Average: $avg"
    echo "Max: $max"
    echo "Min: $min"
}

# Find lines with duplicate values in a field
find_duplicates() {
    local file="$1"
    local col="$2"
    
    cut -d, -f"$col" "$file" | sort | uniq -d
}

# Usage example
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    # Demonstrate the utilities
    echo "Text Processing Utilities Loaded"
    echo "Usage: source textutils.sh"
    echo "Available functions:"
    declare -F | awk '{print "  - " $3}'
fi

What this library demonstrates:

  • Using pipes and text tools for data extraction

  • Pattern matching with grep and sed

  • Field extraction with cut and awk

  • Sorting and counting with sort/uniq

  • Combining tools for complex operations

  • Writing reusable utility functions

These are the patterns you’ll build throughout this chapter.
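To see the csv_column logic from the library in isolation, here is the same header-lookup-then-cut pattern applied step by step to a small inline CSV (the file `people.csv` and its columns are invented for this demo):

```shell
# Hypothetical sample data, just for the demo
cat > people.csv <<'EOF'
name,age,city
alice,30,lisbon
bob,25,oslo
EOF

column_name="age"

# 1. Find the column's position: split the header into lines,
#    match the name literally (-F) and whole-line (-x), keep the line number
col_num=$(head -1 people.csv | tr ',' '\n' | grep -nxF "$column_name" | cut -d: -f1)

# 2. Skip the header, then cut that column from every data row
tail -n +2 people.csv | cut -d, -f"$col_num"
```

The same two-step shape (locate by header, then extract by position) works for any delimited format once you swap the delimiter passed to `tr` and `cut`.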