4. Core Utilities and the Unix Data Model

At the heart of Unix philosophy lies a simple but powerful idea: everything is text. And if everything is text, you need powerful text processing tools. This chapter introduces you to the command-line utilities that make Unix systems so effective for data manipulation, analysis, and transformation.

How to Use This Chapter

  1. Read the concepts — Understand why each tool exists

  2. Run the examples — Type them yourself, don’t copy-paste

  3. Modify the examples — Change patterns, try different data

  4. Do the lab — Apply everything to real problems

  5. Experiment — Build your own pipelines

This chapter is hands-on. You learn by doing.

A Quick Test

Before starting, you should be comfortable with:

# Navigate and list files
$ cd ~/projects
$ ls -la

# View file contents
$ cat file.txt
$ head file.txt

# Change permissions
$ chmod 755 script.sh

# Redirect output
$ command > output.txt

If any of these are unfamiliar, review Chapters 1-3 first.

Learning Outcomes

By the end of this chapter, you will be able to:

  • ✓ Explain the Unix philosophy of “everything is text”

  • ✓ Understand and use standard input/output/error streams

  • ✓ View and inspect file contents with appropriate tools

  • ✓ Search files using grep and regular expressions

  • ✓ Transform and manipulate text data with sed and awk

  • ✓ Build data processing pipelines using pipes and redirection

  • ✓ Combine multiple tools to solve real data analysis tasks

  • ✓ Debug command pipelines and troubleshoot data processing

  • ✓ Analyze real-world logs and structured data

  • ✓ Write efficient one-liners for complex tasks

Progression Strategy

0402 establishes the philosophy: everything flows as text through pipes.

0403 gives you tools to see what’s in files.

0404 teaches you to find the data you need.

0405 shows you how to shape and transform it.

0406 brings it all together: building efficient pipelines.

0407 applies everything to real scenarios.

Each section builds on the previous. Don’t skip around.

Chapter Map

Section  Topic                          Key Skills
0402     Everything Is a File           stdin/stdout/stderr, text philosophy, pipes
0403     Viewing & Inspecting Data      cat, head, tail, less, file, wc, hexdump
0404     Searching, Filtering, Parsing  grep, regular expressions, find
0405     Data Transformation Tools      sed, awk, cut, tr, sort, uniq
0406     Pipes & Redirection            >, >>, <, |, combining, debugging
0407     Lab: Text Pipeline Analysis    Hands-on exercises, real data analysis
What You'll Need

Prerequisites:

  • Familiarity with basic shell commands (Chapter 1)

  • Understanding of file system navigation (Chapter 2)

  • Knowledge of permissions and user contexts (Chapter 3)

Mindset:

  • Think in data flows, not files

  • Embrace small tools that do one thing well

  • Learn to read a complex pipeline from left to right

In This Chapter

You’ll learn:

  • Everything is a file: Understanding stdin, stdout, stderr, and pipes

  • Viewing data: Tools like cat, head, tail, less, and file

  • Searching and filtering: grep, regular expressions, and find

  • Transforming data: sed, awk, cut, tr, sort, and uniq

  • Building pipelines: Combining tools with pipes and redirection

  • Real-world analysis: Processing logs, data files, and structured text
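The stream model in that first bullet can be previewed with a quick experiment. Here `sh -c` stands in for any noisy program (the file names `out.txt` and `err.txt` are arbitrary):

```shell
# A command that writes one line to stdout and one to stderr
# (sh -c is used purely to fake a noisy program)
sh -c 'echo "result line"; echo "error line" >&2' > out.txt 2> err.txt

cat out.txt   # stdout landed here
cat err.txt   # stderr landed here

# 2>&1 merges stderr into stdout, so the pipe sees both lines
sh -c 'echo "result line"; echo "error line" >&2' 2>&1 | wc -l
```

Keeping the two streams separate is what lets you pipe a program's real output onward while errors still reach your terminal or a log file.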

Historical Context

These utilities (grep, sed, awk, etc.) were created in the 1970s and '80s and haven't been replaced because they're elegant solutions to fundamental problems. They represent the Unix philosophy:

  1. Do one thing well — Each tool has a focused purpose

  2. Work with text — Text is universal; binary is not

  3. Compose together — Pipe output to input seamlessly

  4. Be predictable — Consistent behavior across versions

This philosophy has proven so effective that modern tools (Python, jq, etc.) are built on the same principles.
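The "compose together" principle is best seen in the classic word-frequency pipeline, where five single-purpose tools combine into one analysis (the sample sentence here is made up for the demo):

```shell
# Sample input, just for illustration
printf 'the quick fox jumps over the lazy dog the end\n' > sample.txt

tr ' ' '\n' < sample.txt |  # one word per line
  sort |                    # group identical words together
  uniq -c |                 # count each group
  sort -rn |                # most frequent first
  head -3                   # top three
```

No single tool knows anything about word frequencies; the capability emerges entirely from composition.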

Why These Tools Matter

In Chapters 1-3, you learned the shell basics: navigation, file management, and permissions. Those are the foundations. This chapter is where Unix becomes powerful.

Modern systems generate enormous amounts of data:

  • Web servers: Millions of requests per day

  • Applications: Continuous logs of operations

  • Data science: Gigabytes of raw data to process

  • System administration: Monitoring hundreds of machines

Without these tools, you’d need to write programs for every task. With them, you can solve complex problems with one-liners.

The Power of Text Processing

Consider these real-world scenarios:

Scenario 1: Log Analysis

# Find all errors in a 10GB log file, count them by type, generate report
$ grep "ERROR" app.log | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c | sort -rn

With a single pipeline, you can process terabytes of data without loading it all into memory.

Scenario 2: Data Transformation

# Convert messy data to clean CSV
$ grep -v "^#" raw_data.txt | sed 's/;/,/g' | awk -F, -v OFS=, '{print $2, $1, $5}' > clean_data.csv

Unix tools can transform one text format into another without writing a standalone program.

Scenario 3: System Monitoring

# Find processes consuming most memory
$ ps aux | sort -k4 -rn | head -10
# Find disk space by directory
$ du -sh */ | sort -h | tail -10

These utilities aren’t just powerful—they’re composable. You chain them together to build increasingly sophisticated solutions.

Real-World Scenario: Log Analysis and Reporting

Imagine you run a web application server and need to analyze logs to understand user behavior and performance issues. Your application generates a log file like this:

[2024-01-15 10:23:45] INFO: Request from 192.168.1.100 - GET /api/users (200) 145ms
[2024-01-15 10:23:46] ERROR: Database connection failed for 192.168.1.101 - POST /api/data (500) 5000ms
[2024-01-15 10:23:47] WARN: Cache miss on /products/list (202) 892ms
[2024-01-15 10:23:48] INFO: Request from 192.168.1.102 - GET /api/products (200) 234ms
[2024-01-15 10:23:49] ERROR: Timeout on external API call from 192.168.1.100 (503) 30000ms

You need to:

  1. Count errors by type

  2. Find slowest requests

  3. Identify problematic IP addresses

  4. Generate a summary report

Without these tools, you’d write a Python script. With them:

#!/bin/bash

log_file="app.log"

echo "=== LOG ANALYSIS REPORT ==="
echo

echo "Error Summary:"
grep "ERROR" "$log_file" | sed 's/.*ERROR: //' | cut -d' ' -f1 | sort | uniq -c

echo
echo "Slowest Requests (top 5):"
grep -E "INFO|ERROR|WARN" "$log_file" | \
  sed 's/.* \([0-9]*\)ms$/\1/' | \
  sort -rn | head -5

echo
echo "Request Count by IP:"
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$log_file" | sort | uniq -c | sort -rn

This solution demonstrates:

  • grep: Find specific lines (errors, slow requests)

  • sed: Extract relevant data (error types, response times)

  • cut: Isolate columns

  • sort & uniq: Count and aggregate

  • pipes: Chain operations together

These techniques scale from kilobytes to terabytes.

Sample Script: Text Processing Utility Library

A complete utility library for common text processing tasks:

#!/bin/bash

# lib/textutils.sh - Reusable text processing utilities

# Extract CSV column by name
csv_column() {
    local file="$1"
    local column_name="$2"
    
    # Find column number from header
    local col_num=$(head -1 "$file" | tr ',' '\n' | grep -nxF "$column_name" | cut -d: -f1)
    
    if [ -z "$col_num" ]; then
        echo "Column not found: $column_name" >&2
        return 1
    fi
    
    # Extract that column from data rows
    tail -n +2 "$file" | cut -d, -f"$col_num"
}

# Count occurrences of a pattern
count_pattern() {
    local file="$1"
    local pattern="$2"
    
    grep -c "$pattern" "$file"
}

# Extract lines between two patterns
extract_section() {
    local file="$1"
    local start="$2"
    local end="$3"
    
    sed -n "/$start/,/$end/p" "$file"
}

# Convert tab-separated to comma-separated
tsv_to_csv() {
    local file="$1"
    tr '\t' ',' < "$file"
}

# Extract specific fields (given by number) from whitespace-separated text
extract_fields() {
    local file="$1"
    shift
    local -a fields=("$@")

    awk -v fields="${fields[*]}" '
    BEGIN { n = split(fields, f) }
    {
        # Print the requested field numbers for every record
        for (i = 1; i <= n; i++)
            printf "%s%s", $(f[i]), (i < n ? OFS : ORS)
    }' "$file"
}

# Remove empty lines and comments
clean_text() {
    local file="$1"
    grep -v '^[[:space:]]*$' "$file" | grep -v '^[[:space:]]*#'
}

# Sort a CSV numerically by a specific column, keeping the header first
sort_csv_by_column() {
    local file="$1"
    local col="$2"
    
    (head -1 "$file"; tail -n +2 "$file" | sort -t, -k"$col" -n)
}

# Calculate column statistics
column_stats() {
    local file="$1"
    local col="$2"

    # Assumes a numeric column
    local data=$(tail -n +2 "$file" | cut -d, -f"$col")

    local sum=$(echo "$data" | paste -sd+ - | bc)
    local count=$(echo "$data" | wc -l)
    local avg=$(echo "scale=2; $sum / $count" | bc)
    local max=$(echo "$data" | sort -n | tail -1)
    local min=$(echo "$data" | sort -n | head -1)

    echo "Sum: $sum"
    echo "Count: $count"
    echo "Average: $avg"
    echo "Max: $max"
    echo "Min: $min"
}

# Find lines with duplicate values in a field
find_duplicates() {
    local file="$1"
    local col="$2"
    
    cut -d, -f"$col" "$file" | sort | uniq -d
}

# Usage example
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    # Demonstrate the utilities
    echo "Text Processing Utilities Loaded"
    echo "Usage: source textutils.sh"
    echo "Available functions:"
    declare -F | awk '{print "  - " $3}'
fi

What this library demonstrates:

  • Using pipes and text tools for data extraction

  • Pattern matching with grep and sed

  • Field extraction with cut and awk

  • Sorting and counting with sort/uniq

  • Combining tools for complex operations

  • Writing reusable utility functions

These are the patterns you’ll build throughout this chapter.
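To see the csv_column logic from the library in isolation, here is the same header-lookup-then-cut pattern applied step by step to a small inline CSV (the file `people.csv` and its columns are invented for this demo):

```shell
# Hypothetical sample data, just for the demo
cat > people.csv <<'EOF'
name,age,city
alice,30,lisbon
bob,25,oslo
EOF

column_name="age"

# 1. Find the column's position: split the header into lines,
#    match the name literally (-F) and whole-line (-x), keep the line number
col_num=$(head -1 people.csv | tr ',' '\n' | grep -nxF "$column_name" | cut -d: -f1)

# 2. Skip the header, then cut that column from every data row
tail -n +2 people.csv | cut -d, -f"$col_num"
```

The same two-step shape (locate by header, then extract by position) works for any delimited format once you swap the delimiter passed to `tr` and `cut`.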