4.4. Data Tools#
Now you can find data. Time to transform it. This section covers sed and awk, two of the most powerful text-transformation tools in the Unix toolbox, along with their frequent companions sort, uniq, and tr.
4.4.1. Common Pitfalls#
4.4.1.1. ❌ Wrong field separator in awk#
# CSV with commas, but default separator is whitespace
$ awk '{print $2}' data.csv # WRONG
# Specify comma separator
$ awk -F, '{print $2}' data.csv # RIGHT
4.4.1.2. ❌ Forgetting to sort before uniq#
# WRONG: uniq only removes consecutive duplicates
$ uniq file.txt
# RIGHT: Sort first
$ sort file.txt | uniq
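To see why the sort matters, here is a quick demonstration with a throwaway three-line file (created with printf for illustration):

```shell
# A file whose duplicate lines are NOT adjacent
printf 'apple\nbanana\napple\n' > fruit.txt

uniq fruit.txt          # 3 lines: uniq misses the non-adjacent duplicate
sort fruit.txt | uniq   # 2 lines: apple, banana
```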
4.4.1.3. ❌ Assuming sed changes are saved#
# This does NOT save changes
$ sed 's/error/ERROR/' file.txt
# Changes are printed, file unchanged
# Use -i to save in place
$ sed -i 's/error/ERROR/' file.txt
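One portability caveat: the `-i` flag behaves differently between GNU sed (Linux) and BSD sed (macOS). A safe habit on either is to keep a backup:

```shell
# Edit in place but keep a backup copy (GNU sed shown)
sed -i.bak 's/error/ERROR/' file.txt   # original saved as file.txt.bak
# BSD/macOS sed requires an explicit (possibly empty) suffix argument:
#   sed -i '' 's/error/ERROR/' file.txt
```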
4.4.2. Combining Tools#
4.4.2.1. Multi-line Transformation#
$ sed 's/old/new/g' file.txt | awk '{print $1, $NF}' | sort -u
4.4.2.2. Conditional Processing#
# Define threshold with -v; an unset awk variable would silently compare as 0
$ awk -F, -v threshold=30 '{if ($3 > threshold) print $1}' data.csv
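Shell values can be passed into awk with `-v`, which turns them into real awk variables. A sketch, assuming a people.csv laid out as name,age,city:

```shell
min_age=30
awk -F, -v threshold="$min_age" '$2 > threshold {print $1}' people.csv
# threshold now holds the shell value; without -v it would be unset
```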
4.4.3. Common Patterns#
4.4.3.1. Extract and Count#
$ grep "pattern" file.txt | awk '{print $2}' | sort | uniq -c | sort -rn
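Stage by stage, the pipeline above reads as:

```shell
grep "pattern" file.txt |   # keep only matching lines
  awk '{print $2}' |        # take the second field
  sort |                    # group identical values together
  uniq -c |                 # count each group
  sort -rn                  # most frequent first
```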
4.4.4. Quick Reference#
| Tool | Purpose | Example |
|---|---|---|
| sed | Stream substitution | sed 's/old/new/g' file.txt |
| awk | Pattern-action processing | awk '{print $1}' file.txt |
| tr | Character translation | tr 'a-z' 'A-Z' < file.txt |
| sort | Sort lines | sort -n file.txt |
| uniq | Deduplicate | uniq -c file.txt |
| cut | Extract columns | cut -d, -f1 data.csv |
4.4.5. Putting It Together: A Data Pipeline#
# Scenario: Analyze web server logs
# Extract IPs, count requests, sort by frequency
$ awk '{print $1}' access.log \
| sort \
| uniq -c \
| sort -rn
152 192.168.1.1
89 192.168.1.2
34 10.0.0.5
12 172.16.0.1
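The same tally can be computed entirely inside awk with an associative array; a sketch assuming the same access.log (output order is unspecified unless piped through sort):

```shell
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -rn
```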
# Scenario: Convert CSV format
# Input: name,age,city
# Output: name: age-year-old, from city
$ awk -F, '{print $1 ": " $2 "-year-old, from " $3}' people.csv
Alice: 28-year-old, from Portland
Bob: 35-year-old, from Seattle
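When column alignment matters, awk's printf (same format specifiers as C) gives tighter control than string concatenation. A sketch, assuming the same people.csv layout:

```shell
awk -F, '{printf "%-10s %3d  %s\n", $1, $2, $3}' people.csv
# %-10s = left-justified 10-char name, %3d = right-justified age
```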
4.4.6. uniq: Find/Remove Duplicates#
Remove or count repeated lines:
# Remove duplicate consecutive lines
$ uniq file.txt
# Count occurrences
$ uniq -c file.txt
1 apple
3 banana
2 cherry
# Show only duplicates
$ uniq -d file.txt
banana
cherry
# Show only unique lines
$ uniq -u file.txt
apple
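sort and uniq together power the classic word-frequency one-liner. A sketch, assuming plain space-separated prose in file.txt (tr is covered below):

```shell
tr -s ' ' '\n' < file.txt | sort | uniq -c | sort -rn | head -5
# one word per line -> group -> count -> most frequent first
```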
4.4.7. sort: Arrange Data#
Sort lines by various criteria:
# Alphabetical sort
$ sort file.txt
# Numeric sort
$ sort -n numbers.txt
# 1, 2, 10, 20 (not 1, 10, 2, 20)
# Reverse sort
$ sort -r file.txt
# Sort by field
$ sort -k 2 file.txt
# Sort by second field
# Sort CSV by third column (numeric)
$ sort -t, -k3 -n data.csv
# -t, = comma separator
# -k3 = third field
# -n = numeric
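Two more flags worth knowing: `-u` (standard) and `-h` (a GNU coreutils extension):

```shell
sort -u file.txt    # sort and deduplicate in one step (like sort | uniq)
sort -h sizes.txt   # "human numeric": understands 1K, 23M, 2G suffixes
```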
4.4.8. tr: Translate Characters#
Simple character translation:
# Convert lowercase to uppercase
$ tr 'a-z' 'A-Z' < file.txt
# Remove characters
$ tr -d '[:space:]' < file.txt
# Removes all whitespace
# Replace characters
$ tr ';' ',' < data.txt
# Replace semicolons with commas (CSV conversion)
# Extract only digits
$ tr -cd '[:digit:]' < file.txt
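tr can also squeeze runs of a repeated character with `-s`, which is handy for normalizing whitespace before field-splitting:

```shell
tr -s ' ' < file.txt   # collapse runs of spaces into a single space
```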
4.4.8.1. awk: Field Separator#
# CSV file (comma-separated)
$ awk -F, '{print $2}' data.csv
# Tab-separated
$ awk -F'\t' '{print $1, $3}' data.tsv
# Colon-separated (as in /etc/passwd)
$ awk -F':' '{print $1}' /etc/passwd
# -F also accepts a regex, which handles multi-character separators
# Whitespace (default)
$ awk '{print $1}' file.txt # Auto-splits on whitespace
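`-F` sets the input separator; the OFS variable sets the output separator, so awk can convert between formats. A sketch converting CSV to TSV:

```shell
awk -F, 'BEGIN {OFS="\t"} {$1=$1; print}' data.csv
# $1=$1 forces awk to rebuild the record using the new OFS
```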
4.4.8.2. awk: Pattern Examples#
# Print lines where age (field 3) > 30
$ awk '$3 > 30 {print}' people.csv
# Print lines matching regex
$ awk '/error/ {print}' app.log
# Multiple patterns (+0 prints 0 instead of blank when a counter was never set)
$ awk '/ERROR/ {errors++} /WARNING/ {warnings++} END {print errors+0, warnings+0}' app.log
# Range of lines
$ awk '/START/,/END/ {print}' file.txt
# Print from line matching START to line matching END
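The current line number is available as NR, so ranges can also be numeric:

```shell
awk 'NR >= 5 && NR <= 10' file.txt
# print lines 5-10 (a missing action defaults to print)
```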
4.4.8.3. awk: Real-World Examples#
# Extract usernames from /etc/passwd
$ awk -F: '{print $1}' /etc/passwd
# -F: = use colon as field separator
root
daemon
alice
# Parse web server logs (space-separated)
# 192.168.1.1 - - [15/Jan/2025:10:30:45] "GET /index.html" 200 1234
$ awk '{print $1}' access.log
# IP addresses:
192.168.1.1
192.168.1.2
10.0.0.5
# Count requests per IP
$ awk '{print $1}' access.log | sort | uniq -c
42 192.168.1.1
15 192.168.1.2
8 10.0.0.5
# Sum file sizes (last column)
$ awk '{sum += $NF} END {print "Total: " sum " bytes"}' access.log
# Filter and reformat
$ awk -F, '{if ($3 > 25) print $1 ", age " $3}' people.csv
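Extending the summing idea, an END block can compute aggregates such as an average (with a guard so an empty file does not divide by zero):

```shell
awk '{sum += $NF; n++} END {if (n) print "Average:", sum / n}' access.log
```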
4.4.8.4. awk: Simple Usage#
# Print entire file (default action)
$ awk '{print}' file.txt
# Same as: cat file.txt
# Print specific column (space-separated)
$ awk '{print $1}' file.txt
# $1 = first field
# $2 = second field
# $NF = last field
# Print with condition
$ awk '$1 > 10 {print}' numbers.txt
# Print lines where first field > 10
# Sum numbers in first column
$ awk '{sum += $1} END {print sum}' numbers.txt
# BEGIN = run before processing
# END = run after processing
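The BEGIN block mentioned above runs before any input is read, which makes it a natural place for headers or setup:

```shell
awk 'BEGIN {print "value"} {print $1}' numbers.txt
# prints a header line, then the first field of every line
```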
4.4.8.5. awk: Basic Syntax#
awk 'pattern { action }' file.txt
4.4.9. awk: Pattern-Action Language#
awk is more powerful: it is a full programming language built around pattern-action rules for text processing.
4.4.9.1. More sed Commands#
# Delete lines matching pattern
$ sed '/pattern/d' file.txt
# Delete lines 5-10
$ sed '5,10d' file.txt
# Print only lines 1-5
$ sed -n '1,5p' file.txt
# Insert text before line
$ sed '/pattern/i\NEW LINE' file.txt
# Append text after line
$ sed '/pattern/a\NEW LINE' file.txt
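Multiple sed commands can be chained in one invocation with `-e` (or separated by semicolons):

```shell
sed -e '/^#/d' -e 's/foo/bar/g' file.txt
# first drop comment lines, then substitute on what remains
```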
4.4.9.2. sed: Real-World Examples#
# Fix a common typo throughout a document
$ sed -i 's/recieve/receive/g' document.txt
# Remove leading/trailing whitespace
$ sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
# Convert DOS line endings to Unix
$ sed 's/\r$//' dos_file.txt > unix_file.txt
# Prefix every line with a tag
$ sed 's/^/[LINE] /' file.txt
# Adds "[LINE] " to start of each line (for real line numbers, use nl or cat -n)
# Extract version from package info
$ sed -n 's/Version: \(.*\)/\1/p' package.info
# \(.*\) captures content
# \1 refers to first capture group
4.4.9.3. sed Substitution (the s command)#
The most common operation:
# Replace first "error" with "ERROR"
$ sed 's/error/ERROR/' file.txt
# s = substitute
# / = delimiter
# error = find this
# ERROR = replace with this
# Replace ALL occurrences on each line
$ sed 's/error/ERROR/g' file.txt
# g = global (all on line, not just first)
# Case-insensitive replacement (the i flag is a GNU extension, not POSIX)
$ sed 's/error/ERROR/i' file.txt
# Show only lines that changed
$ sed -n 's/error/ERROR/p' file.txt
# -n = quiet mode
# p = print changed lines
# Replace and save to file
$ sed 's/error/ERROR/g' file.txt > output.txt
# Or edit in place:
$ sed -i 's/error/ERROR/g' file.txt # Modifies file!
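The delimiter after s does not have to be a slash; any character works, which avoids escaping when the pattern itself contains slashes:

```shell
sed 's|/usr/bin|/usr/local/bin|' paths.txt
# | as delimiter keeps the path slashes readable
```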
4.4.9.4. sed: Basic Syntax#
sed [options] 'command' file.txt
4.4.10. sed: Stream Editor#
sed modifies text streams without opening an editor.
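A minimal first taste of that idea (substitution is covered in detail above):

```shell
echo "hello world" | sed 's/world/sed/'
# hello sed
```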