4.4. Data Tools#
Now you can find data. Time to transform it. This section covers sed and awk, two of the most powerful text-transformation tools in the Unix toolbox, along with their frequent companions sort, uniq, and tr.
4.4.1. Common Pitfalls#
4.4.1.1. ❌ Wrong field separator in awk#
# CSV with commas, but default separator is whitespace
$ awk '{print $2}' data.csv # WRONG
# Specify comma separator
$ awk -F, '{print $2}' data.csv # RIGHT
4.4.1.2. ❌ Forgetting to sort before uniq#
# WRONG: uniq only removes consecutive duplicates
$ uniq file.txt
# RIGHT: Sort first
$ sort file.txt | uniq
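To see why the sort matters, here is a quick demonstration with a throwaway three-line file (created with printf for illustration):

```shell
# A file whose duplicate lines are NOT adjacent
printf 'apple\nbanana\napple\n' > fruit.txt

uniq fruit.txt          # 3 lines: uniq misses the non-adjacent duplicate
sort fruit.txt | uniq   # 2 lines: apple, banana
```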
4.4.1.3. ❌ Assuming sed changes are saved#
# This does NOT save changes
$ sed 's/error/ERROR/' file.txt
# Changes are printed, file unchanged
# Use -i to save in place
$ sed -i 's/error/ERROR/' file.txt
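One portability caveat: the `-i` flag behaves differently between GNU sed (Linux) and BSD sed (macOS). A safe habit on either is to keep a backup:

```shell
# Edit in place but keep a backup copy (GNU sed shown)
sed -i.bak 's/error/ERROR/' file.txt   # original saved as file.txt.bak
# BSD/macOS sed requires an explicit (possibly empty) suffix argument:
#   sed -i '' 's/error/ERROR/' file.txt
```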
4.4.2. Combining Tools#
4.4.2.1. Multi-line Transformation#
$ sed 's/old/new/g' file.txt | awk '{print $1, $NF}' | sort -u
4.4.2.2. Conditional Processing#
# Define threshold with -v; an unset awk variable would silently compare as 0
$ awk -F, -v threshold=30 '{if ($3 > threshold) print $1}' data.csv
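Shell values can be passed into awk with `-v`, which turns them into real awk variables. A sketch, assuming a people.csv laid out as name,age,city:

```shell
min_age=30
awk -F, -v threshold="$min_age" '$2 > threshold {print $1}' people.csv
# threshold now holds the shell value; without -v it would be unset
```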
4.4.3. Common Patterns#
4.4.3.1. Extract and Count#
$ grep "pattern" file.txt | awk '{print $2}' | sort | uniq -c | sort -rn
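Stage by stage, the pipeline above reads as:

```shell
grep "pattern" file.txt |   # keep only matching lines
  awk '{print $2}' |        # take the second field
  sort |                    # group identical values together
  uniq -c |                 # count each group
  sort -rn                  # most frequent first
```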
4.4.4. Quick Reference#
| Tool | Purpose | Example |
|---|---|---|
| sed | Stream substitution | sed 's/old/new/g' file.txt |
| awk | Pattern-action processing | awk '{print $1}' file.txt |
| tr | Character translation | tr 'a-z' 'A-Z' < file.txt |
| sort | Sort lines | sort -n file.txt |
| uniq | Deduplicate | uniq -c file.txt |
| cut | Extract columns | cut -d, -f1 data.csv |
4.4.5. Putting It Together: A Data Pipeline#
# Scenario: Analyze web server logs
# Extract IPs, count requests, sort by frequency
$ awk '{print $1}' access.log \
| sort \
| uniq -c \
| sort -rn
152 192.168.1.1
89 192.168.1.2
34 10.0.0.5
12 172.16.0.1
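The same tally can be computed entirely inside awk with an associative array; a sketch assuming the same access.log (output order is unspecified unless piped through sort):

```shell
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -rn
```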
# Scenario: Convert CSV format
# Input: name,age,city
# Output: name: age-year-old, from city
$ awk -F, '{print $1 ": " $2 "-year-old, from " $3}' people.csv
Alice: 28-year-old, from Portland
Bob: 35-year-old, from Seattle
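When column alignment matters, awk's printf (same format specifiers as C) gives tighter control than string concatenation. A sketch, assuming the same people.csv layout:

```shell
awk -F, '{printf "%-10s %3d  %s\n", $1, $2, $3}' people.csv
# %-10s = left-justified 10-char name, %3d = right-justified age
```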
4.4.6. uniq: Find/Remove Duplicates#
Remove or count repeated lines:
# Remove duplicate consecutive lines
$ uniq file.txt
# Count occurrences
$ uniq -c file.txt
1 apple
3 banana
2 cherry
# Show only duplicates
$ uniq -d file.txt
banana
cherry
# Show only unique lines
$ uniq -u file.txt
apple
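sort and uniq together power the classic word-frequency one-liner. A sketch, assuming plain space-separated prose in file.txt (tr is covered below):

```shell
tr -s ' ' '\n' < file.txt | sort | uniq -c | sort -rn | head -5
# one word per line -> group -> count -> most frequent first
```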
4.4.7. sort: Arrange Data#
Sort lines by various criteria:
# Alphabetical sort
$ sort file.txt
# Numeric sort
$ sort -n numbers.txt
# 1, 2, 10, 20 (not 1, 10, 2, 20)
# Reverse sort
$ sort -r file.txt
# Sort by field
$ sort -k 2 file.txt
# Sort by second field
# Sort CSV by third column (numeric)
$ sort -t, -k3 -n data.csv
# -t, = comma separator
# -k3 = third field
# -n = numeric
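Two more flags worth knowing: `-u` (standard) and `-h` (a GNU coreutils extension):

```shell
sort -u file.txt    # sort and deduplicate in one step (like sort | uniq)
sort -h sizes.txt   # "human numeric": understands 1K, 23M, 2G suffixes
```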
4.4.8. tr: Translate Characters#
Simple character translation:
# Convert lowercase to uppercase
$ tr 'a-z' 'A-Z' < file.txt
# Remove characters
$ tr -d '[:space:]' < file.txt
# Removes all whitespace
# Replace characters
$ tr ';' ',' < data.txt
# Replace semicolons with commas (CSV conversion)
# Extract only digits
$ tr -cd '[:digit:]' < file.txt
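tr can also squeeze runs of a repeated character with `-s`, which is handy for normalizing whitespace before field-splitting:

```shell
tr -s ' ' < file.txt   # collapse runs of spaces into a single space
```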
4.4.8.1. awk: Field Separator#
# CSV file (comma-separated)
$ awk -F, '{print $2}' data.csv
# Tab-separated
$ awk -F'\t' '{print $1, $3}' data.tsv
# Colon-separated (as in /etc/passwd)
$ awk -F':' '{print $1}' /etc/passwd
# -F also accepts a regex, which handles multi-character separators
# Whitespace (default)
$ awk '{print $1}' file.txt # Auto-splits on whitespace
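`-F` sets the input separator; the OFS variable sets the output separator, so awk can convert between formats. A sketch converting CSV to TSV:

```shell
awk -F, 'BEGIN {OFS="\t"} {$1=$1; print}' data.csv
# $1=$1 forces awk to rebuild the record using the new OFS
```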
4.4.8.2. awk: Pattern Examples#
# Print lines where age (field 3) > 30
$ awk '$3 > 30 {print}' people.csv
# Print lines matching regex
$ awk '/error/ {print}' app.log
# Multiple patterns (+0 prints 0 instead of blank when a counter was never set)
$ awk '/ERROR/ {errors++} /WARNING/ {warnings++} END {print errors+0, warnings+0}' app.log
# Range of lines
$ awk '/START/,/END/ {print}' file.txt
# Print from line matching START to line matching END
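The current line number is available as NR, so ranges can also be numeric:

```shell
awk 'NR >= 5 && NR <= 10' file.txt
# print lines 5-10 (a missing action defaults to print)
```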
4.4.8.3. awk: Real-World Examples#
# Extract usernames from /etc/passwd
$ awk -F: '{print $1}' /etc/passwd
# -F: = use colon as field separator
root
daemon
alice
# Parse web server logs (space-separated)
# 192.168.1.1 - - [15/Jan/2025:10:30:45] "GET /index.html" 200 1234
$ awk '{print $1}' access.log
# IP addresses:
192.168.1.1
192.168.1.2
10.0.0.5
# Count requests per IP
$ awk '{print $1}' access.log | sort | uniq -c
42 192.168.1.1
15 192.168.1.2
8 10.0.0.5
# Sum file sizes (last column)
$ awk '{sum += $NF} END {print "Total: " sum " bytes"}' access.log
# Filter and reformat
$ awk -F, '{if ($3 > 25) print $1 ", age " $3}' people.csv
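Extending the summing idea, an END block can compute aggregates such as an average (with a guard so an empty file does not divide by zero):

```shell
awk '{sum += $NF; n++} END {if (n) print "Average:", sum / n}' access.log
```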
4.4.8.4. awk: Simple Usage#
# Print entire file (default action)
$ awk '{print}' file.txt
# Same as: cat file.txt
# Print specific column (space-separated)
$ awk '{print $1}' file.txt
# $1 = first field
# $2 = second field
# $NF = last field
# Print with condition
$ awk '$1 > 10 {print}' numbers.txt
# Print lines where first field > 10
# Sum numbers in first column
$ awk '{sum += $1} END {print sum}' numbers.txt
# BEGIN = run before processing
# END = run after processing
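The BEGIN block mentioned above runs before any input is read, which makes it a natural place for headers or setup:

```shell
awk 'BEGIN {print "value"} {print $1}' numbers.txt
# prints a header line, then the first field of every line
```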
4.4.8.5. awk: Basic Syntax#
awk 'pattern { action }' file.txt
4.4.9. awk: Pattern-Action Language#
awk is more powerful: it is a full programming language built around pattern-action rules for text processing.
4.4.9.1. More sed Commands#
# Delete lines matching pattern
$ sed '/pattern/d' file.txt
# Delete lines 5-10
$ sed '5,10d' file.txt
# Print only lines 1-5
$ sed -n '1,5p' file.txt
# Insert text before line
$ sed '/pattern/i\NEW LINE' file.txt
# Append text after line
$ sed '/pattern/a\NEW LINE' file.txt
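Multiple sed commands can be chained in one invocation with `-e` (or separated by semicolons):

```shell
sed -e '/^#/d' -e 's/foo/bar/g' file.txt
# first drop comment lines, then substitute on what remains
```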
4.4.9.2. sed: Real-World Examples#
# Fix a common typo throughout a document
$ sed -i 's/recieve/receive/g' document.txt
# Remove leading/trailing whitespace
$ sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
# Convert DOS line endings to Unix
$ sed 's/\r$//' dos_file.txt > unix_file.txt
# Prefix every line with a tag
$ sed 's/^/[LINE] /' file.txt
# Adds "[LINE] " to start of each line (for real line numbers, use nl or cat -n)
# Extract version from package info
$ sed -n 's/Version: \(.*\)/\1/p' package.info
# \(.*\) captures content
# \1 refers to first capture group
4.4.9.3. sed Substitution (the s command)#
The most common operation:
# Replace first "error" with "ERROR"
$ sed 's/error/ERROR/' file.txt
# s = substitute
# / = delimiter
# error = find this
# ERROR = replace with this
# Replace ALL occurrences on each line
$ sed 's/error/ERROR/g' file.txt
# g = global (all on line, not just first)
# Case-insensitive replacement (the i flag is a GNU extension, not POSIX)
$ sed 's/error/ERROR/i' file.txt
# Show only lines that changed
$ sed -n 's/error/ERROR/p' file.txt
# -n = quiet mode
# p = print changed lines
# Replace and save to file
$ sed 's/error/ERROR/g' file.txt > output.txt
# Or edit in place:
$ sed -i 's/error/ERROR/g' file.txt # Modifies file!
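The delimiter after s does not have to be a slash; any character works, which avoids escaping when the pattern itself contains slashes:

```shell
sed 's|/usr/bin|/usr/local/bin|' paths.txt
# | as delimiter keeps the path slashes readable
```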
4.4.9.4. sed: Basic Syntax#
sed [options] 'command' file.txt
4.4.10. sed: Stream Editor#
sed modifies text streams without opening an editor.
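A minimal first taste of that idea (substitution is covered in detail above):

```shell
echo "hello world" | sed 's/world/sed/'
# hello sed
```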