awk and grep Commands for Log Filtering
When a crawl-budget problem lands on your desk, the fastest path to an answer is rarely a dashboard — it is a grep and an awk against the raw access log on the box. This page is a working reference for filtering server logs from the command line: isolating a single search engine crawler, extracting the fields that matter, and aggregating dead paths and noisy IPs into a ranked list you can act on in minutes. It belongs to the broader CLI one-liners for quick audits toolkit, and the commands here are the building blocks that the more specialized recipes — like finding the top 404 URLs with awk and extracting the top bot user agents from logs — compose from.
The objective is narrow and practical: take a multi-gigabyte combined-format access log and, with no database and no ingestion pipeline, produce three things — a filtered stream of verified crawler hits, a ranked list of paths wasting crawl budget on 4xx/5xx responses, and a count of high-frequency IPs worth investigating. Each command below shows its expected output so you can confirm the pipeline behaves before you trust its numbers.
Diagnosis: Confirm the Log Format Before You Filter
Every field index in this guide assumes the combined log format. Before running anything, confirm your log actually matches it, because a custom log_format directive silently shifts $7 and $9 and corrupts every count downstream.
head -n 1 /var/log/nginx/access.log
Expected Output:
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
In this layout, whitespace-split fields are: $1 = client IP, $4 = timestamp (with a leading [), $7 = request URI, $9 = HTTP status, $10 = bytes sent, and the final quoted token is the user-agent. Confirm your line matches this shape; if the URI sits somewhere other than $7, decode the format first using understanding HTTP status codes in server logs as the field reference, and adjust the indices accordingly.
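A one-line eyeball check can miss a format that only differs on some lines. The following is a minimal sketch of a whole-file sanity check, assuming the combined layout described above; the sample file and the check_format function name are hypothetical, and you would point LOG at your real log instead.

```shell
# Hypothetical two-line sample; replace LOG with your real access log path.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.55 - - [19/Jun/2026:08:14:25 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "curl/8.5.0"
EOF

# Count lines where $9 is not a three-digit status or $7 does not start with "/".
check_format() {
  awk '$9 !~ /^[1-5][0-9][0-9]$/ || $7 !~ /^\// {bad++}
       END {printf "%d of %d lines do not match combined format\n", bad, NR}' "$1"
}

check_format "$LOG"
# prints: 0 of 2 lines do not match combined format
```

A small non-zero count is not necessarily a custom format: requests such as OPTIONS * or malformed probes also fail the $7 test. A count close to the total line count is the signal that the indices are shifted.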
Concept: Why grep First, Then awk
grep and awk overlap, but they have different strengths and the order matters for speed. grep -F does fixed-string matching with no regex engine and is the fastest way to discard the bulk of irrelevant lines. awk is a field-aware processor: once a line survives the filter, awk splits it on whitespace and lets you address columns, compare numbers, and accumulate counts in a single pass.
The performant pattern is therefore cheap filter, then field work: let grep -F (or zgrep -F) discard the bulk of the file as raw bytes, then hand the small survivor stream to awk for column extraction. Doing everything in awk alone works, but awk then has to read and evaluate every line itself, including the ones you were going to discard. On a multi-gigabyte log that difference can run to minutes. The companion CLI one-liners for quick audits cluster covers the streaming and compression techniques that keep these passes memory-flat.
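To make the two approaches concrete, here is a small sketch that runs both against the same input and shows they produce the same result. The sample log and the function names grep_then_awk and awk_only are illustrative assumptions, not part of any toolkit.

```shell
# Hypothetical three-line sample log.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
203.0.113.55 - - [19/Jun/2026:08:14:25 +0000] "GET /admin HTTP/1.1" 403 312 "-" "curl/8.5.0"
66.249.66.4 - - [19/Jun/2026:08:15:01 +0000] "GET /products/ HTTP/1.1" 200 9000 "-" "Googlebot/2.1"
EOF

# Cheap filter first: grep discards non-matching lines before awk sees them.
grep_then_awk() { grep -F "Googlebot" "$1" | awk '{print $7}'; }

# awk alone: same result, but awk has to read and test every line itself.
awk_only() { awk '/Googlebot/ {print $7}' "$1"; }

grep_then_awk "$LOG"
# prints:
# /blog/
# /products/
```

On a real log, prefixing each call with time is a quick way to see which variant is faster on your hardware and your awk implementation.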
Step-by-Step: Build the Core Filters
Step 1: Isolate a single crawler's requests.
Match the user-agent token with a fixed-string grep, then project the three fields you want to aggregate on: client IP, request URI, and status. Leave the timestamp ($4) out of the projection; it differs on almost every line, so including it would stop uniq -c from collapsing repeat requests into a count.
grep -F "Googlebot" access.log \
| awk '{print $1, $7, $9}' \
| sort | uniq -c | sort -nr | head -20
Expected Output:
412 66.249.66.1 /blog/ 200
188 66.249.66.4 /products/ 200
27 66.249.66.1 /old-page/ 404
The leading column is the occurrence count. A healthy crawler stream is dominated by 200 and 304; a visible band of 404 here is your first crawl-waste signal. Note that a user-agent string is trivially spoofed — treat this as a triage filter, not proof of identity, and confirm real crawlers with the reverse-DNS workflow in identifying search engine bots in server logs.
Step 2: Rank the paths returning errors.
Skip grep entirely here — the status filter is a numeric comparison on a field, which is awk's job. Match any 4xx or 5xx code, then aggregate by URI.
awk '$9 ~ /^[45][0-9][0-9]$/ {print $7}' access.log \
| sort | uniq -c | sort -nr | head -15
Expected Output:
1043 /old-product-page
612 /category/discontinued
288 /wp-login.php
Explanation: $9 ~ /^[45][0-9][0-9]$/ keeps only lines whose status field starts with 4 or 5 and is exactly three digits, avoiding false matches on byte counts. The result is a prioritized worklist: the top URIs are where redirects or 410s will reclaim the most budget. This is the foundation that finding the top 404 URLs with awk extends with status-specific breakdowns.
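The same anchored match can also feed an associative array, which gives a per-status breakdown in one pass. This is a sketch under the combined-format assumption; the sample data and the error_paths function are hypothetical. Note the first sample line has a 404-byte body with a 200 status, to show that the field match ignores it.

```shell
# Hypothetical sample: line 1 is a 200 response that happens to be 404 bytes long.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 404 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:15:10 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.4 - - [19/Jun/2026:08:15:11 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.4 - - [19/Jun/2026:08:16:02 +0000] "GET /api/feed HTTP/1.1" 503 500 "-" "Googlebot/2.1"
EOF

# Count error responses keyed on "status URI", highest count first.
error_paths() {
  awk '$9 ~ /^[45][0-9][0-9]$/ {n[$9 " " $7]++}
       END {for (k in n) print n[k], k}' "$1" | sort -k1,1nr -k2
}

error_paths "$LOG"
# prints:
# 2 404 /old-page/
# 1 503 /api/feed
```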
Step 3: Flag high-frequency IPs from compressed archives.
Rotated logs are usually gzipped. zgrep/zcat stream them without a manual decompress, keeping the pass memory-flat. Here we count successful requests per IP and surface any address exceeding 500 hits in the archive window.
zgrep -F " 200 " access.log.gz \
| awk '$9 == 200 {print $1}' \
| sort | uniq -c \
| awk '$1 > 500 {print $2, $1}' \
| sort -k2,2nr
Expected Output:
66.249.66.1 1820
203.0.113.55 740
Explanation: zgrep -F " 200 " is only the cheap prefilter, since that substring also matches a 200-byte response; the first awk confirms the status in $9 and projects the IP, sort | uniq -c counts occurrences, the second awk keeps only IPs above the threshold, and the final sort puts the busiest address first. A legitimate crawler IP (Google's 66.249.x.x range) at the top is expected; an unfamiliar address at high volume is a candidate for rate-limiting or a deeper look with the user-agent breakdown in extracting the top bot user agents from logs.
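If you find yourself changing the threshold often, the counting can be folded into a single awk pass with the threshold passed in as a variable. The sketch below is one possible variant, not the pipeline above: it swaps zgrep for gzip -dc and does the status check in awk. The busy_ips function and the tiny gzipped fixture are hypothetical.

```shell
# Hypothetical fixture: a tiny gzipped log standing in for a rotated archive.
ARCHIVE=$(mktemp)
trap 'rm -f "$ARCHIVE"' EXIT
{
  i=0
  while [ "$i" -lt 6 ]; do
    echo '66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"'
    i=$((i + 1))
  done
  echo '203.0.113.55 - - [19/Jun/2026:08:14:25 +0000] "GET /x HTTP/1.1" 200 312 "-" "curl/8.5.0"'
  # Status 404 with a 200-byte body: must not be counted as a success.
  echo '198.51.100.7 - - [19/Jun/2026:08:14:26 +0000] "GET /y HTTP/1.1" 404 200 "-" "curl/8.5.0"'
} | gzip -c > "$ARCHIVE"

# busy_ips FILE THRESHOLD: IPs with more than THRESHOLD status-200 hits, busiest first.
busy_ips() {
  gzip -dc < "$1" \
    | awk -v min="$2" '$9 == 200 {hits[$1]++}
        END {for (ip in hits) if (hits[ip] > min) print ip, hits[ip]}' \
    | sort -k2,2nr
}

busy_ips "$ARCHIVE" 5
# prints: 66.249.66.1 6
```

The trade-off is memory: the array holds one entry per distinct IP, which is bounded in practice but larger than what sort | uniq -c keeps in RAM when sort spills to disk.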
Edge Cases and Gotchas
Gotcha 1: a custom log_format shifts your field indices.
If your nginx config defines a non-default log_format — for example, prepending $host or $request_time — then $7 is no longer the URI and every count silently lies. The robust fix is to anchor on the quoted request string rather than positional fields:
awk -F'"' '{split($2, r, " "); print r[2]}' access.log | head -3
Expected Output:
/blog/
/products/
/old-page/
Splitting on the double-quote delimiter isolates the "GET /path HTTP/1.1" token regardless of how many fields precede it, then split pulls the URI out of it. Use this whenever you cannot guarantee combined format.
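The same quote-delimited split can recover the status as well, since the text between the request string and the referer is the status and byte count. A sketch, assuming a hypothetical custom format that prepends $host and otherwise keeps the combined-format tail; uri_and_status is an illustrative name.

```shell
# Hypothetical custom format: $host prepended, so positional $7/$9 are shifted by one.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
example.com 66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
example.com 66.249.66.1 - - [19/Jun/2026:08:16:44 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
EOF

# With -F'"', $2 is the request line and $3 is " status bytes ".
uri_and_status() {
  awk -F'"' '{split($2, req, " "); split($3, resp, " "); print req[2], resp[1]}' "$1"
}

uri_and_status "$LOG"
# prints:
# /blog/ 200
# /old-page/ 404
```

This only holds while the status immediately follows the quoted request; a format that inserts another field there needs its own index.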
Gotcha 2: grep -P backtracking stalls on multi-GB logs.
A Perl-regex pattern like grep -P "(Googlebot|bingbot).*\d+" runs on a backtracking engine, and the unbounded .* can backtrack heavily on long lines, pinning a CPU for minutes on a large log. For literal tokens use grep -F; when you genuinely need alternation, use grep -E with an anchored ERE and no unbounded .*. Reserve -P for cases where ERE truly cannot express the pattern.
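As an illustration of the alternatives, the sketch below matches two crawler tokens without -P, once with an ERE alternation and once with multiple fixed-string patterns via -e. The sample log, the 192.0.2.10 address, and the function names are assumptions for the example.

```shell
# Hypothetical sample with two crawler user-agents and one non-crawler.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [19/Jun/2026:08:14:23 +0000] "GET /products/ HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
203.0.113.55 - - [19/Jun/2026:08:14:25 +0000] "GET /admin HTTP/1.1" 403 312 "-" "curl/8.5.0"
EOF

# ERE alternation, no unbounded .* and no Perl engine.
crawlers_ere() { grep -E 'Googlebot|bingbot' "$1" | awk '{print $1, $7}'; }

# Same match with fixed strings only: several -e patterns are OR-ed together.
crawlers_fixed() { grep -F -e 'Googlebot' -e 'bingbot' "$1" | awk '{print $1, $7}'; }

crawlers_ere "$LOG"
# prints:
# 66.249.66.1 /blog/
# 192.0.2.10 /products/
```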
Verification
Confirm a filter end-to-end before trusting it: a crawler stream should be overwhelmingly successful responses. This counts the status-code distribution for Googlebot in one pass.
grep -F "Googlebot" access.log \
| awk '{count[$9]++} END {for (s in count) print s, count[s]}' \
| sort -k2 -nr
Expected Output:
200 9841
304 1203
404 88
301 14
If 200/304 dominate and 4xx is a thin tail, the filter is sound and your crawl health is good. If 4xx rivals 2xx, you have real crawl-budget waste — feed the Step 2 list into your redirect plan. Cross-check the total against Google Search Console crawl stats for the same window; they should be the same order of magnitude.
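To turn "thin tail" into a number, the distribution can be reduced to a single error share. This is a rough sketch that counts 4xx and 5xx together; the crawl_error_share function and sample data are hypothetical, and what share counts as acceptable is a judgment call for your site.

```shell
# Hypothetical sample: four Googlebot hits (one 404) and one unrelated 500.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:30 +0000] "GET /about/ HTTP/1.1" 304 0 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:16:44 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.4 - - [19/Jun/2026:08:17:01 +0000] "GET /products/ HTTP/1.1" 200 9000 "-" "Googlebot/2.1"
203.0.113.55 - - [19/Jun/2026:08:17:05 +0000] "GET /api HTTP/1.1" 500 120 "-" "curl/8.5.0"
EOF

# Share of Googlebot-labelled requests that ended in a 4xx or 5xx status.
crawl_error_share() {
  grep -F "Googlebot" "$1" \
    | awk '{total++}
           $9 ~ /^[45][0-9][0-9]$/ {err++}
           END {if (total) printf "%d/%d error responses (%.1f%%)\n", err, total, 100 * err / total}'
}

crawl_error_share "$LOG"
# prints: 1/4 error responses (25.0%)
```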
Common Mistakes
- Treating the user-agent as proof of identity. A grep -F "Googlebot" filter catches spoofers too. Use it to triage, then verify the actual crawler IPs by reverse DNS before acting on the numbers; the procedure lives in identifying search engine bots in server logs.
- Matching bare numbers without anchoring. grep "404" also matches a 404-byte response or a /page404 URI. Always anchor the status with awk '$9 == "404"' or the regex /^404$/ so you filter on the field, not a substring.
- Re-running multiple awk passes on the uncompressed file. Each extra cat | awk re-reads the whole log. Consolidate projection, filtering, and counting into a single awk '{...}' block (as the Verification command does with one associative array) to halve the I/O.
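The unanchored-match mistake is easy to reproduce. The sketch below builds a hypothetical three-line log where only one request is a real 404, then compares a substring grep with a field match.

```shell
# Hypothetical sample: only the last line has status 404.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /blog/ HTTP/1.1" 200 404 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:15:10 +0000] "GET /page404 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:16:44 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
EOF

# Substring match: counts the byte size and the URI as well as the status.
substring_count() { grep -c "404" "$1"; }

# Field match: counts only lines whose status field is exactly 404.
field_count() { awk '$9 == "404" {n++} END {print n + 0}' "$1"; }

echo "substring matches: $(substring_count "$LOG")"
echo "field matches: $(field_count "$LOG")"
# prints:
# substring matches: 3
# field matches: 1
```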
Frequently Asked Questions
How do I keep awk and grep from exhausting memory on 10GB+ logs?
Stream, never load. awk already processes line-by-line with its default record separator, so the only memory it holds is whatever you accumulate in arrays — keep those keyed on bounded values (status codes, IPs) rather than unbounded ones (full URLs with query strings). Pre-filter with grep -F before awk, read compressed archives directly with zgrep/zcat, and pipe straight into sort/uniq instead of writing intermediate files.
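One way to keep array keys bounded is to strip the query string before counting, so thousands of parameter variants collapse into one path. A minimal sketch, assuming the URI is in $7; path_counts and the sample data are hypothetical.

```shell
# Hypothetical sample: three query-string variants of one path, plus one other path.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /search?q=a HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:23 +0000] "GET /search?q=b HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:24 +0000] "GET /search?q=c HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:25 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
EOF

# Drop everything from the first "?" so the array is keyed on the bare path.
path_counts() {
  awk '{path = $7; sub(/\?.*/, "", path); n[path]++}
       END {for (p in n) print n[p], p}' "$1" | sort -k1,1nr -k2
}

path_counts "$LOG"
# prints:
# 3 /search
# 1 /blog/
```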
Can these one-liners actually quantify crawl budget waste?
Yes, within their scope. Filtering verified-crawler 200/304 responses against 404/5xx paths gives you a concrete count of wasted requests and the exact URIs causing them. Exclude static assets (CSS/JS/images) unless they are blocking indexation, and validate the totals against Search Console crawl stats. For trend analysis over time you will want a real store, but for a point-in-time audit these commands are sufficient.
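Excluding static assets can be done in the same awk pass as the status filter. The sketch below is one way to do it; the extension list is an assumption you should adapt to your site, and wasted_pages is a hypothetical function name.

```shell
# Hypothetical sample: two asset 404s, one page 404, one successful page hit.
LOG=$(mktemp)
trap 'rm -f "$LOG"' EXIT
cat > "$LOG" <<'EOF'
66.249.66.1 - - [19/Jun/2026:08:14:22 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:23 +0000] "GET /static/app.js?v=3 HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:24 +0000] "GET /img/logo.PNG HTTP/1.1" 404 312 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:08:14:25 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
EOF

# Count 4xx/5xx responses per path, skipping common static-asset extensions.
wasted_pages() {
  awk '$9 ~ /^[45][0-9][0-9]$/ {
         path = $7
         sub(/\?.*/, "", path)   # ignore cache-busting query strings
         if (tolower(path) !~ /\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$/) n[path]++
       }
       END {for (p in n) print n[p], p}' "$1" | sort -k1,1nr -k2
}

wasted_pages "$LOG"
# prints: 1 /old-page/
```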
How do I handle combined versus common log format in awk?
For combined logs the indices are $1 IP, $4 timestamp, $7 URI, $9 status. The common (CLF) format omits the referer and user-agent, so those trailing fields simply do not exist — the leading fields are unchanged. When you are unsure, parse the quoted request with awk -F'"' '{print $2}' to recover the method and URI reliably regardless of how many leading fields precede it.
Related Guides
- Finding the Top 404 URLs with awk — extends the error-path aggregation here into a status-specific crawl-waste worklist.
- Extracting the Top Bot User Agents from Logs — rank the agents hitting your site to separate real crawlers from scrapers.
- Identifying Search Engine Bots in Server Logs — verify a crawler's identity by reverse DNS before trusting a user-agent filter.
- Understanding HTTP Status Codes in Server Logs — interpret the status classes your awk filters count.
Part of the CLI One-Liners for Quick Audits series.