Finding Redirect Chains in Server Logs with awk
A redirect chain is any URL that resolves through two or more hops before reaching a 200 OK final destination. Each extra hop costs latency, dilutes link equity, and burns crawl budget — Googlebot has to issue a fresh request for every 301/302 it follows. The good news is that your access log already records every hop a crawler made; you just have to reconstruct the sequence. This guide shows how to surface 3xx responses with awk and grep, rank the worst offenders, and stitch single-request lines back into multi-hop chains so you can fix them at the source as part of broader redirect chain optimization.
What you will accomplish:
- Isolate and count 3xx responses by requested URL straight from a combined-format log
- Reconstruct multi-hop chains by correlating a single crawler's sequential requests
- Flag URLs whose redirect target is itself a redirect: the true chain offenders
Diagnosis: Confirming Redirects Exist in the Log
Start by confirming the access log actually carries 3xx responses and is in the expected combined format. In the standard Nginx/Apache combined layout, field $7 is the requested path and field $9 is the status code. A quick frequency sweep tells you whether redirects are a rounding error or a systemic problem.
awk '$9 ~ /^3[0-9][0-9]$/ {print $9}' access.log | sort | uniq -c | sort -nr
Expected Output:
48213 301
9044 302
312 304
27 307
If 301 and 302 counts run into the thousands while your site has only a handful of intentional redirects, crawlers are repeatedly requesting stale paths. The 304 Not Modified responses are benign conditional-GET cache hits, so ignore them. To understand exactly what each code means before acting, review HTTP status codes in server logs; a 301 and a 302 behave very differently for indexation.
Concept: Why a Single Line Is Not a Chain
A redirect chain is A → B → C → 200, but a log line only ever records one hop: the request for A and the 301 the server returned. The Location header — the target the crawler was sent to — is not in the default access log. So you reconstruct chains in one of two ways. First, by correlating sequential requests from the same client: a crawler that requests A, gets a 301, then requests B a moment later from the same IP and user-agent reveals the A → B edge through timing and ordering. Second, by logging the Location header so each hop's target is explicit. The second approach is far more reliable, so add it if you control the server config.
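The first approach, correlating sequential requests, can be sketched roughly as follows. This is a heuristic illustration rather than a definitive method: the function name infer_edges is hypothetical, the log is assumed to be in combined format and in chronological order, and a client's very next logged request after a 301/302 is assumed to be the redirect target. Timestamps are ignored, so interleaved requests from a busy crawler will produce false edges.

```shell
# infer_edges (hypothetical helper): read a combined-format access log on
# stdin and print "source -> next request" pairs inferred from ordering alone.
infer_edges() {
  awk '
    {
      n = split($0, q, "\"")                   # q[2] = request line, q[6] = user agent
      client = $1 "|" (n >= 6 ? q[6] : "")     # key each client by IP plus user agent
      if (client in pending) {                 # previous request from this client was a 301/302
        print pending[client] " -> " $7
        delete pending[client]
      }
      if ($9 ~ /^30[12]$/) pending[client] = $7
    }
  '
}
```

A possible invocation is `infer_edges < access.log | sort | uniq -c | sort -nr`; edges that recur many times are more likely to be real redirects than coincidental orderings.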
Step-by-Step: Surfacing and Reconstructing Chains
Step 1: Rank the top redirected URLs. Find which requested paths most often return a redirect. These are your highest-impact fixes because they consume the most crawler requests.
awk '$9 ~ /^30[12]$/ {print $9, $7}' access.log \
| sort | uniq -c | sort -nr | head -20
Expected Output:
12877 301 /blog/old-category
8901 301 /products?ref=email
4502 302 /login
991 301 /about-us
Explanation: Each line is count status path. The first row means requests for /blog/old-category returned a 301 12,877 times: a single legacy URL bleeding crawl budget. Fix the inbound links and the canonical so crawlers stop requesting the pre-redirect URL at all.
Step 2: Capture the Location header so targets become visible. The default log format hides where each redirect points. Add the upstream/redirect target to a dedicated log so chains can be read directly. In Nginx, define a format that appends $sent_http_location:
log_format redirects '$remote_addr "$request" $status -> "$sent_http_location" "$http_user_agent"';
map $status $is_redirect {
    ~^30[12] 1;
    default  0;
}
access_log /var/log/nginx/redirects.log redirects if=$is_redirect;
Production Warning: Adding a log_format and map block edits live config. Validate with nginx -t before reloading, and confirm the new log path is writable by the Nginx user. Test the reload during low traffic; see fixing 301 redirect loops in Nginx for a safe reload procedure.
Expected Output (a line in redirects.log):
66.249.66.1 "GET /blog/old-category HTTP/1.1" 301 -> "https://example.com/blog/news" "Googlebot/2.1"
Step 3: Flag URLs that redirect to other redirects. This is the core chain detector. Once you have the request -> location pairs from Step 2, an awk script can hold the full set of redirect sources in memory, then report any redirect whose target is itself a redirect source — i.e. a chain of length two or more.
awk '
  match($0, /"GET ([^ ]+) HTTP/, req) &&
  match($0, /-> "([^"]+)"/, loc) {
    # strip scheme+host from the Location to compare against request paths
    dest = loc[1]; sub(/^https?:\/\/[^\/]+/, "", dest)
    src[req[1]] = dest        # remember each redirect source -> its target
  }
  END {
    for (u in src) {
      t = src[u]
      if (t in src)           # target is itself a redirect source = chain
        printf "CHAIN: %s -> %s -> %s\n", u, t, src[t]
    }
  }
' redirects.log
Expected Output:
CHAIN: /blog/old-category -> /blog/news -> /blog/latest-news
CHAIN: /shop -> /products -> /products/
Explanation: The match(...) calls use GNU awk's third-argument capture to pull the request path and the Location target. The script normalizes the target to a root-relative path, stores every source -> target edge, then in END reports any edge whose target also appears as a source. Each CHAIN: line is a two-hop redirect you can collapse to a single hop. Re-run after each fix; the list should shrink toward empty.
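The script above reports two hops at a time, so a four-URL chain shows up as overlapping pairs. A minimal sketch of walking each chain from its entry point to its final target might look like the following. The function name walk_chains and the two-column input (source path, then normalized target path, whitespace-separated) are assumptions for illustration; such pairs could be produced by adapting the extraction above to print req[1] and dest.

```shell
# walk_chains (hypothetical helper): stdin is "source target" pairs, one
# redirect per line. Prints every chain of two or more hops, starting from
# URLs that are never themselves a redirect target.
walk_chains() {
  awk '
    { next_hop[$1] = $2; is_target[$2] = 1 }
    END {
      for (u in next_hop) {
        if (u in is_target) continue     # mid-chain URL, reported from its entry point
        path = u; cur = u; hops = 0
        split("", seen)                  # portable way to empty an array
        while ((cur in next_hop) && !(cur in seen)) {
          seen[cur] = 1                  # guard against redirect loops
          cur = next_hop[cur]
          path = path " -> " cur
          hops++
        }
        if (hops >= 2)
          print hops " hops" ((cur in seen) ? " (loop)" : "") ": " path
      }
    }
  ' | sort
}
```

One limitation of this design: a closed loop that nothing outside the loop redirects into has no entry point, so it is not reported. The two-hop check above still catches those.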
Edge-Case Handling
Mixing absolute and relative Location values. Some apps emit Location: /products/, others emit Location: https://example.com/products/. If you compare them raw, /products/ (a source path) and https://example.com/products/ (a target) never match and a real chain hides. The sub(/^https?:\/\/[^\/]+/, "", dest) line in Step 3 strips the scheme and host so both forms reduce to the same path; keep it. Also watch trailing slashes: /shop and /shop/ are distinct keys, and a slash-only redirect is one of the most common accidental chains.
Non-GNU awk has no capture groups. The match($0, /regex/, arr) array form is a gawk extension. On BSD/macOS awk and mawk the three-argument call is rejected as a syntax error, so the script never runs. Confirm with awk --version (GNU prints "GNU Awk"); if absent, install gawk or pre-extract fields with grep -oE before piping into awk. For day-to-day filtering technique, see the awk and grep commands for log filtering reference.
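Where installing gawk is not an option, one possible workaround is to split each line on double quotes instead of using capture groups. The sketch below assumes the exact redirects.log layout from Step 2 (request line in the first quoted field, Location in the second); the function name extract_edges is hypothetical, and it relies only on POSIX awk features.

```shell
# extract_edges (hypothetical helper): stdin is the Step 2 redirects.log
# format; stdout is "source<TAB>target" with scheme and host stripped.
extract_edges() {
  awk '
    {
      n = split($0, q, "\"")               # q[2] = request line, q[4] = Location value
      if (n < 5) next                      # line is not in the expected format
      if (split(q[2], r, " ") < 2) next    # r[1] = method, r[2] = path
      if (r[1] != "GET") next              # mirror the GET-only match in Step 3
      dest = q[4]
      sub(/^https?:\/\/[^\/]+/, "", dest)  # normalize to a root-relative path
      if (dest == "") dest = "/"           # a bare host means the site root
      print r[2] "\t" dest
    }
  '
}
```

Running `extract_edges < redirects.log` yields one edge per line, which any awk implementation can then load into an array keyed by source path.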
Verification: Confirm the Chain Is Gone
After collapsing a chain to a single hop, confirm the live response. curl -IL follows every redirect and prints each status, so a fixed URL should show exactly one 301 then a 200.
curl -sIL https://example.com/blog/old-category | grep -E '^HTTP|^[Ll]ocation'
Expected Output:
HTTP/2 301
location: https://example.com/blog/latest-news
HTTP/2 200
One 301 followed by one 200 — no intermediate hop. If you still see two 301 lines before the 200, the chain persists and the source map from Step 3 will still list it.
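When checking many URLs, counting the redirect lines by eye gets tedious. A small sketch of automating the count, assuming the input is the header dump produced by curl -sIL (the function name count_hops is hypothetical):

```shell
# count_hops (hypothetical helper): stdin is the output of `curl -sIL <url>`;
# prints how many redirect responses preceded the final status.
count_hops() {
  awk '
    { sub(/\r$/, "") }                       # HTTP header lines end in CRLF
    /^HTTP\// {
      status = $2
      if (status ~ /^30[12378]$/) hops++     # 301, 302, 303, 307, 308
    }
    END { printf "hops=%d final=%s\n", hops, status }
  '
}
```

For example, `curl -sIL https://example.com/blog/old-category | count_hops` should report hops=1 final=200 once the chain has been collapsed.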
Common Mistakes
- Counting 304 as a redirect. 304 Not Modified is a conditional-cache response, not a redirect. The 30[12] pattern excludes it deliberately; a broad ^3 pattern inflates your numbers and sends you chasing phantom chains.
- Treating per-line 301s as chains. A single 301 line is just one hop. Without the Location header (Step 2) or sequential-request correlation, you cannot prove a chain exists; you only know a redirect happened. Always reconstruct the edge before claiming a chain.
- Ignoring query-string variants. /products?ref=email and /products?ref=ads are different keys to awk but redirect to the same canonical. Strip query strings with sub(/\?.*/, "", path) before grouping, or you will under-count the real offender and over-count noise.
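As an illustration of the query-string point, here is one way the Step 1 ranking could be adapted so that variants of the same path are grouped together. The function name rank_redirects is hypothetical, and the input is assumed to be a combined-format log.

```shell
# rank_redirects (hypothetical helper): stdin is a combined-format access log;
# prints "count status path" for 301/302 responses, query strings removed.
rank_redirects() {
  awk '
    $9 ~ /^30[12]$/ {
      path = $7
      sub(/\?.*/, "", path)        # /products?ref=email -> /products
      print $9, path
    }
  ' | sort | uniq -c | sort -nr | awk '{ print $1, $2, $3 }'
}
```

Usage would be along the lines of `rank_redirects < access.log | head -20`. The final awk stage only trims the padding that uniq -c adds in front of each count.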
Frequently Asked Questions
Can I find redirect chains without changing my log format?
Partially. You can rank top redirected URLs (Step 1) from any combined-format log, which already pinpoints high-cost paths. But reconstructing the actual A → B → C edges reliably needs the Location header, which the default format omits. Sequential-request correlation by IP and timestamp works as a fallback but is noisy on busy crawlers.
How many hops is too many for crawl budget?
Google will follow up to about five hops before giving up, but every hop past the first wastes a request and slows indexation. Treat any chain of two or more as a defect to collapse to a single 301. The goal is always source → final 200 in one hop.
Why filter by Googlebot specifically when analyzing chains?
Crawl budget is spent by search bots, not human visitors. Isolating Googlebot (ideally after verifying it is genuine) shows which chains actually drain your budget versus which only affect a handful of users. Pair this with finding the top 404 URLs with awk to see the full picture of wasted crawler requests.
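A rough sketch of that isolation step, assuming a combined-format log where the user agent is the last quoted field (the function name googlebot_only is hypothetical). Note that this matches on the user-agent string only, which any client can spoof; it does not verify that the request really came from Google.

```shell
# googlebot_only (hypothetical helper): keep combined-format lines whose
# user-agent field mentions Googlebot. Split on double quotes, the user
# agent is field 6.
googlebot_only() {
  awk -F'"' '$6 ~ /Googlebot/'
}
```

The filtered stream can then be piped into the Step 1 ranking, for example `googlebot_only < access.log | awk '$9 ~ /^30[12]$/ {print $9, $7}' | sort | uniq -c | sort -nr | head -20`.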
Related Guides
- Fixing 301 Redirect Loops in Nginx — collapse the chains you found into a single safe hop
- Understanding HTTP Status Codes in Server Logs — why 301 and 302 differ for indexation
- awk and grep Commands for Log Filtering — the field-extraction patterns used here
- Finding the Top 404 URLs with awk — the companion technique for dead-end crawl waste
Part of the Redirect Chain Optimization series.