Finding Redirect Chains in Server Logs with awk

A redirect chain is any URL that resolves through two or more hops before reaching a 200 OK final destination. Each extra hop costs latency, dilutes link equity, and burns crawl budget — Googlebot has to issue a fresh request for every 301/302 it follows. The good news is that your access log already records every hop a crawler made; you just have to reconstruct the sequence. This guide shows how to surface 3xx responses with awk and grep, rank the worst offenders, and stitch single-request lines back into multi-hop chains so you can fix them at the source as part of broader redirect chain optimization.

What you will accomplish:

Isolate and count 3xx responses by requested URL straight from a combined-format log
Reconstruct multi-hop chains by correlating a single crawler's sequential requests
Flag URLs whose redirect target is itself a redirect — the true chain offenders

Diagnosis: Confirming Redirects Exist in the Log

Start by confirming the access log actually carries 3xx responses and is in the expected combined format. In the standard Nginx/Apache combined layout, field $7 is the requested path and field $9 is the status code. A quick frequency sweep tells you whether redirects are a rounding error or a systemic problem.

awk '$9 ~ /^3[0-9][0-9]$/ {print $9}' access.log | sort | uniq -c | sort -nr

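Expected Output: one line per 3xx status code, each prefixed with its request count, sorted from most to least frequent.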

If 301 and 302 counts run into the thousands while your site has only a handful of intentional redirects, crawlers are looping through stale paths. The 304 Not Modified responses are benign conditional-GET cache hits — ignore them. To understand exactly what each code means before acting, review HTTP status codes in server logs; a 301 and a 302 behave very differently for indexation.
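Before trusting those counts, it can help to confirm that field $9 really holds a status code on every line. The following is a minimal sketch, assuming a combined-format log; the three sample lines are hypothetical stand-ins for your own access.log. A malformed request (logged as "-") shifts the fields, so its $9 is not a status code.

```shell
#!/bin/sh
# Hypothetical sample lines; point the awk command at your real access.log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/old-category HTTP/1.1" 301 178 "-" "Googlebot/2.1"
203.0.113.9 - - [10/Oct/2024:13:55:40 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
198.51.100.7 - - [10/Oct/2024:13:55:41 +0000] "-" 400 0 "-" "-"
EOF

# Count lines whose ninth field is not a three-digit HTTP status.
report=$(awk '
  $9 !~ /^[1-5][0-9][0-9]$/ { bad++ }
  END { printf "%d of %d lines lack a status code in field 9\n", bad, NR }
' "$log")

printf '%s\n' "$report"
rm -f "$log"
```

A small share of mismatches usually points to malformed requests; a large share suggests the log uses a custom format and the field numbers in this guide need adjusting.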

Concept: Why a Single Line Is Not a Chain

A redirect chain is A → B → C → 200, but a log line only ever records one hop: the request for A and the 301 the server returned. The Location header — the target the crawler was sent to — is not in the default access log. So you reconstruct chains in one of two ways. First, by correlating sequential requests from the same client: a crawler that requests A, gets a 301, then requests B a moment later from the same IP and user-agent reveals the A → B edge through timing and ordering. Second, by logging the Location header so each hop's target is explicit. The second approach is far more reliable, so add it if you control the server config.
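The first approach, sequential-request correlation, can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: it treats the same IP plus user-agent as one client, assumes the log is in time order, and infers an edge whenever a client's previous request returned a 301 or 302. The sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical combined-format sample; replace with your own log.
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/old-category HTTP/1.1" 301 178 "-" "Googlebot/2.1"
203.0.113.9 - - [10/Oct/2024:13:55:36 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
66.249.66.1 - - [10/Oct/2024:13:55:37 +0000] "GET /blog/news HTTP/1.1" 301 178 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2024:13:55:38 +0000] "GET /blog/latest-news HTTP/1.1" 200 9342 "-" "Googlebot/2.1"
EOF

edges=$(awk '
  {
    split($0, q, "\"")               # q[6] is the user-agent in combined format
    client = $1 SUBSEP q[6]          # one client = IP + user-agent
    if (client in pending)           # the previous request from this client was a 301/302
      print pending[client] " -> " $7 " (" $9 ")"
    if ($9 ~ /^30[12]$/) pending[client] = $7
    else delete pending[client]
  }
' "$log")

printf '%s\n' "$edges"
rm -f "$log"
```

The inferred edges are only guesses: a crawler that interleaves unrelated requests will produce false pairs, which is why logging the Location header is the more reliable route.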

Step-by-Step: Surfacing and Reconstructing Chains

Step 1: Rank the top redirected URLs. Find which requested paths most often return a redirect. These are your highest-impact fixes because they consume the most crawler requests.

awk '$9 ~ /^30[12]$/ {print $9, $7}' access.log \
  | sort | uniq -c | sort -nr | head -20

Expected Output:

  12877 301 /blog/old-category
   8901 301 /products?ref=email
   4502 302 /login
    991 301 /about-us

Explanation: Each line is count, status, path. The first row means 12,877 requests for /blog/old-category returned a 301: a single legacy URL bleeding crawl budget. The count covers redirect responses only, so it does not show whether the same path ever returned another status. Fix the inbound links and the canonical so crawlers stop requesting the pre-redirect URL at all.

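Step 2: Capture the Location header so targets become visible. The default log format hides where each redirect points. Write each redirect's target to a dedicated log so chains can be read directly. In Nginx, define a format that includes $sent_http_location (the Location header sent back to the client) and log only redirect responses:

# map and log_format must sit in the http {} context;
# access_log may also go in a server or location block.

# Flag 301/302 responses so only redirects reach the dedicated log.
map $status $is_redirect {
    ~^30[12] 1;
    default  0;
}

log_format redirects '$remote_addr "$request" $status -> "$sent_http_location" "$http_user_agent"';
access_log /var/log/nginx/redirects.log redirects if=$is_redirect;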

Production Warning: Adding a log_format and map block edits live config. Validate with nginx -t before reloading, and confirm the new log path is writable by the Nginx user. Test the reload during low traffic — see fixing 301 redirect loops in Nginx for safe reload procedure.

Expected Output (a line in redirects.log):

66.249.66.1 "GET /blog/old-category HTTP/1.1" 301 -> "https://example.com/blog/news" "Googlebot/2.1"

Step 3: Flag URLs that redirect to other redirects. This is the core chain detector. Once you have the request -> location pairs from Step 2, an awk script can hold the full set of redirect sources in memory, then report any redirect whose target is itself a redirect source — i.e. a chain of length two or more.

awk '
  match($0, /"GET ([^ ]+) HTTP/, req) &&
  match($0, /-> "([^"]+)"/, loc) {
    # strip scheme+host from the Location to compare against request paths
    dest = loc[1]; sub(/^https?:\/\/[^\/]+/, "", dest)
    src[req[1]] = dest          # remember each redirect source -> its target
  }
  END {
    for (u in src) {
      t = src[u]
      if (t in src)             # target is itself a redirect source = chain
        printf "CHAIN: %s -> %s -> %s\n", u, t, src[t]
    }
  }
' redirects.log

Expected Output:

CHAIN: /blog/old-category -> /blog/news -> /blog/latest-news
CHAIN: /shop -> /products -> /products/

Explanation: The match(...) calls use GNU awk's third-argument capture to pull the request path and the Location target. The script normalizes the target to a root-relative path, stores every source -> target edge, then in END reports any edge whose target also appears as a source. Each CHAIN: line is a two-hop redirect you can collapse to a single hop. Re-run after each fix; the list should shrink toward empty.
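The Step 3 script reports each chain as overlapping two-hop pairs. To see a whole chain from its first URL to its final target, one option is to walk the edges until the target is no longer a redirect source. The sketch below assumes the edges are already available as source<TAB>target pairs with normalized paths; the sample edges are hypothetical, and the seen-set guard that marks loops is an addition not found in the original script. It uses only POSIX awk features.

```shell
#!/bin/sh
# Hypothetical source<TAB>target edges, as Step 2 might yield after path normalization.
pairs=$(mktemp)
printf '%s\t%s\n' \
  /blog/old-category /blog/news \
  /blog/news /blog/latest-news \
  /blog/latest-news /blog/2024/latest \
  /about-us /about \
  /old-login /login \
  /login /signin \
  /signin /login > "$pairs"

chains=$(awk -F'\t' '
  { next_hop[$1] = $2; is_target[$2] = 1 }
  END {
    for (u in next_hop) {
      if (u in is_target) continue        # start only from chain heads
      path = u; cur = u; hops = 0
      split("", seen)                     # portable way to empty an array
      while (cur in next_hop) {
        if (cur in seen) { path = path " (LOOP)"; break }
        seen[cur] = 1
        cur = next_hop[cur]; hops++
        path = path " -> " cur
      }
      if (hops >= 2) print hops " hops: " path
    }
  }
' "$pairs" | sort)

printf '%s\n' "$chains"
rm -f "$pairs"
```

Because the walk starts only from URLs that are never a redirect target, a closed loop with no entry point would go unreported; single-hop redirects such as /about-us are skipped on purpose.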

Edge-Case Handling

Mixing absolute and relative Location values. Some apps emit Location: /products/, others emit Location: https://example.com/products/. If you compare them raw, /products (source) and https://example.com/products/ (target) never match and a real chain hides. The sub(/^https?:\/\/[^\/]+/, "", dest) line in Step 3 strips the scheme and host so both forms reduce to the same path — keep it. Also watch trailing slashes: /shop and /shop/ are distinct keys, and a slash-only redirect is one of the most common accidental chains.

Non-GNU awk has no capture groups. The match($0, /regex/, arr) array form is a gawk extension. BSD/macOS awk and mawk reject the three-argument call with a syntax error rather than running the script. Confirm with awk --version (GNU prints "GNU Awk"); if absent, install gawk or pre-extract fields with grep -oE before piping into awk. For day-to-day filtering technique, see the awk and grep commands for log filtering reference.
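Where gawk is not available, the same source and target pairs can be pulled out by splitting on the double quote instead of using capture groups. This is a sketch that assumes the redirects log format from Step 2 (request in the first quoted field, Location in the second); the sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical lines in the Step 2 "redirects" format.
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.66.1 "GET /blog/old-category HTTP/1.1" 301 -> "https://example.com/blog/news" "Googlebot/2.1"
66.249.66.1 "GET /shop HTTP/1.1" 301 -> "/products" "Googlebot/2.1"
EOF

# With -F'"', $2 is the request line and $4 is the Location value.
pairs=$(awk -F'"' '
  {
    split($2, req, " ")                   # req[2] is the requested path
    dest = $4
    sub("^https?://[^/]+", "", dest)      # strip scheme and host, keep the path
    if (req[2] != "" && dest != "") print req[2] "\t" dest
  }
' "$log")

printf '%s\n' "$pairs"
rm -f "$log"
```

The output is one tab-separated source and target per line, and both absolute and root-relative Location values reduce to the same path form.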

Verification: Confirm the Chain Is Gone

After collapsing a chain to a single hop, confirm the live response. curl -IL follows every redirect and prints each status, so a fixed URL should show exactly one 301 then a 200.

curl -sIL https://example.com/blog/old-category | grep -E '^HTTP|^[Ll]ocation'

Expected Output:

HTTP/2 301
location: https://example.com/blog/latest-news
HTTP/2 200

One 301 followed by one 200 — no intermediate hop. If you still see two 301 lines before the 200, the chain persists and the source map from Step 3 will still list it.
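Counting the hops by eye gets tedious across many URLs. A small helper can count redirect status lines in the header dump instead. The function name count_redirect_hops is hypothetical, and the canned headers below stand in for real curl -sIL output so the sketch runs without network access.

```shell
#!/bin/sh
# Count redirect status lines (301, 302, 303, 307, 308) in a header dump on stdin.
count_redirect_hops() {
  awk '/^HTTP\// && $2 ~ /^30[12378]/ { hops++ } END { print hops + 0 }'
}

# Hypothetical header dumps: a two-hop chain, then the same URL after the fix.
chained=$(printf 'HTTP/2 301 \r\nlocation: https://example.com/blog/news\r\n\r\nHTTP/2 301 \r\nlocation: https://example.com/blog/latest-news\r\n\r\nHTTP/2 200 \r\n\r\n' | count_redirect_hops)
fixed=$(printf 'HTTP/2 301 \r\nlocation: https://example.com/blog/latest-news\r\n\r\nHTTP/2 200 \r\n\r\n' | count_redirect_hops)

echo "before fix: $chained hop(s), after fix: $fixed hop(s)"

# Live usage (needs network):
#   curl -sIL https://example.com/blog/old-category | count_redirect_hops
```

A result of 1 means the chain is collapsed; anything higher means an intermediate hop remains.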

Common Mistakes

Counting 304 as a redirect. 304 Not Modified is a conditional-cache response, not a redirect. The 30[12] pattern excludes it deliberately; a broad ^3 pattern inflates your numbers and sends you chasing phantom chains.
Treating per-line 301s as chains. A single 301 line is just one hop. Without the Location header (Step 2) or sequential-request correlation, you cannot prove a chain exists — you only know a redirect happened. Always reconstruct the edge before claiming a chain.
Ignoring query-string variants. /products?ref=email and /products?ref=ads are different keys to awk but redirect to the same canonical. Strip query strings with sub(/\?.*/, "", path) before grouping, or you will under-count the real offender and over-count noise.
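The query-string fix from the last item can be sketched like this, assuming a combined-format log; the sample lines are hypothetical. The path is copied into a variable before stripping so the original field stays untouched, and counting happens inside awk to keep the output format predictable.

```shell
#!/bin/sh
# Hypothetical sample: three tracking-parameter variants of one path, plus one other redirect.
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /products?ref=email HTTP/1.1" 301 178 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2024:13:55:37 +0000] "GET /products?ref=ads HTTP/1.1" 301 178 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2024:13:55:38 +0000] "GET /products?ref=email HTTP/1.1" 301 178 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2024:13:55:39 +0000] "GET /about-us HTTP/1.1" 301 178 "-" "Googlebot/2.1"
EOF

grouped=$(awk '
  $9 ~ /^30[12]$/ {
    path = $7
    sub(/\?.*/, "", path)                 # drop the query string before grouping
    count[$9 " " path]++
  }
  END { for (k in count) print count[k], k }
' "$log" | sort -nr)

printf '%s\n' "$grouped"
rm -f "$log"
```

With the variants merged, /products shows its true total of three redirects instead of being split across two smaller rows.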

Frequently Asked Questions

Can I find redirect chains without changing my log format?
Partially. You can rank top redirected URLs (Step 1) from any combined-format log, which already pinpoints high-cost paths. But reconstructing the actual A → B → C edges reliably needs the Location header, which the default format omits. Sequential-request correlation by IP and timestamp works as a fallback but is noisy on busy crawlers.

How many hops is too many for crawl budget?
Google's crawler documentation states that Googlebot follows up to 10 redirect hops before it reports a redirect error, but every hop past the first wastes a request and slows indexation. Treat any chain of two or more as a defect to collapse to a single 301. The goal is always source → final 200 in one hop.

Why filter by Googlebot specifically when analyzing chains?
Crawl budget is spent by search bots, not human visitors. Isolating Googlebot (ideally after verifying it is genuine) shows which chains actually drain your budget versus which only affect a handful of users. Pair this with finding the top 404 URLs with awk to see the full picture of wasted crawler requests.
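A Googlebot-only version of the Step 1 ranking might look like the sketch below. It assumes a combined-format log and matches on the user-agent string alone, which anyone can spoof; verifying that the requests are genuine is a separate step not shown here. The sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample mixing Googlebot and ordinary browser traffic.
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/old-category HTTP/1.1" 301 178 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [10/Oct/2024:13:55:37 +0000] "GET /blog/old-category HTTP/1.1" 301 178 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.9 - - [10/Oct/2024:13:55:38 +0000] "GET /blog/old-category HTTP/1.1" 301 178 "-" "Mozilla/5.0 (X11; Linux x86_64)"
66.249.66.1 - - [10/Oct/2024:13:55:39 +0000] "GET /login HTTP/1.1" 302 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [10/Oct/2024:13:55:40 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# With -F'"', $2 is the request line, $3 holds status and bytes, $6 is the user-agent.
ranked=$(awk -F'"' '
  $6 ~ /Googlebot/ {
    split($2, req, " ")                   # req[2] is the requested path
    split($3, resp, " ")                  # resp[1] is the status code
    if (resp[1] ~ /^30[12]$/) hits[resp[1] " " req[2]]++
  }
  END { for (k in hits) print hits[k], k }
' "$log" | sort -nr)

printf '%s\n' "$ranked"
rm -f "$log"
```

Splitting on the double quote keeps the user-agent intact even though it contains spaces, which plain whitespace fields cannot do.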

Related Guides

Fixing 301 Redirect Loops in Nginx: collapse the chains you found into a single safe hop
Understanding HTTP Status Codes in Server Logs — why 301 and 302 differ for indexation
awk and grep Commands for Log Filtering — the field-extraction patterns used here
Finding the Top 404 URLs with awk — the companion technique for dead-end crawl waste

Part of the Redirect Chain Optimization series.
