Finding the Top 404 URLs with awk

A handful of high-traffic 404s can quietly burn crawl budget and frustrate users for months. Before you reach for a dashboard, a single awk pipeline ranks every missing URL by frequency straight from the access log, so you know exactly which broken paths to fix first. This guide walks through the canonical one-liner, then segments those 404s by verified Googlebot versus humans and joins them back to the referrer that produced them. It is the surgical companion to the broader CLI One-Liners for Quick Audits cluster.

By the end you will rank 404 and 410 URLs by hit count, isolate the ones search engines actually crawl, trace the referring page causing each broken link, and assign a remediation priority (301 versus 410 versus leave-it). Every command runs against a standard Apache/Nginx combined log with nothing beyond standard Unix utilities (awk, sort, uniq, grep, and host for the DNS check).

Diagnosis: Confirm You Have a 404 Problem

In the combined log format, the HTTP status code is field $9 and the requested URI is field $7. A 404 line looks like this:

66.249.66.1 - - [18/Jun/2026:04:12:55 +0000] "GET /old-product-page HTTP/1.1" 404 564 "https://example.com/catalog" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Confirm the scale of the problem with a fast count of all 404 responses:

awk '$9 == 404' access.log | wc -l

Expected Output: A single integer, e.g. 4821. If this is more than a fraction of a percent of total lines, the ranking below is worth your time. For background on what each status code signals to a crawler, see understanding HTTP status codes in server logs.
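To put that integer in context, the count and the share of total lines can be computed in one pass. The following is a minimal sketch; the function name share_404 is illustrative and not part of any standard tool, and it assumes the combined log format described above.

```shell
# Print the 404 count, the total line count, and the 404 share as a percentage.
# Usage: share_404 access.log
share_404() {
  awk '
    $9 == 404 { miss++ }
    END {
      if (NR == 0) { print "no lines read"; exit 1 }
      printf "%d of %d requests returned 404 (%.2f%%)\n", miss, NR, 100 * miss / NR
    }
  ' "$1"
}
```

Run it as share_404 access.log. Reading the file once avoids a second wc -l pass on large logs.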

Concept: Why a Few 404 URLs Dominate

404 responses are rarely uniform. A small number of dead URLs — a renamed product, a retired campaign landing page, a broken asset referenced site-wide — typically account for the overwhelming majority of 404 hits. The scattering of one-off 404s (typos, probing scanners) is noise. Ranking by frequency surfaces the handful of URLs where a single fix removes thousands of wasted requests, which is the entire point of doing this from the command line instead of eyeballing raw lines.
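One way to check how concentrated your own 404s are is to measure what share of all 404 hits the top N URLs cover. This is a rough sketch: top_404_share is a hypothetical helper name, and N defaults to 10 purely as an illustrative choice.

```shell
# Report what share of all 404 hits the N most-requested dead URLs account for.
# Usage: top_404_share access.log 10
top_404_share() {
  awk '$9 == 404 {print $7}' "$1" | sort | uniq -c | sort -nr |
    awk -v n="${2:-10}" '
      { total += $1; if (NR <= n) top += $1 }
      END {
        if (total == 0) { print "no 404s found"; exit }
        printf "top %d URLs: %d of %d 404 hits (%.1f%%)\n", n, top, total, 100 * top / total
      }
    '
}
```

If the top ten paths cover most of the hits, fixing those few URLs is where the effort pays off; a flat distribution suggests scanner noise instead.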

Step-by-Step: Rank 404 (and 410) URLs by Frequency

Step 1: The canonical 404 ranking one-liner
Filter on status 404, extract the URI, then count and sort descending.

awk '$9 == 404 {print $7}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: $9 == 404 keeps only Not Found responses, print $7 emits the requested path, sort | uniq -c collapses duplicates into counts, and sort -nr | head -20 returns the 20 most-requested dead URLs.
Expected Output:

   1843 /old-product-page
    912 /campaigns/spring-sale
    604 /assets/logo-v1.png
    221 /blog/2019/?p=88

Step 2: Include 410 (Gone) responses in the same pass
A 410 is a deliberate "this is gone forever" signal. Auditing both together shows whether your intentional removals are still being hammered and whether any path you meant to retire is still returning 404 instead.

awk '$9 == 404 || $9 == 410 {print $9, $7}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: Now the status code is printed alongside the URI, so uniq -c groups by the status-plus-path pair. You can instantly see which dead URLs are correctly returning 410 versus which are still leaking 404s.
Expected Output:

   1843 404 /old-product-page
    912 404 /campaigns/spring-sale
    604 404 /assets/logo-v1.png
    410 410 /discontinued/widget-x
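When a path is in transition it can return both codes within the same log. A side-by-side view makes that easier to spot than two separate ranked rows. The sketch below is one possible layout, not the article's canonical command; pivot_404_410 is a hypothetical name and the column order (total, 404 count, 410 count, path) is an arbitrary choice.

```shell
# Show per-path 404 and 410 counts side by side, sorted by combined volume.
# Output columns (tab-separated): total, 404 hits, 410 hits, path.
# Usage: pivot_404_410 access.log
pivot_404_410() {
  awk '
    $9 == 404 { n404[$7]++; seen[$7] }
    $9 == 410 { n410[$7]++; seen[$7] }
    END {
      for (p in seen)
        printf "%d\t%d\t%d\t%s\n", n404[p] + n410[p], n404[p], n410[p], p
    }
  ' "$1" | sort -nr | head -20
}
```

A row with non-zero values in both count columns is a path whose status changed during the log window, or one that is served inconsistently across servers.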

Step 3: Normalize query strings before counting
Query parameters fragment the count: /search?q=a and /search?q=b are tallied separately even though the base path is the same broken endpoint. Strip everything after the ? to aggregate by path.

awk '$9 == 404 {print $7}' access.log | awk -F'?' '{print $1}' | sort | uniq -c | sort -nr | head -20

Explanation: The second awk splits each URI on ? and keeps only the base path ($1), so parameterized variants of the same dead page collapse into one ranked entry. This mirrors the path-normalization pattern covered in awk and grep commands for log filtering.
Expected Output: The same ranking, but parameter-heavy paths now aggregate into a single, larger count.
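The two awk processes in Step 3 can be folded into one by stripping the query string inside the first awk. This is a small sketch of that variant; rank_404_paths is an illustrative function name.

```shell
# Rank 404 paths with the query string removed, using a single awk process.
# Usage: rank_404_paths access.log
rank_404_paths() {
  awk '$9 == 404 {
    path = $7
    sub(/[?].*/, "", path)   # drop "?" and everything after it
    print path
  }' "$1" | sort | uniq -c | sort -nr | head -20
}
```

Copying $7 into a separate variable before calling sub leaves the original record untouched, which matters if you later extend the action to print other fields.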

Step-by-Step: Segment 404s by Verified Googlebot vs Humans

A 404 that only humans hit is a UX problem. A 404 that Googlebot hits repeatedly is a crawl-budget and indexing problem. They demand different urgency, so split the stream.

Step 4: Rank the 404s Googlebot is hitting
The user-agent lives in the last quoted field. Filter on the Googlebot token, then rank the dead paths it requests.

awk '$9 == 404 && /Googlebot/ {print $7}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: && /Googlebot/ requires the whole line to contain the Googlebot token before the path is emitted, restricting the ranking to crawler-driven 404s.
Expected Output:

    721 /old-product-page
    540 /campaigns/spring-sale
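Because /Googlebot/ matches anywhere on the line, a path or referrer that happens to contain the word would also be counted. A stricter variant tests only the user-agent field by splitting on double quotes. Treat this as a sketch: googlebot_ua_404 is a hypothetical name, and the field positions assume a combined log line with no escaped quotes inside the referrer or user-agent.

```shell
# Rank 404 paths where the User-Agent field itself contains "Googlebot".
# Splitting on double quotes gives: $2 = request line, $3 = " status bytes ",
# $4 = referrer, $6 = user-agent (combined log format assumed).
# Usage: googlebot_ua_404 access.log
googlebot_ua_404() {
  awk -F'"' '
    $6 ~ /Googlebot/ {
      split($3, resp, " ")   # resp[1] = status, resp[2] = bytes
      split($2, req, " ")    # req[1] = method, req[2] = path, req[3] = protocol
      if (resp[1] == 404) print req[2]
    }
  ' "$1" | sort | uniq -c | sort -nr | head -20
}
```

On most logs the ranking matches Step 4; a difference means some 404 lines mention Googlebot somewhere other than the user-agent.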

Safety Note: The Googlebot string in the user-agent is trivially spoofed by scrapers. Before you act on "Googlebot" volume, verify the requester with a reverse-then-forward DNS check: dig -x <IP> should return a googlebot.com hostname, and a forward lookup of that hostname should return the same IP. A spoofed agent fails one of those two lookups. The full procedure lives in detecting fake Googlebot traffic in access logs.

Step 5: Verify Googlebot IPs, then re-rank against the clean set
Extract the IPs hitting 404s under the Googlebot UA, confirm they reverse-resolve to Google, and exclude the impostors.

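awk '$9 == 404 && /Googlebot/ {print $1}' access.log | sort -u | \
while read -r ip; do
  # keep the IP only if its PTR record ends in .googlebot.com
  host "$ip" | grep -Eq '\.googlebot\.com\.?$' && echo "$ip"
done > verified_googlebot_ips.txt
# match verified IPs exactly against field $1, not as substrings of the line
awk 'NR == FNR {ok[$1]; next} $9 == 404 && ($1 in ok) {print $7}' verified_googlebot_ips.txt access.log | sort | uniq -c | sort -nr | head -20

Explanation: The loop keeps only IPs whose reverse DNS name ends in .googlebot.com and writes them to a file. The final awk loads that file into an array and keeps a log line only when field $1 equals a verified IP. A plain grep -Ff would match substrings instead, so 66.249.66.1 would also pull in 66.249.66.10. This pass checks reverse DNS only; confirm the forward lookup as well before treating an IP as fully verified. The result is the list of dead URLs genuine Googlebot is wasting crawl requests on.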
Expected Output: A shorter, trustworthy ranking — typically smaller than Step 4 once spoofed agents are removed.
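Step 5 tests the reverse lookup only, while the Safety Note in Step 4 calls for a reverse-then-forward check. The sketch below adds the forward half. Several things here are assumptions rather than part of the original workflow: the function name is_verified_googlebot is hypothetical, the host utility is assumed to print its usual "domain name pointer" and "has address" lines, and the google.com suffix is accepted alongside googlebot.com based on Google's published crawler-verification guidance.

```shell
# Return success only if the IP passes a reverse-then-forward DNS check.
# Assumes `host` prints "... domain name pointer NAME." for reverse lookups
# and "NAME has address IP" for forward lookups.
# Usage: is_verified_googlebot 66.249.66.1 && echo verified
is_verified_googlebot() {
  ip=$1
  # Reverse lookup: take the PTR name and strip the trailing dot.
  name=$(host "$ip" | awk '/domain name pointer/ {sub(/\.$/, "", $NF); print $NF; exit}')
  case "$name" in
    *.googlebot.com|*.google.com) ;;   # accepted suffixes (assumption, see lead-in)
    *) return 1 ;;
  esac
  # Forward lookup: the name must resolve back to the original IP.
  host "$name" | awk -v ip="$ip" '$NF == ip {found = 1} END {exit !found}'
}
```

To use it in place of the grep test in the Step 5 loop, call is_verified_googlebot "$ip" && echo "$ip". The suffix match requires a leading dot, so a PTR name such as fake-googlebot.com is rejected.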

Step-by-Step: Join 404s to the Referrer Causing Them

The referrer (field $11 in combined format) tells you where the broken link lives. Fixing the source link is often better than redirecting the destination, because it stops the bad request at the origin.

Step 6: Pair each broken URL with its top referrers
Print the requested path and the referrer together, then rank the combinations.

awk '$9 == 404 {print $7, "<=", $11}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: $11 is the quoted referrer. Grouping path-plus-referrer reveals which on-site (or off-site) page links to each dead URL. An internal referrer means you can fix the link in your own templates; an external one means the link is on a third-party site and the destination needs a redirect.
Expected Output:

    602 /old-product-page <= "https://example.com/catalog"
    310 /assets/logo-v1.png <= "https://example.com/"

Edge Case — empty referrers ("-"): Direct hits, bookmarks, and many bot requests log "-" as the referrer. Filter them out to focus on linkable sources: append && $11 != "\"-\"" to the condition. A high share of "-" referrers on a 404 usually means stale external links or indexed dead URLs rather than a fixable on-site link.
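The internal-versus-external distinction from Step 6 and the "-" case can be handled in a single pass by labeling each 404 with the kind of referrer behind it. This is a sketch under assumptions: referrer_class_404 is a hypothetical name, the site hostname is passed in by you (example.com is only a placeholder), and the hostname comparison is exact, so www.example.com and example.com count as different hosts.

```shell
# Count 404 hits per path, labeled by referrer type: internal, external, or none.
# The second argument is your own hostname (placeholder: example.com).
# Usage: referrer_class_404 access.log example.com
referrer_class_404() {
  awk -v site="$2" '
    $9 == 404 {
      ref = $11
      gsub(/"/, "", ref)                 # strip the surrounding quotes
      refhost = ref
      sub(/^[a-z]+:\/\//, "", refhost)   # drop the scheme
      sub(/[\/:?#].*/, "", refhost)      # keep only the hostname
      if (ref == "-" || ref == "") kind = "none"
      else if (refhost == site)    kind = "internal"
      else                         kind = "external"
      print kind, $7
    }
  ' "$1" | sort | uniq -c | sort -nr | head -20
}
```

Rows labeled internal are candidates for fixing the source link; rows labeled external or none are the ones where a redirect or a 410 on the destination is the only lever you have.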

Edge Case — log format without a referrer field: The common log format (not combined) has no referrer or user-agent. Confirm your format first with head -1 access.log; if a line ends at the byte count after the status code, you cannot run Steps 4–6 until you switch the server to combined logging.
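The format check can also be scripted rather than eyeballed. The heuristic below counts the pieces a line splits into on double quotes; it is a rough sketch, log_format is a hypothetical name, and it inspects only the first line, so a log with mixed or custom formats may be misreported.

```shell
# Guess whether a log is in combined or common format from its first line.
# Splitting on double quotes, a combined line yields 7 fields
# (prefix, request, status/bytes, referrer, space, user-agent, empty tail);
# a common-format line yields 3 (prefix, request, status/bytes).
# Usage: log_format access.log
log_format() {
  head -1 "$1" | awk -F'"' '
    NF >= 7 { print "combined"; next }
    NF == 3 { print "common"; next }
            { print "unknown" }
  '
}
```

If it prints common, Steps 1 through 3 still work because $7 and $9 sit in the same positions, but the referrer and user-agent steps have nothing to read.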

Verification: Confirm the Fix Landed

After deploying redirects or removing broken links, re-run the ranking against fresh logs and confirm the top offenders have dropped toward zero. The cleanest check counts the specific URL you fixed across the old and new status codes:

awk '$7 == "/old-product-page" {print $9}' access.log | sort | uniq -c

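Expected Output (post-fix): the path's requests now appear under 301 rather than 404, for example 1843 301 where the earlier log showed 1843 404. The exact count will differ because the fresh log covers a different period, and a log that spans the deployment will show both a 404 row and a 301 row. Once the redirect is confirmed, watch for chains: a 301 pointing to another 301 still wastes crawl budget, which you can detect with the technique in finding redirect chains in server logs with awk.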
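Checking one URL at a time gets tedious once you have fixed a batch. The sketch below takes the top N offenders from an older log and reports how many 404s each still receives in a newer one. The function name recheck_404 and the file names old.log and new.log are placeholders, not names used elsewhere in this guide.

```shell
# For the top N 404 paths in an older log, show how many 404s each
# still receives in a newer log. Output: "old_count -> new_count  path".
# Usage: recheck_404 old.log new.log 20
recheck_404() {
  awk '$9 == 404 {print $7}' "$1" | sort | uniq -c | sort -nr | head -n "${3:-20}" |
    awk '
      phase == 1 { want[$2] = $1; order[++n] = $2; next }   # ranked paths from the old log
      $9 == 404 && ($7 in want) { still[$7]++ }             # 404s remaining in the new log
      END {
        for (i = 1; i <= n; i++)
          printf "%d -> %d  %s\n", want[order[i]], still[order[i]], order[i]
      }
    ' phase=1 - phase=2 "$2"
}
```

The phase=1 and phase=2 assignments on the awk command line take effect as each input is opened, which keeps the two inputs apart even when the old log has no 404s at all. A right-hand count near zero means the fix landed.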

Remediation Priorities: 301 vs 410 vs Soft-404

Scenario | Action | Why
-------- | ------ | ---
High-traffic 404 with a clear equivalent page | 301 to the equivalent | Preserves link equity and stops the waste
High-traffic 404 with no equivalent, gone for good | 410 Gone | Tells crawlers to drop it faster than a 404
404 driven by an internal broken link | Fix the source link | Removes the request at origin; no redirect needed
Low-volume one-off 404s (scanners, typos) | Leave as 404 | Correct behavior; redirecting them only adds bloat
Page returns 200 but is effectively empty/missing | Investigate soft-404 | A 200 that should be a 404 hides the problem from this audit

That last row is the dangerous one: a soft-404 returns 200 and never appears in any $9 == 404 filter, so it silently evades this entire workflow. Learn to surface those separately in detecting soft 404s in server logs. Once you know which bots are driving the rest of your error volume, rank them with extracting the top bot user agents from logs.

Common Mistakes

  • Counting query-string variants separately: Skipping the awk -F'?' normalization in Step 3 splits one broken endpoint across dozens of low-ranked entries, hiding its true impact. Always normalize before you trust the ranking.
  • Treating spoofed Googlebot as real: Acting on raw /Googlebot/ counts without the reverse-DNS verification in Step 5 leads you to prioritize crawl-budget fixes for traffic that is actually a scraper. Verify first.
  • Redirecting every 404 to the homepage: Blanket 301s to / create soft-404s — the page loads 200 but the content the user wanted is gone. Map each high-traffic 404 to a genuine equivalent or return 410 instead.

Frequently Asked Questions

Why use $9 == 404 instead of grep 404?
grep 404 matches the literal string 404 anywhere on the line, such as a byte count of 404, a path like /error-404, or a referrer URL that contains 404, which produces false positives. Anchoring to field $9 with awk matches only the actual HTTP status code, giving an accurate count.

Should a removed page return 404 or 410?
Use 410 (Gone) when the page is permanently removed and has no replacement; search engines tend to drop 410 URLs from the index faster than 404s. Use 301 when an equivalent page exists so you keep the link equity. Reserve plain 404 for genuinely unexpected misses.

How do I find which internal page links to a broken URL?
Join the 404 to its referrer with awk '$9 == 404 {print $7, $11}' (Step 6). When the referrer is an on-site URL, that page contains the broken link in its template or content — fix the link there rather than redirecting the destination.

Part of the CLI One-Liners for Quick Audits series.