Detecting Soft 404s in Server Logs
A soft 404 is a page that returns HTTP 200 OK but is really empty, "not found," or thin — an out-of-stock product, a tag archive with zero posts, an internal search with no results. Search engines fetch it, find nothing of value, and waste crawl budget revisiting it; worse, they may quietly drop it from the index and flag it in Search Console. Because the status line says 200, status-code filtering alone can never find these pages. This guide shows how to surface soft 404s from raw access logs using response size as a proxy for emptiness, then fix them at the source. It is one of the trickier cases in diagnosing crawl budget waste precisely because the log's most useful field — the status code — lies to you.
The objective: use the body_bytes_sent field ($10 in combined format) to flag suspiciously small 200 responses, cross-reference them against known empty-result path patterns, and confirm the diagnosis by checking whether crawlers keep revisiting them. Everything runs on a standard combined-format log with awk, grep, sort, and uniq.
Why Status Codes Can't See Soft 404s
A normal "not found" returns 404 or 410, and you can grep for it in one line. A soft 404 returns 200 because the application generated a valid page — the template rendered, the HTTP handler succeeded — it just had no real content to put in it. From the crawler's and the log's perspective, the request was a complete success. The mismatch between HTTP success and content emptiness is exactly what makes soft 404s invisible to status-based audits and is why they are easy to confuse with healthy pages. For the full status-code landscape these pages distort, see understanding HTTP status codes in server logs.
Since the status field is useless here, the next-best log signal is the response body size. A real article is tens of kilobytes; an empty-results template is often a few hundred bytes of boilerplate. That size gap is what we exploit.
Confirming the Symptom in Raw Logs
Pull the 200 responses and show their byte sizes. In combined format, $9 is the status and $10 is body_bytes_sent. List the smallest successful responses first:
awk '$9 == 200 {print $10, $7}' access.log | sort -n | head -10
Expected Output:
312 /search?q=zzqqxx
318 /tag/discontinued-2019
318 /tag/seasonal-archive
325 /shop/widget-out-of-stock
402 /search?q=asdfgh
512 /author/former-staffer
33914 /blog/real-article
41207 /products/popular-item
That band of 200 responses around 300–500 bytes is the tell. A real page on this site is 30 KB or more; anything returning a few hundred bytes with a 200 is almost certainly a shell template with no content — a soft 404. The paths confirm it: empty searches, an out-of-stock product, and tag archives with nothing in them.
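To see the band at a glance rather than reading sorted output, you can bucket successful responses by size. The sketch below is one way to do that, assuming a combined-format log. The function name size_bands and the 1 KB and 10 KB band edges are illustrative choices, not values taken from this site, so adjust them to your own baseline.

```shell
# size_bands LOGFILE
# Counts 200 responses in three body-size bands so a cluster of tiny pages
# stands out. Band edges are illustrative assumptions.
size_bands() {
  awk '$9 == 200 {
         if ($10 < 1000)        band = "under_1KB"
         else if ($10 < 10000)  band = "1KB_to_10KB"
         else                   band = "10KB_plus"
         count[band]++
       }
       END {
         printf "under_1KB %d\n1KB_to_10KB %d\n10KB_plus %d\n",
                count["under_1KB"], count["1KB_to_10KB"], count["10KB_plus"]
       }' "$1"
}

# Usage: size_bands access.log
```

A large under_1KB count next to a healthy 10KB_plus count, with little in between, suggests a distinct population of shell pages rather than a smooth spread of page sizes.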
Step-by-Step: Flag Suspicious Small 200s
Step 1: Establish the normal body-size baseline. Before picking a threshold, find your typical real-page size so you do not flag legitimately compact pages. Compute the average body size of 200 responses to known content paths; the mean is a quick proxy, though the median is more robust when a few very large pages skew it.
awk '$9 == 200 && $7 ~ /^\/blog\// {sum+=$10; n++} END {if (n) printf "avg blog body: %d bytes (n=%d)\n", sum/n, n}' access.log
Expected Output:
avg blog body: 38420 bytes (n=1043)
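If a handful of very large pages inflate the average, the median gives a steadier baseline. This is a minimal sketch under the same combined-format assumption. The function name median_body_size and its path-prefix argument are hypothetical helpers introduced here for illustration.

```shell
# median_body_size LOGFILE PATH_PREFIX
# Prints the median body size of 200 responses whose path starts with
# PATH_PREFIX (for example /blog/).
median_body_size() {
  awk -v prefix="$2" '$9 == 200 && index($7, prefix) == 1 {print $10}' "$1" \
    | sort -n \
    | awk '{v[NR] = $1}
           END {
             if (NR == 0) { print "no matching requests"; exit }
             if (NR % 2) m = v[(NR + 1) / 2]                  # odd count: middle value
             else        m = (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of middle pair
             printf "median body: %d bytes (n=%d)\n", m, NR
           }'
}

# Usage: median_body_size access.log /blog/
```

If the median and the average differ widely, prefer the median when choosing the order-of-magnitude cutoff.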
Step 2: Set a threshold and list offenders. A response an order of magnitude below the content baseline is a strong candidate. Here we flag 200 responses under 1,000 bytes and rank the worst paths by total hit count; crawler-only scoping comes in Step 3.
awk '$9 == 200 && $10 < 1000 {print $7}' access.log | sort | uniq -c | sort -nr | head -15
Expected Output:
840 /search
612 /tag/seasonal-archive
498 /shop/widget-out-of-stock
377 /tag/discontinued-2019
210 /author/former-staffer
95 /search?q=
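Because $7 in combined format includes the query string, every distinct search term counts as its own path in the query above. One way to group them, sketched here with a hypothetical function name small_200s_by_path and an optional threshold argument, is to strip everything from the first "?" before counting.

```shell
# small_200s_by_path LOGFILE [MAX_BYTES]
# Ranks small 200 responses by path with the query string removed.
# MAX_BYTES defaults to 1000, the cutoff used in Step 2.
small_200s_by_path() {
  awk -v max="${2:-1000}" '$9 == 200 && $10 + 0 < max + 0 {
         path = $7
         sub(/\?.*/, "", path)   # drop everything from the first "?" onward
         print path
       }' "$1" | sort | uniq -c | sort -nr
}

# Usage: small_200s_by_path access.log 1000
```

Grouping this way shows the total cost of a generator such as /search, while the ungrouped list remains useful for spotting individual URLs.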
Step 3: Scope to crawler traffic. A small 200 only wastes crawl budget if a crawler is fetching it. Restrict to verified search-engine bots so you act on real budget drain, not a single user. Verify the bot is genuine first (a spoofed agent skews the list) using the technique in detecting fake Googlebot traffic.
grep -iE "googlebot|bingbot" access.log \
| awk '$9 == 200 && $10 < 1000 {print $7}' \
| sort | uniq -c | sort -nr | head -10
Expected Output:
514 /search
389 /tag/seasonal-archive
301 /shop/widget-out-of-stock
188 /tag/discontinued-2019
Step 4: Confirm repeated revisits. A defining trait of a soft 404 is that crawlers keep coming back — they have no 404 signal telling them to drop the URL, so it stays in the crawl schedule indefinitely. Count distinct days each suspect path was crawled.
grep -i googlebot access.log \
| awk '$9 == 200 && $10 < 1000 && $7 == "/tag/seasonal-archive" {
split($4, d, ":"); print d[1]}' \
| sort -u | wc -l
Expected Output:
14
Fourteen distinct days of Googlebot fetching the same empty tag archive in one log window confirms a persistent soft 404 quietly draining budget.
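Checking one path at a time gets tedious once the suspect list grows. The following sketch counts distinct crawl days for every small 200 path in a single pass. It assumes the same combined format and matches Googlebot by user-agent string only, so it does not verify that the bot is genuine. The function name revisit_days is a hypothetical helper.

```shell
# revisit_days LOGFILE [MAX_BYTES]
# For each small 200 path fetched by a Googlebot user agent, prints the
# number of distinct days it was crawled, highest first.
revisit_days() {
  grep -i googlebot "$1" \
    | awk -v max="${2:-1000}" '$9 == 200 && $10 + 0 < max + 0 {
           split($4, d, ":")            # d[1] looks like "[10/Oct/2023"
           key = $7 SUBSEP d[1]
           if (!(key in seen)) { seen[key] = 1; days[$7]++ }
         }
         END { for (p in days) print days[p], p }' \
    | sort -nr
}

# Usage: revisit_days access.log 1000
```

Paths at the top of this list are the ones crawlers return to most persistently, which makes them reasonable first candidates for remediation.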
Edge Cases & Gotchas
Legitimately tiny real pages. A redirect stub, an AMP variant, or a JSON API endpoint can legitimately return a small 200. Before remediating, exclude known-good small paths with a guard like $7 !~ /\.(json|xml)$/ and spot-check a couple of flagged URLs in a browser. The size threshold is a candidate filter, not a verdict.
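As a sketch of that guard in context, the function below applies the exclusion to the small-200 query. It strips the query string before testing the extension, which is a small variation on the guard above so that a URL like /feed.xml?page=2 is also excluded. The function name small_200s_excluding_known is hypothetical, and the extension list is only a starting point.

```shell
# small_200s_excluding_known LOGFILE
# Lists small 200 responses, skipping paths that end in .json or .xml
# (tested after removing any query string).
small_200s_excluding_known() {
  awk '$9 == 200 && $10 < 1000 {
         path = $7
         sub(/\?.*/, "", path)                   # compare the path without its query string
         if (path !~ /\.(json|xml)$/) print $7   # print the original URL for inspection
       }' "$1" | sort | uniq -c | sort -nr
}

# Usage: small_200s_excluding_known access.log
```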
Gzip and the byte-count meaning. body_bytes_sent is the bytes sent after compression when gzip is on, so a compressed real page can look smaller than you expect. If your server gzips HTML, raise the threshold and calibrate it against the compressed size of a known-good page rather than the uncompressed source.
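For a rough sense of how much compression shrinks a real page, you can compare the raw and gzipped size of a saved copy of a known-good page. This is only an approximation: it assumes you have saved the HTML locally (the file name known-good.html in the usage line is hypothetical), and your server's compression level or algorithm may differ from the local gzip default, so the logged byte count remains the authoritative figure.

```shell
# compressed_size FILE
# Prints the raw and gzip-compressed byte counts of a saved HTML file.
compressed_size() {
  raw=$(( $(wc -c < "$1") ))             # arithmetic expansion trims any padding from wc
  gz=$(( $(gzip -c "$1" | wc -c) ))
  printf 'raw: %d bytes, gzipped: %d bytes\n' "$raw" "$gz"
}

# Usage: compressed_size known-good.html
```

If the gzipped figure is, say, a quarter of the raw size, scale your threshold against that compressed number rather than the size of the HTML source.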
Remediation
Fix soft 404s by aligning the HTTP signal with reality:
1. Return a real 404 or 410. When a product is gone or a tag is empty, the handler should send 404 Not Found (or 410 Gone if permanent). This is the single most important fix: it tells crawlers to drop the URL from the index and to recheck it far less often. After deploying, those paths will appear with status 404/410 in the log, which you can then track as part of normal crawl hygiene.
2. noindex thin-but-valid pages. Some thin pages should stay reachable for users (an out-of-stock product you will restock) but must not consume index slots. Add <meta name="robots" content="noindex,follow"> so the page stays live but exits the index.
3. Block worthless generators in robots.txt. Internal search results (/search?q=) should rarely be crawled at all; disallow them so the crawler never generates the soft-404 fetches in the first place. Coordinate this with your broader robots.txt and crawl rate control rules.
User-agent: *
Disallow: /search
4. Fix the template. If an empty tag archive renders a full 200 shell, the template logic is wrong: when the result set is empty, the controller should issue a 404 response rather than rendering the layout. This is the durable fix — it stops new soft 404s from ever being generated.
Production Warning: Returning 410 Gone is permanent and signals crawlers to drop the URL aggressively. Use it only for content that will never return; for temporary outages use 404 (or 503 if the whole page is briefly unavailable), so you do not strand URLs you intend to restore.
Verification
After deploying the template fix and robots rules, re-run the small-200 crawler query on a post-fix log window and confirm the count collapses while the same paths now appear as 404/410:
echo "Small 200s after fix:"
grep -i googlebot access.log | awk '$9 == 200 && $10 < 1000' | wc -l
echo "Now correctly 404/410:"
grep -i googlebot access.log | awk '$7 == "/tag/seasonal-archive" {print $9}' | sort | uniq -c
Expected Output:
Small 200s after fix:
23
Now correctly 404/410:
41 410
A drop from hundreds of small 200s to a couple dozen, plus the formerly-soft path now consistently returning 410, confirms the soft 404s are resolved; expect crawler revisits to those paths to taper off rather than stop overnight.
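To check every previously flagged path rather than one, you can save the suspect list from the pre-fix log and look up what each path returns in the post-fix log. The sketch below assumes a file with one path per line; the names status_after_fix, suspects.txt, and post-fix.log are hypothetical. It also assumes the path file is non-empty, since the NR == FNR idiom misreads the log as the path list otherwise.

```shell
# status_after_fix PATHS_FILE POST_FIX_LOG
# For each path listed in PATHS_FILE, prints "path status count" as seen
# in POST_FIX_LOG.
status_after_fix() {
  awk 'NR == FNR { suspect[$1] = 1; next }      # first file: load suspect paths
       ($7 in suspect) { count[$7 " " $9]++ }   # second file: tally status per path
       END { for (k in count) print k, count[k] }' "$1" "$2" | sort
}

# Usage (building the list from the Step 2 query first):
#   awk '$9 == 200 && $10 < 1000 {print $7}' access.log | sort -u > suspects.txt
#   status_after_fix suspects.txt post-fix.log
```

Any suspect path that still shows a 200 row after the fix deserves a second look at its template or handler. Paths blocked in robots.txt should simply stop appearing for compliant crawlers.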
Common Mistakes
- Filtering only by status code. Because soft 404s return 200, any audit built on grep 404 misses them entirely. You must use body size (or a content check) as the signal, because the status field is the very thing that is wrong.
- Picking a static byte threshold for every site. A 1,000-byte cutoff that works on a content site flags legitimate API responses on another. Always calibrate the threshold against your own median real-page size, and account for gzip if HTML is compressed.
- Using 301 redirects as the fix. Redirecting an out-of-stock product to the homepage or a category is itself treated as a soft 404 by Google if the target is unrelated. Return a real 404/410, or redirect only to a genuinely equivalent page.
Frequently Asked Questions
Why can't I just grep for 404s to find soft 404s?
Because a soft 404 returns HTTP 200, not 404 — that is the entire definition of the problem. The application rendered a valid page and reported success even though the page has no real content. Status-code filtering will show the page as healthy. You have to use a different log signal, most practically the body_bytes_sent field, to flag 200 responses whose body is far too small to be a real page, then confirm against known empty-result path patterns.
What body size should I treat as a soft 404 threshold?
There is no universal number; calibrate it to your site. Compute the average or median body_bytes_sent for 200 responses on known good content paths (for example your /blog/ section), then flag responses an order of magnitude below that. If your server gzips HTML, measure against the compressed size of a known-good page, since body_bytes_sent reflects post-compression bytes. Treat the threshold as a candidate filter and spot-check flagged URLs before remediating.
How do I confirm a small 200 is really a soft 404 and not a tiny real page?
Combine three signals. First, the body size is far below your content baseline. Second, the path matches a known empty-result pattern such as /search?q=, an empty /tag/ archive, or an out-of-stock product. Third, crawlers revisit it across many days because no 404 ever told them to drop it. When all three line up, it is a soft 404; if the path is a legitimate JSON endpoint or redirect stub, exclude it and move on.
Related Guides
- Finding Crawl Budget Waste from URL Parameters — parameterized empty-result pages are a common soft-404 source.
- Identifying Orphan Pages from Log Analysis — another budget leak that status codes alone won't reveal.
- Understanding HTTP Status Codes in Server Logs — why a 200 can be misleading and when to return 404 vs 410.
Part of the Diagnosing Crawl Budget Waste series.