Understanding HTTP Status Codes in Server Logs
The status code is the single most actionable field in an access log: it tells you, per request, whether a crawler found content, was bounced through a redirect, hit a dead URL, or slammed into a failing backend. But the raw three-digit number is only half the story — a 200 can be a soft 404 wasting crawl budget, and a 302 can be silently bleeding link equity. This guide turns the status field into decisions you can act on.
You will isolate and rank status codes from the combined log format, read them by class (2xx/3xx/4xx/5xx) instead of one at a time, and run the targeted queries that expose the two failure modes raw counts hide: soft 404s and redirect chains. The status code is written by the %>s token, so this guide builds directly on log field interpretation and decoding, and on knowing exactly which column holds the status, as covered in how to decode the Apache combined log format.
The Symptom: A Status Distribution You Can't Interpret
Crawl stats in Search Console show Googlebot spending budget but indexation is stalling. The access log holds the answer, but a flat list of status codes does not interpret itself — you need to bucket by class and spot the anomalies. In the standard combined log format the status code is the 9th space-delimited field ($9):
awk '{print $9}' access.log | grep -E '^[0-9]{3}$' | sort | uniq -c | sort -nr
Expected Output:
612340 200
48201 301
31755 304
12880 404
4120 302
910 500
214 503
This command targets the 9th field, validates it as a 3-digit status with regex (discarding malformed rows), aggregates occurrences, and ranks them by frequency for immediate anomaly spotting. The 48201 permanent redirects and 12880 404s are the first things to investigate; the 503s point at intermittent backend failures. Run the same pipeline against rotated archives to track historical error trends.
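Counting whitespace-separated fields assumes that nothing before the status contains an unexpected space. As a hedged alternative, the sketch below anchors the status to the position right after the quoted request line. The `COMBINED` pattern and the `count_statuses` helper are illustrative names, not part of any library, and the sample lines are made up. Lines that do not match (including requests containing escaped quotes) are skipped.

```python
import re
from collections import Counter

# Assumed layout: Apache combined log format. This pattern is an
# illustrative approximation; adjust it if your LogFormat differs.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_statuses(lines):
    """Return a Counter of status codes, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:38 +0000] "GET /a HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    'garbage line',
]

# Rank codes by frequency, mirroring the sort | uniq -c | sort -nr pipeline.
for code, count in count_statuses(sample).most_common():
    print(f"{count:>7} {code}")
```

To run it on a real file, pass an open file handle to `count_statuses` in place of `sample`; the function reads one line at a time.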
Concept: Read Status Codes by Class, Not One at a Time
Every status code belongs to a class, and the class tells you the crawl meaning and the action before you even look at the specific number. The table below maps the specific codes you will meet most often to their class, what each means for a crawler, and what you should do about it.
| Code | Class | Crawl Meaning | Action |
|---|---|---|---|
| 200 | 2xx | Served, indexable | Verify it is not a soft 404 |
| 301 | 3xx | Permanent redirect | Consolidates signals; collapse chains |
| 302 | 3xx | Temporary redirect | Signals not consolidated; usually wrong long-term |
| 304 | 3xx | Not modified | Healthy; conditional GET, no body |
| 404 | 4xx | Not found | Fix internal links or return 410 |
| 410 | 4xx | Gone | Fastest signal for index pruning |
| 503 | 5xx | Service unavailable | Crawler backs off; fix urgently |
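The table can be turned into a small lookup for scripted triage. This is a minimal sketch: `status_class` and `triage` are hypothetical helper names, and the label strings paraphrase the table rather than any official vocabulary. 304 is handled separately from the rest of 3xx because it is a healthy conditional response, not a redirect hop.

```python
def status_class(code):
    """Return the class label ('2xx', '3xx', ...) for a three-digit status string."""
    if len(code) == 3 and code.isdigit() and code[0] in "12345":
        return code[0] + "xx"
    return "invalid"


def triage(code):
    """Map a status code to a short action label (labels are illustrative)."""
    # Specific codes that deserve different handling from their class.
    if code == "304":
        return "healthy: not modified"
    if code == "200":
        return "check for soft 404"
    by_class = {
        "2xx": "served",
        "3xx": "redirect: check for chains",
        "4xx": "client error: fix links or return 410",
        "5xx": "server error: fix urgently",
    }
    return by_class.get(status_class(code), "ignore: informational or malformed")


for code in ["200", "301", "304", "404", "503", "-"]:
    print(code, status_class(code), triage(code))
```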
Step-by-Step: From Counts to Decisions
Step 1: Isolate the error classes with a streaming script.
Aggregating only 4xx and 5xx codes filters out the healthy traffic and surfaces what actually drains crawl budget. This Python script reads stdin line by line to stay memory-flat on huge files.
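import sys
import collections

# Tally only client (4xx) and server (5xx) error codes.
status_counts = collections.Counter()

# Stream stdin line by line so memory use stays flat on huge files.
for line in sys.stdin:
    parts = line.split()
    if len(parts) >= 9:
        code = parts[8]  # index 8 = 9th field (0-based)
        if code.startswith(('4', '5')):
            status_counts[code] += 1

# Print codes from most to least frequent.
for code, count in status_counts.most_common():
    print(f'{code}: {count}')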
Pipe your access log into it:
cat access.log | python3 parse_errors.py
Expected Output:
404: 12880
500: 910
503: 214
403: 96
The script safely extracts the status field at index 8, filters exclusively for client and server errors, and outputs a sorted frequency map for rapid SRE triage — the 404 and 503 rows are your priorities.
Step 2: Find soft 404s — 200s that are really errors.
A soft 404 returns 200 OK but serves a thin or empty error page, so it never appears in your 4xx counts yet still wastes crawl budget. Correlate status with response size (field $10 in combined format) to surface them.
# 200 OK responses with suspiciously small bodies (<500 bytes)
awk '$9 == "200" && $10+0 < 500 {print $7}' access.log \
| sort | uniq -c | sort -nr
Expected Output:
832 /search?q=nonexistent-product
410 /old-category/
118 /tag/discontinued/
The $10+0 coerces the size field to a number, treating - (Apache's zero-byte marker) as 0. These URLs return success but almost no content — classic soft 404s. Confirm and quantify them systematically with detecting soft 404s in server logs.
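The same size correlation can be sketched in Python for use inside a larger script. `soft_404_candidates` is a hypothetical helper, the 500-byte default mirrors the awk threshold above, and the sample lines are made up. Like the awk version, it treats a non-numeric size such as `-` as 0.

```python
from collections import Counter


def soft_404_candidates(lines, max_bytes=500):
    """Count paths that returned 200 with a body smaller than max_bytes."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 10 or parts[8] != "200":
            continue
        size_field = parts[9]
        # Apache logs "-" when no body bytes were sent; treat it as 0.
        size = int(size_field) if size_field.isdigit() else 0
        if size < max_bytes:
            hits[parts[6]] += 1  # index 6 = request path
    return hits


# Made-up sample lines, for illustration only.
sample = [
    '198.51.100.7 - - [10/Oct/2024:14:00:01 +0000] "GET /old-category/ HTTP/1.1" 200 212 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2024:14:00:02 +0000] "GET /guide HTTP/1.1" 200 18432 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2024:14:00:03 +0000] "GET /old-category/ HTTP/1.1" 200 - "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2024:14:00:04 +0000] "GET /missing HTTP/1.1" 404 150 "-" "Googlebot/2.1"',
]

for path, count in soft_404_candidates(sample).most_common():
    print(f"{count:>5} {path}")
```

HEAD requests legitimately return no body, so on real logs you may want to restrict the check to GET requests before trusting the list.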
Step 3: Trace redirect chains the crawler is paying for.
Each hop in a redirect chain is a separate crawl request. Isolate Googlebot's redirect responses to see which paths cost extra round-trips.
grep -i "googlebot" access.log \
| awk '$9 ~ /^30[12]$/ {print $9, $7}' \
| sort | uniq -c | sort -nr | head
Expected Output:
4120 301 /blog
3980 301 /products
910 302 /cart
233 302 /login
The /blog and /products paths each fire thousands of 301s, likely a trailing-slash or HTTP-to-HTTPS hop that could be collapsed. The 302s on /cart and /login are temporary redirects: session or authentication bounces may be intentional, but any that represent a permanent move should become 301s. Run them down with redirect chain optimization, and to enumerate the worst dead URLs feeding 404s, use finding the top 404 URLs with awk.
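Access logs record each hop but not where it points, so measuring a chain requires a source-to-target map from somewhere else, such as your server's rewrite rules. The sketch below assumes such a map exists as a plain dictionary. The `redirects` data, `resolve_chain`, and `collapse` are all hypothetical names for illustration.

```python
def resolve_chain(start, redirects, max_hops=10):
    """Follow start through the redirects map; return the list of URLs visited."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:  # loop detected; stop following
            break
        seen.add(target)
    return chain


def collapse(redirects):
    """Point every source straight at the last URL of its chain."""
    return {src: resolve_chain(src, redirects)[-1] for src in redirects}


# Hypothetical redirect map: an HTTP-to-HTTPS hop followed by a trailing-slash hop.
redirects = {
    "http://example.com/blog": "https://example.com/blog",
    "https://example.com/blog": "https://example.com/blog/",
}

chain = resolve_chain("http://example.com/blog", redirects)
print(" -> ".join(chain), f"({len(chain) - 1} hops)")
print(collapse(redirects))
```

For a looping map, `resolve_chain` stops at the first repeated URL, so check the returned chain for duplicates before applying `collapse` output to a live configuration.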
Production Warning: Before you trust any frequency analysis, confirm there are no gaps in the log window. A logrotate job that ran mid-analysis or a missing archive silently drops requests and skews every percentage. Verify your retention and rotation schedule first.
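One way to check for gaps is to scan the timestamps and flag any jump larger than a chosen threshold. This is a sketch under assumptions: lines are in chronological order, timestamps use the default Apache format with English month abbreviations, and the 10-minute threshold is arbitrary. `find_gaps` is a hypothetical helper, and the sample lines are made up.

```python
from datetime import datetime, timedelta

# Default Apache timestamp layout, e.g. 10/Oct/2024:13:55:36 +0000
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous, current) timestamp pairs separated by more than threshold."""
    gaps = []
    previous = None
    for line in lines:
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # no bracketed timestamp on this line
        try:
            current = datetime.strptime(line[start + 1:end], TIME_FORMAT)
        except ValueError:
            continue  # timestamp did not match the assumed format
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2024:13:56:00 +0000] "GET /b HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2024:14:30:00 +0000] "GET /c HTTP/1.1" 200 900',
]

for prev, cur in find_gaps(sample):
    print(f"gap of {cur - prev} between {prev} and {cur}")
```

A flagged gap is only a prompt to investigate: a quiet site can produce the same silence as a lost archive, so set the threshold relative to your normal traffic.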
Edge Cases
304 Not Modified is healthy, not an error. A high 304 count means crawlers are sending conditional requests and your server is correctly saving bandwidth by returning no body. Do not lump 3xx together and alarm on it — exclude 304 from redirect-chain analysis with awk '$9 ~ /^30[12]$/' as shown, which matches only 301 and 302.
404 vs 410 changes how fast Google forgets a URL. Both say "the page is not here," but 410 Gone signals permanent removal and Google tends to drop the URL from the index faster, while 404 invites repeat crawl attempts. For URLs you have deliberately retired, returning 410 tells the crawler the removal is intentional, which reduces repeat requests and reclaims budget. Reserve 404 for genuinely unexpected misses you intend to fix.
Verification
After you fix a batch of redirects or convert soft 404s to real 404/410 responses, re-run the class breakdown and confirm the distribution moved the right way:
awk '{c[substr($9,1,1)"xx"]++} END {for (k in c) print k, c[k]}' access.log \
| sort
Expected Output:
2xx 631200
3xx 41005
4xx 8900
5xx 480
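The substr($9,1,1)"xx" expression groups every code by its leading digit into class buckets in a single pass. A lower 3xx total (fewer redirect hops) and a lower 4xx total (dead internal links fixed, retired URLs pruned) confirm the fix landed. Converting soft 404s to real 404 or 410 responses first moves those requests from 2xx to 4xx, so expect a temporary 4xx rise that falls as crawlers stop requesting the URLs. Snapshot this before and after every change so you can prove crawl-budget recovery.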
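The before-and-after comparison can be sketched as a diff of two class breakdowns. `class_breakdown` and `compare` are hypothetical helpers, and the two snapshots below are small made-up counts, not figures from this guide.

```python
from collections import Counter


def class_breakdown(lines):
    """Count requests per status class, ignoring rows without a 3-digit status."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 9 and len(parts[8]) == 3 and parts[8].isdigit():
            counts[parts[8][0] + "xx"] += 1
    return counts


def compare(before, after):
    """Return {class: (before, after, delta)} for every class in either snapshot."""
    return {
        cls: (before[cls], after[cls], after[cls] - before[cls])
        for cls in sorted(set(before) | set(after))
    }


# Made-up snapshots, for illustration only.
before = Counter({"2xx": 90, "3xx": 20, "4xx": 8})
after = Counter({"2xx": 95, "3xx": 12, "4xx": 5, "5xx": 1})

for cls, (old, new, delta) in compare(before, after).items():
    print(f"{cls}: {old} -> {new} ({delta:+d})")
```

Raw deltas are only meaningful when both snapshots cover windows of equal length and similar traffic; otherwise compare each class as a share of total requests.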
Common Mistakes
- Treating all 200 OK responses as successful crawls. Many 200s are soft 404s or thin pages that waste budget. Cross-validate status against response length, title tags, and meta robots directives: a 200 with a 200-byte body is not a real page.
- Misinterpreting 301 vs 302 redirect chains. Prolonged 302 chains cause crawler loops and index bloat; search engines treat 302 as temporary and do not consolidate link signals through them. Resolve chains to a single permanent 301, or an explicit 410, and audit server-side routing.
- Ignoring log rotation gaps during frequency analysis. Missing log segments skew status-code distribution metrics and make a fix look like it worked when the data is simply absent. Verify logrotate schedules and retention windows before calculating any error rate.
Frequently Asked Questions
Why do my logs show 200 OK for pages Google marks as 'Not Found'?
This is a soft 404: the server returns a 200 status with a custom error page or empty content. Google's crawler analyzes the page content and title tags and overrides the HTTP status, classifying it as a soft 404. Detect them by correlating 200 responses with abnormally small response sizes, then return a real 404 or 410 so the crawler stops re-requesting them.
How do I distinguish between a legitimate 404 and a crawl budget drain?
Filter the log by user-agent to isolate verified search-engine bots. If Googlebot requests non-existent URLs at high frequency — visible by ranking 404 paths by hit count — that is a budget drain. Fix the internal links pointing at those URLs, or return a 410 Gone to stop the wasted crawl cycles permanently.
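Ranking 404 paths for one crawler can be sketched as below. The substring match on the user-agent is only a stand-in: it does not verify that the requests really came from the search engine, which needs a separate check. `top_404_paths` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def top_404_paths(lines, agent="googlebot", limit=10):
    """Return the most-requested 404 paths for lines mentioning the agent."""
    hits = Counter()
    for line in lines:
        # Case-insensitive match anywhere in the line, like grep -i.
        if agent not in line.lower():
            continue
        parts = line.split()
        if len(parts) >= 9 and parts[8] == "404":
            hits[parts[6]] += 1  # index 6 = request path
    return hits.most_common(limit)


# Made-up sample lines, for illustration only.
sample = [
    '198.51.100.7 - - [10/Oct/2024:14:00:01 +0000] "GET /gone HTTP/1.1" 404 150 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2024:14:00:02 +0000] "GET /gone HTTP/1.1" 404 150 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2024:14:00:03 +0000] "GET /other HTTP/1.1" 404 150 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [10/Oct/2024:14:00:04 +0000] "GET /gone HTTP/1.1" 404 150 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2024:14:00:05 +0000] "GET /guide HTTP/1.1" 200 9000 "-" "Googlebot/2.1"',
]

for path, count in top_404_paths(sample):
    print(f"{count:>5} {path}")
```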
Can I safely ignore 5xx errors in historical logs?
No. Persistent 5xx errors, especially 502 and 503, signal infrastructure instability. They trigger crawler backoff, which directly reduces your crawl rate and delays indexation of new or updated content. Treat any sustained 5xx band as a crawl-budget emergency, not a transient blip.
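To tell a sustained 5xx band from a brief blip, it helps to bucket errors by hour. The following is a minimal sketch assuming the default Apache timestamp layout in the 4th field. `server_errors_per_hour` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def server_errors_per_hour(lines):
    """Return {hour: (errors, total)} keyed by 'dd/Mon/yyyy:HH'."""
    totals = Counter()
    errors = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 9 or not parts[3].startswith("["):
            continue
        hour = parts[3][1:15]  # '[10/Oct/2024:13:55:36' -> '10/Oct/2024:13'
        totals[hour] += 1
        if parts[8].startswith("5"):
            errors[hour] += 1
    return {hour: (errors[hour], totals[hour]) for hour in totals}


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 503 0',
    '203.0.113.5 - - [10/Oct/2024:13:58:10 +0000] "GET /b HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2024:14:02:00 +0000] "GET /c HTTP/1.1" 200 900',
]

for hour, (bad, total) in server_errors_per_hour(sample).items():
    print(f"{hour}  {bad}/{total} requests returned 5xx ({bad / total:.1%})")
```

Several consecutive hours with a non-trivial error share point to instability rather than a one-off failure; what counts as non-trivial depends on your traffic and is left for you to choose.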
Related Guides
- How to Decode the Apache Combined Log Format — confirm which column holds the status field before you parse it.
- Redirect Chain Optimization — collapse the 3xx hops this guide surfaces.
- Detecting Soft 404s in Server Logs — systematically catch 200s that are really errors.
- Finding the Top 404 URLs with awk — rank the dead URLs draining your budget.
Part of the Log Field Interpretation & Decoding series.