Auditing robots.txt Effectiveness with Server Logs

Publishing a robots.txt file is not the same as proving it works. The only ground truth for whether Googlebot actually respects your Disallow rules is your server access log: if a verified crawler is still requesting a blocked path, the directive failed somewhere between your editor and the crawler's fetch. This guide shows how to audit robots.txt effectiveness directly from logs — confirming the file itself serves a clean 200, grepping verified bot hits against your Disallow patterns, counting the leak rate, and isolating the four reasons a rule silently does nothing. Pair this audit with measuring crawl rate by hour from server logs to see whether a tightened rule actually moved crawl volume.

Diagnosis: A Disallowed Path Still Appears in the Log

The symptom is a path you believe is blocked showing up with a verified search-engine user agent against it. Start by listing every bot request whose URL matches a directory you intended to disallow — here, /cart/ and /search:

grep -E "Googlebot|bingbot" access.log | awk '$7 ~ /^\/(cart|search)/ {bot = ($0 ~ /Googlebot/) ? "Googlebot" : "bingbot"; print $7, $9, bot}' | sort | uniq -c | sort -nr | head

Expected Output:

    412 /search?q=blue+widget 200 Googlebot
    188 /cart/add?id=88 200 Googlebot
     27 /search?q=sale 200 Googlebot

If this returns rows, crawling of a "blocked" path is ongoing. The status code matters: a 200 means the crawler requested the page and your server delivered it in full, so the rule is being ignored or was never applied. Reference the HTTP status codes in server logs guide if the response is a 301 or 404 instead. Those mean the bot reached the path but your server, not robots.txt, dictated the outcome.
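The grep above matches on the user-agent string, which any client can send, so a row only counts as a verified crawler hit once its source IP checks out. A minimal sketch of that check, assuming the `host` utility is installed and that the crawler domains listed are the ones you want to trust (confirm them against Google's and Bing's current documentation). The function names `is_crawler_host` and `verify_crawler_ip` are hypothetical:

```shell
# is_crawler_host HOSTNAME: succeeds when a PTR name sits under a crawler domain.
# The domain list is an assumption; adjust it to the crawlers you audit.
is_crawler_host() {
  case "${1%.}" in                      # drop the trailing dot that `host` prints
    *.googlebot.com|*.google.com|*.search.msn.com) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_crawler_ip IP: reverse lookup, domain check, then forward-confirm
# that the PTR name resolves back to the same IP.
verify_crawler_ip() {
  ptr=$(host "$1" 2>/dev/null | awk '/domain name pointer/ {print $NF; exit}')
  [ -n "$ptr" ] && is_crawler_host "$ptr" || return 1
  host "$ptr" 2>/dev/null | awk -v ip="$1" '$NF == ip {ok = 1} END {exit !ok}'
}
```

Calling `verify_crawler_ip 66.249.66.1 && echo verified` performs two DNS lookups, so run it once per distinct IP in the log rather than once per request.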

Concept: Why a Disallow Rule Silently Fails

A Disallow directive can be present and still permit crawling for four distinct reasons, and each leaves a different fingerprint in your logs:

| Failure mode | What happened | Log fingerprint |
|---|---|---|
| Cached robots.txt | Google caches robots.txt for up to ~24h; a fresh rule has not taken effect yet | Hits to the blocked path stop ~24h after the robots.txt edit |
| Wrong user-agent group | Rule sits under User-agent: AdsBot-Google, but Googlebot obeys only the most specific group that names it and falls back to * when none does | Only one bot ignores it; others obey |
| Syntax error / precedence | A blank line splits a group, or a more specific Allow overrides the Disallow | All bots ignore the specific rule |
| robots.txt unreachable | The file returns a 4xx such as 404, so Google assumes full allow (a 5xx pauses crawling temporarily instead) | Crawler hits /robots.txt with a non-200 status |

The most important conceptual gotcha: Disallow blocks crawling, not indexing. A blocked URL can still appear in search results (with no snippet) if other pages link to it, because Google indexes the URL without fetching it. If your real goal is de-indexing, robots.txt is the wrong tool — you need a noindex header or meta tag on a crawlable page, which means you must not disallow it. Auditing logs tells you whether crawling stopped; it cannot tell you whether indexing stopped.

Step-by-Step robots.txt Audit

Step 1: Confirm robots.txt itself serves a clean 200. If the file is unreachable, every rule in it is void. Check how crawlers actually received it, not how it looks in your browser:

grep "/robots.txt" access.log | grep -E "Googlebot|bingbot" | awk '{print $9}' | sort | uniq -c

Expected Output:

    144 200
      3 304

You want only 200 and 304 (not-modified). Any 404, 500, or 503 here is your root cause — Google treats a 5xx on robots.txt as "crawl nothing" temporarily, but a 404 as "crawl everything." Confirm the live fetch too:

curl -sI https://example.com/robots.txt | head -1

Expected Output:

HTTP/2 200

Production Warning: Do not test robots.txt reachability only from inside your network. CDN, WAF, or geo-rules can serve a 403 to crawler IP ranges while returning 200 to your office IP. Always trust the access log over a local curl.
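The status counts from the log are easier to act on when each code is labeled with its effect. A small sketch under two assumptions: the log is in combined format (request path in field 7, status in field 9), and the mapping follows Google's documented handling as described above, with 429 grouped alongside 5xx. The name `robots_fetch_report` is hypothetical:

```shell
# robots_fetch_report LOGFILE
# Counts bot fetches of /robots.txt per status and labels the likely effect.
robots_fetch_report() {
  awk '
    $7 == "/robots.txt" && /Googlebot|bingbot/ { n[$9]++ }
    END {
      for (s in n) {
        if (s ~ /^2/ || s == "304")      effect = "rules in effect"
        else if (s == "429" || s ~ /^5/) effect = "crawling paused (treated as temporary full disallow)"
        else if (s ~ /^4/)               effect = "treated as no robots.txt (everything crawlable)"
        else                             effect = "redirect or other: check the final target"
        printf "%s %d %s\n", s, n[s], effect
      }
    }
  ' "$1" | sort
}
```

Run it as `robots_fetch_report access.log`. Any line that is not labeled "rules in effect" deserves attention before you look at individual Disallow patterns.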

Step 2: Extract the exact Disallow patterns from the served file. Audit against what is actually published, not what you remember writing:

curl -s https://example.com/robots.txt | tr -d '\r' | awk '{key=tolower($1)} key=="user-agent:"{ua=$2} key=="disallow:"{print ua, $2}'

Expected Output:

* /cart/
* /search
Googlebot /admin/

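This reveals which group each rule belongs to. If your blocked path lives only under User-agent: AdsBot-Google, the standard Googlebot will never see it, which is the "wrong user-agent group" failure. The reverse also holds: Googlebot obeys only the most specific group that names it and does not merge that group with *. In the output above, the Googlebot group holds only /admin/, so Googlebot is free to crawl /cart/ and /search even though the * group disallows them.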
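To see which rules a given crawler would actually obey, you can select the group the way the paragraph above describes. This is a simplified sketch, not Google's parser: it matches the user-agent token exactly (case-insensitive), merges repeated groups for the same token, and falls back to the * group only when no group names the bot. The function name `rules_for_bot` is hypothetical:

```shell
# rules_for_bot ROBOTS_FILE BOT_TOKEN
# Prints the allow/disallow lines from the group(s) the bot would obey:
# groups naming BOT_TOKEN if any exist, otherwise the "*" groups.
rules_for_bot() {
  tr -d '\r' < "$1" | awk -v bot="$2" '
    BEGIN { bot = tolower(bot) }
    {
      sub(/#.*/, "")                          # drop comments
      n = index($0, ":")
      if (n == 0) next                        # blank or malformed line
      key = tolower(substr($0, 1, n - 1)); val = substr($0, n + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key); gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    key == "user-agent" {
      if (in_rules) { in_rules = 0; cur_named = 0; cur_star = 0 }  # new group begins
      if (tolower(val) == bot) { cur_named = 1; seen_named = 1 }
      if (val == "*") cur_star = 1
      next
    }
    key == "allow" || key == "disallow" {
      in_rules = 1
      if (cur_named) named = named key ": " val "\n"
      if (cur_star)  star  = star  key ": " val "\n"
    }
    END { printf "%s", (seen_named ? named : star) }
  '
}
```

With the file from the expected output above saved locally, `rules_for_bot robots.txt Googlebot` prints only `disallow: /admin/`, while `rules_for_bot robots.txt bingbot` prints the two rules from the * group.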

Step 3: Count the leaks for each Disallow pattern. The leak rate is the share of bot requests that hit a path you intended to block; the raw hit count per pattern is its numerator, and it is enough to tell a working rule from a broken one. Loop over your patterns and count:

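for p in "/cart/" "/search" "/admin/"; do
  # index(...) == 1 is a literal prefix match, so "?" or "." in a pattern is not read as regex
  hits=$(grep -E "Googlebot|bingbot" access.log | awk -v p="$p" 'index($7, p) == 1 {c++} END {print c+0}')
  echo "$p -> $hits blocked-path hits"
done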

Expected Output:

/cart/ -> 188 blocked-path hits
/search -> 439 blocked-path hits
/admin/ -> 0 blocked-path hits

Any non-zero count more than 48 hours after the edit, which leaves a full day of margin beyond the ~24-hour cache window, is a real leak. /admin/ at zero confirms that rule works; /search at 439 is wasting crawl budget on faceted noise. Feed these counts into your broader crawl budget waste diagnosis to prioritize which leaks cost the most.
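The counts above become a rate once they are divided by the total number of bot requests in the same log. A minimal sketch, assuming combined log format and user-agent matching only (no IP verification); the function name `leak_rate` is hypothetical:

```shell
# leak_rate LOGFILE PREFIX...
# For each prefix, prints: prefix hits/total-bot-requests percentage
leak_rate() {
  log=$1; shift
  for p in "$@"; do
    awk -v p="$p" '
      /Googlebot|bingbot/ { total++; if (index($7, p) == 1) hits++ }
      END { printf "%s %d/%d %.1f%%\n", p, hits, total, (total ? 100 * hits / total : 0) }
    ' "$log"
  done
}
```

For example, `leak_rate access.log /cart/ /search /admin/` prints one line per pattern. A percentage makes logs of different sizes comparable, which a raw count does not.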

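Step 4: Correlate the leak against the robots.txt edit time. Distinguish a broken rule from a not-yet-cached one by checking whether hits decay after you deployed the change. The filter below compares the timestamp field as a plain string, which is reliable only while every entry in the log falls in the same month as the deploy. If you deployed at 09:00 on Jun 17: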

awk '$4 > "[17/Jun/2026:09:00:00" && /Googlebot/ && $7 ~ /^\/search/' access.log | wc -l

Expected Output:

0

Zero hits after the edit means the rule works and earlier hits were pre-cache. A steady count after 24+ hours means the rule is genuinely broken — move to the edge cases below.
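When the log spans more than one month, the day-first timestamp no longer sorts correctly as text (01/Jul sorts before 17/Jun). One way around that, sketched here under the assumption that the log uses the combined-format timestamp and that the cutoff is given in the same time zone as the log, is to rebuild each timestamp as a sortable YYYYMMDDHHMMSS key. The function name `hits_since` and its argument order are hypothetical:

```shell
# hits_since LOGFILE PREFIX YYYYMMDDHHMMSS
# Counts Googlebot hits on PREFIX strictly after the cutoff.
hits_since() {
  awk -v p="$2" -v since="$3" '
    /Googlebot/ && index($7, p) == 1 {
      split(substr($4, 2), t, "[/:]")     # 17/Jun/2026:09:00:00 -> 17 Jun 2026 09 00 00
      mon = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[2]) + 2) / 3
      key = sprintf("%04d%02d%02d%02d%02d%02d", t[3], mon, t[1], t[4], t[5], t[6])
      if ((key "") > (since "")) c++      # force string comparison on equal-length keys
    }
    END { print c + 0 }
  ' "$1"
}
```

For the deploy in this step, the call would be `hits_since access.log /search 20260617090000`.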

Edge-Case Handling

Gotcha 1: The trailing-slash and prefix trap. Disallow: /search blocks /search, /search?q=x, and /searchresults because robots.txt matching is prefix-based, not path-segment-based. If your audit shows /searchresults unexpectedly blocked, that is the cause. A trailing slash narrows the rule the other way: Disallow: /search/ covers only URLs under that directory and leaves /search and /search?q=x crawlable. To block only the faceted endpoint, include the question mark: Disallow: /search? blocks query strings while leaving the bare page crawlable. Verify with the log grep from Step 3 after the cache clears.
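The prefix behavior is easy to check by hand before you publish a pattern. The sketch below implements only the literal prefix test and deliberately ignores the `*` and `$` wildcards that robots.txt also supports; the function name `blocked_by` is hypothetical:

```shell
# blocked_by PATTERN URL_PATH: literal prefix test, the core of robots.txt matching.
blocked_by() {
  case "$2" in
    "$1"*) echo "blocked" ;;     # quoted pattern, so "?" is literal, not a glob
    *)     echo "crawlable" ;;
  esac
}

# Compare the three pattern variants against the three URLs from this section.
for pattern in /search '/search?' /search/; do
  for url in /search '/search?q=x' /searchresults; do
    echo "Disallow: $pattern  $url  $(blocked_by "$pattern" "$url")"
  done
done
```

Only `Disallow: /search` blocks all three URLs; the other two patterns each leave the bare /search page and /searchresults crawlable.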

Gotcha 2: A more specific Allow silently re-enables crawling. Google applies the most specific (longest) matching rule, regardless of file order. A group containing both Disallow: /search and Allow: /search/help will let /search/help through. If your leak is confined to one sub-path under a disallowed directory, grep for it specifically, because the "leak" may be a deliberate Allow you forgot about:

grep "Googlebot" access.log | awk '$7 ~ /^\/search\/help/ {c++} END{print (c+0)" allowed sub-path hits"}'

Expected Output:

612 allowed sub-path hits
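The precedence rule can be sketched as a longest-prefix comparison. This is an illustration under simplifying assumptions, not Google's parser: it handles literal prefixes only (no `*` or `$`), reads one group's rules from standard input, and resolves equal-length ties in favor of Allow. The function name `verdict` is hypothetical:

```shell
# verdict URL_PATH   (rules on stdin, one per line: "Allow: /x" or "Disallow: /y")
# Prints "blocked" or "crawlable" according to the longest matching rule.
verdict() {
  awk -v path="$1" '
    {
      kind = tolower($1); sub(/:$/, "", kind); pat = $2
      if (pat == "" || index(path, pat) != 1) next     # empty or non-matching rule
      len = length(pat)
      if (len > best || (len == best && kind == "allow")) { best = len; win = kind }
    }
    END { print (win == "disallow" ? "blocked" : "crawlable") }
  '
}

printf 'Disallow: /search\nAllow: /search/help\n' | verdict /search/help
printf 'Disallow: /search\nAllow: /search/help\n' | verdict '/search?q=x'
```

The first call prints `crawlable` because the Allow rule is the longer match; the second prints `blocked` because only the Disallow rule matches.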

Verification

After fixing the rule (correct group, corrected pattern, or restored 200 on the file), confirm the leak closes. Run a single command that returns the post-fix bot hit count on the blocked prefix across the current log and the most recently rotated one; success is a flat zero once the cache window passes:

zgrep -hE "Googlebot|bingbot" access.log access.log.1 | awk '$7 ~ /^\/search/ {c++} END{print (c+0)" residual blocked-path hits"}'

Expected Output:

0 residual blocked-path hits

A clean zero confirms the directive is now enforced end to end. If you used awk and grep commands for log filtering to build these one-liners, schedule the verification grep as a weekly cron so a future robots.txt regression surfaces in the log before it drains crawl budget.
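For a scheduled run, it helps if the check signals failure through its exit status so cron or a monitoring wrapper can react. A minimal sketch, assuming uncompressed log files (swap `cat` for a decompressing reader if your rotated logs are gzipped) and user-agent matching only; the function name `leak_check` is hypothetical:

```shell
# leak_check PREFIX LOGFILE...
# Prints the residual hit count and returns non-zero when any bot hit matches.
leak_check() {
  prefix=$1; shift
  count=$(cat "$@" | awk -v p="$prefix" '
    /Googlebot|bingbot/ && index($7, p) == 1 { c++ }
    END { print c + 0 }')
  echo "$count residual blocked-path hits on $prefix"
  [ "$count" -eq 0 ]
}
```

Saved in a script that ends with `leak_check /search access.log access.log.1`, this could be scheduled weekly with a crontab line such as `0 6 * * 1 /usr/local/bin/robots-leak-check.sh`, where the script path is a placeholder.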

Common Mistakes

  • Trusting the Search Console robots.txt tester over the access log. The tester evaluates syntax against a single URL; it cannot see that a CDN served the file as a 404 to crawler IPs. Logs are the only proof of what the crawler actually received.
  • Expecting Disallow to remove a URL from the index. Blocking crawling can freeze a stale snippet in place. To de-index, allow crawling and serve noindex — confirm the bot can still reach the page in your log first.
  • Auditing before the cache window clears. Counting leaks within 24 hours of an edit produces false positives. Always correlate hits against the deploy timestamp (Step 4) before declaring a rule broken.

Frequently Asked Questions

Why is Googlebot still crawling a path I blocked in robots.txt?
The four common causes are a cached robots.txt (Google caches it for up to ~24 hours), a rule placed under the wrong User-agent group, a syntax or precedence error such as a more specific Allow overriding the Disallow, or a robots.txt that returns a 4xx status such as 404 to crawler IPs, which Google treats as having no rules at all. Grep the blocked path in your access log, confirm /robots.txt itself serves a 200, and correlate hit times against your deploy timestamp to identify which one applies.

Does Disallow in robots.txt remove a page from Google's index?
No. Disallow blocks crawling, not indexing. A disallowed URL can still appear in search results — without a snippet — if other pages link to it, because Google indexes the URL without fetching it. To remove a page from the index, you must let it be crawled and serve a noindex directive instead.

How do I calculate the robots.txt leak rate from logs?
Filter the log to verified bot user agents, match each request URL against your Disallow prefixes, and count the matches per pattern after the cache window has passed. A non-zero count on a path you intended to block is the leak; a zero count confirms the rule is enforced. Track these counts over time to catch regressions.

Part of the Robots.txt & Crawl Rate Control series.