Robots.txt & Crawl Rate Control

Robots.txt directives and crawl-rate signals are only as effective as the crawler behavior they actually produce. The authoritative record of what a bot did — not what you asked — lives in your server logs. This guide shows how to use access logs to verify that Disallow rules, sitemap lastmod signals, and server responses (response time, 5xx, 503/429) are genuinely shaping how Googlebot, Bingbot, and Yandex pace their requests.

You will build a log-capture environment that records bot path, status, and response time; run a numbered pipeline to catch Disallow leaks, correlate slow or failing responses with crawl-rate drops, and confirm sitemap URLs are crawled; map each directive to its expected log signature; and recover from named failure modes. A core principle threads through all of it: blocked is not deindexed — a Disallow stops crawling, not indexing, and only your logs can prove the crawl actually stopped. For the broader strategy this fits into, start at the Crawl Budget Optimization & Bot Management pillar.

Prerequisites

Before correlating directives with crawler behavior, confirm the following are in place:

Read access to raw access logs containing the user-agent and request fields, ideally with a response-time field enabled (see Environment Setup).
A verified bot identity method. User-agent strings are trivially spoofed; pair them with reverse-DNS verification before trusting any "Googlebot" line.
A current robots.txt and XML sitemap whose contents you can diff against observed log activity.
Familiarity with status-code semantics, especially the difference between 200, 301, 404, 429, and 503. Review understanding HTTP status codes in server logs if those distinctions are fuzzy.
Comfort with CLI text processing. Most checks here are awk/grep pipelines; the CLI one-liners for quick audits cluster covers the underlying syntax.

The Crawl-Rate Control Loop

Crawl rate is not a static budget you set once. It is the output of a feedback loop: your server emits signals (response latency, error rates, robots.txt, sitemap freshness), the crawler adjusts its request rate in response, and that adjusted rate lands back in your logs as the next observation. The diagram below shows the signals, the crawler's scheduler, and the monitoring loop that closes back to your log store.

Environment Setup: Capturing Bot Path, Status, and Response Time

The default combined log format records the path and status but not response time — the single most important signal for understanding crawl-rate adaptation. Without it you cannot prove that a latency spike, rather than a robots.txt edit, throttled Googlebot. Extend your format first.

Step 1: Add Response Time to the Nginx Log Format
Define a dedicated crawl-analysis format that appends $request_time (and upstream time, if proxied) to the combined fields.

log_format crawlctrl '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     'rt=$request_time uct="$upstream_connect_time"';

access_log /var/log/nginx/access.log crawlctrl;

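Explanation: rt=$request_time records, in seconds with millisecond resolution, the time from reading the first bytes of the request to writing the log entry after the last bytes are sent to the client. Latency measured this way is the signal Google's crawl scheduler implicitly reacts to.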
Expected Output: New log lines end with a parseable suffix such as rt=0.043 uct="0.011".

Production Warning: Changing log_format and reloading Nginx (nginx -t && systemctl reload nginx) does not lose buffered requests, but any downstream parser keyed to a fixed column count will break. Update field offsets in your pipeline before the reload, not after.
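One way to make a downstream parser tolerant of the format change is to look up the rt= token by key rather than by column number. The following is a minimal sketch, assuming the crawlctrl format above; the helper name extract_rt and the sample line are illustrative, not part of any existing tool.

```shell
# Hypothetical helper: print the rt= value for each log line on stdin,
# wherever the token sits in the line (prints "-" if it is absent).
extract_rt() {
  awk '{
    val = "-"
    for (i = NF; i >= 1; i--)            # scan from the end; rt= sits near the tail
      if ($i ~ /^rt=[0-9.]+$/) { val = substr($i, 4); break }
    print val
  }'
}

# Illustrative line in the crawlctrl format (all values are made up)
sample='66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /products/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)" rt=0.043 uct="0.011"'
printf '%s\n' "$sample" | extract_rt     # prints 0.043
```

Scanning from the end of the line means a stray rt= inside a user-agent string cannot shadow the real timing field.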

Step 2: Confirm the Field Is Being Written
Verify the new format is live and the response-time token is present.

tail -n 3 /var/log/nginx/access.log | grep -o 'rt=[0-9.]*'

Expected Output: Three lines like rt=0.043, rt=1.872, rt=0.005.

Step 3: Isolate Verified Bot Traffic
Filter to crawler lines so all later steps operate on bots only. Pair the user-agent filter with reverse DNS for any line you intend to act on; spoofed agents otherwise pollute the rate analysis.

grep -iE 'googlebot|bingbot|yandex' /var/log/nginx/access.log \
  | awk -v OFS='\t' '{print $1, $7, $9, $(NF-1)}' > /tmp/bot_activity.tsv
head -3 /tmp/bot_activity.tsv

Explanation: Extracts client IP ($1), path ($7), status ($9), and the rt= field into a tab-separated working file. In the crawlctrl format the last field ($NF) is uct=..., so rt= is the second-to-last field, $(NF-1).
Expected Output: Tab-separated rows such as 66.249.66.1 /products/a 200 rt=0.043, ready for aggregation.
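The reverse-DNS pairing mentioned in this step can be sketched as a forward-confirmed lookup. This is a rough sketch, not a definitive implementation: the function names are hypothetical, the domain suffix list is an assumption to check against each engine's current documentation, and verify_bot_ip needs the host utility plus network access.

```shell
# Hypothetical helper: succeed if a PTR hostname ends in a known crawler domain.
# The suffix list is an assumption; confirm it against each engine's documentation.
is_crawler_hostname() {
  case "${1%.}" in                       # strip the trailing dot PTR answers carry
    *.googlebot.com|*.google.com)          return 0 ;;
    *.search.msn.com)                      return 0 ;;
    *.yandex.ru|*.yandex.net|*.yandex.com) return 0 ;;
    *)                                     return 1 ;;
  esac
}

# Forward-confirmed reverse DNS: PTR lookup, suffix check, then the hostname
# must resolve back to the same IP. Requires `host` and network access.
verify_bot_ip() {
  ip=$1
  name=$(host "$ip" 2>/dev/null | awk '/domain name pointer/ {print $NF; exit}')
  [ -n "$name" ] || return 1
  is_crawler_hostname "$name" || return 1
  host "$name" 2>/dev/null | awk -v ip="$ip" '$NF == ip {ok=1} END {exit ok ? 0 : 1}'
}
```

A possible use is to loop over the distinct client IPs in the working file, for example awk '{print $1}' /tmp/bot_activity.tsv | sort -u, and drop or flag any IP for which verify_bot_ip fails.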

Pipeline Configuration: Verifying Directives Against Behavior

With bot activity captured, run these three numbered checks on a schedule. Each produces a concrete signal you can alert on. Treat them as a recurring job, not a one-off.

1. Measure Hit Rate Against Disallowed Paths (Catch Leaks)
A Disallow rule is a request to stop crawling, not a guarantee. Measure how many bot requests still land on disallowed prefixes after a reasonable propagation window (Google re-fetches robots.txt roughly every 24 hours).

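# disallowed.txt holds one Disallow prefix per line, e.g. /cart/ on one line and /search on the next
# Matching is by literal prefix at the start of the path; rules using * or $ need separate handling
grep -i 'googlebot' /var/log/nginx/access.log \
  | awk 'NR==FNR { if ($1 != "") pre[$1]; next }
         { for (p in pre) if (index($7, p) == 1) { print $7; next } }' disallowed.txt - \
  | sort | uniq -c | sort -nr | head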

Explanation: Cross-references every Googlebot-requested path against your active Disallow prefixes and counts hits. Any non-trivial count means the directive is not yet — or not correctly — being honored.
Expected Output (healthy): Empty, or a handful of hits decaying to zero over 48 hours.
Expected Output (leak): 412 /search?q=..., indicating sustained crawling of a blocked path.

Production Warning: Do not "fix" a leak by tightening robots.txt to a sitewide Disallow: / under time pressure. That is the single most damaging robots.txt mistake (see Validation). Verify the specific prefix and propagation window first.
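To judge whether hits on a blocked prefix are actually decaying across the propagation window, a per-day count is often enough. The sketch below assumes the standard field positions used throughout this guide (timestamp in $4, path in $7); hits_per_day and the sample lines are hypothetical.

```shell
# Hypothetical helper: count Googlebot hits per day on one path prefix.
# Reads access-log lines on stdin; assumes timestamp in $4 and path in $7.
hits_per_day() {
  awk -v prefix="$1" '
    tolower($0) ~ /googlebot/ && index($7, prefix) == 1 {
      day = substr($4, 2, 11)            # "[10/Oct/2024:13:55:36" -> "10/Oct/2024"
      if (!(day in n)) order[++k] = day  # keep first-seen order (logs are chronological)
      n[day]++
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }'
}

# Made-up sample: three /search hits on day one, one on day two
sample_log='66.249.66.1 - - [10/Oct/2024:09:00:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:09:05:01 +0000] "GET /search?q=b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:09:06:01 +0000] "GET /products/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:10:00:01 +0000] "GET /search?q=c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [11/Oct/2024:09:00:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
157.55.39.1 - - [11/Oct/2024:09:01:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"'
printf '%s\n' "$sample_log" | hits_per_day /search
```

On the sample this prints 10/Oct/2024 3 followed by 11/Oct/2024 1, the shape of a healthy decay. A flat or rising series after 48 hours points to a leak.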

2. Correlate 5xx / Slow Responses with Crawl-Rate Drops
Google reduces crawl rate when it detects server distress. Bucket bot requests by hour, alongside the error count and average response time, to see whether a rate drop tracks a health degradation.

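awk '
  tolower($0) ~ /googlebot/ {            # case-insensitive: the UA string is "Googlebot"
    hour = substr($4, 2, 14);            # "[10/Oct/2024:13:55:36" -> "10/Oct/2024:13"
    cnt[hour]++;
    if ($9 ~ /^5/) err[hour]++;
    rt = substr($(NF-1), 4) + 0;         # rt= is second-to-last; $NF is uct=
    sum[hour] += rt;
  }
  END {
    for (h in cnt)
      printf "%s req=%d 5xx=%d avg_rt=%.3f\n", h, cnt[h], err[h], sum[h]/cnt[h];
  }' /var/log/nginx/access.log | sort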

Explanation: Groups Googlebot activity per hour and reports request count, 5xx count, and average response time so a rate collapse can be lined up against rising latency or errors. For hour-by-hour granularity and visualization, the dedicated measuring crawl rate by hour from server logs guide extends this.
Expected Output (illustrative values, keyed by date and hour):

10/Oct/2024:14 req=820 5xx=0 avg_rt=0.052
10/Oct/2024:15 req=805 5xx=2 avg_rt=0.061
10/Oct/2024:16 req=240 5xx=140 avg_rt=2.910   <- rate drop tracking 5xx + latency spike

Production Warning: If a 5xx spike correlates with a rate drop, the fix is server capacity or fixing the failing endpoint — not editing robots.txt. Throttling via robots while the server is unhealthy compounds the loss of crawl budget.
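To turn the hourly report into something you can alert on, one option is to flag hours where the request count falls sharply while errors or latency are elevated. This is a sketch under stated assumptions: flag_distress is a hypothetical name, the input is report lines in the key req= 5xx= avg_rt= shape printed above, and the thresholds (a drop below half the previous hour, a 5xx share above 10%, average latency above 1 second) are arbitrary starting points to tune.

```shell
# Hypothetical helper: flag hours whose request count fell below half of the
# previous hour while the 5xx share exceeded 10% or avg_rt exceeded 1s.
# All three thresholds are assumptions; tune them to your own baseline.
flag_distress() {
  awk '
    {
      split($2, r, "="); split($3, e, "="); split($4, t, "=")
      req = r[2] + 0; err = e[2] + 0; rt = t[2] + 0
      if (NR > 1 && req > 0 && req < prev / 2 && (err / req > 0.10 || rt > 1.0))
        print $1, "rate drop with server distress"
      prev = req
    }'
}

# Made-up report lines in the same shape as the hourly output
report='10/Oct/2024:14 req=820 5xx=0 avg_rt=0.052
10/Oct/2024:15 req=805 5xx=2 avg_rt=0.061
10/Oct/2024:16 req=240 5xx=140 avg_rt=2.910'
printf '%s\n' "$report" | flag_distress
```

On the sample only the third hour is flagged. A rate drop with no matching distress is not flagged, which is the case where a robots.txt or sitemap change is the more likely cause.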

3. Validate That Sitemap URLs Are Actually Crawled
A sitemap with fresh lastmod values is a suggestion of priority. Confirm the URLs you nominated are being fetched, and that recently changed ones get re-fetched.

# sitemap_urls.txt = one path per line, extracted from sitemap.xml
comm -23 \
  <(sort -u sitemap_urls.txt) \
  <(grep -iE 'googlebot' /var/log/nginx/access.log | awk '{print $7}' | sort -u)

Explanation: Lists sitemap paths that Googlebot has not requested in the log window — your candidate orphan or low-priority set. An empty result means full sitemap coverage.
Expected Output: A list of un-crawled sitemap paths, or empty if every nominated URL was fetched.

Production Warning: Bumping lastmod on every URL at every deploy trains crawlers to ignore the signal. Only update lastmod when content genuinely changes, or the freshness signal loses all weight.
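The sitemap_urls.txt input used in this check can be produced with a small extraction step. The sketch below is one possible approach, assuming each <loc> element sits on a single line and the URLs need no entity decoding; sitemap_paths and the sample sitemap are hypothetical.

```shell
# Hypothetical helper: print one URL path per line from a sitemap on stdin.
# Assumes each <loc>...</loc> is on one line; XML entities are left untouched.
sitemap_paths() {
  tr -d '\r' \
    | grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/<loc>[[:space:]]*//' -e 's/[[:space:]]*<\/loc>//' \
          -e 's|^[a-zA-Z]*://[^/]*||' -e 's|^$|/|' \
    | sort -u
}

# Made-up sitemap for illustration
sample_sitemap='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc><lastmod>2024-10-01</lastmod></url>
  <url><loc>https://example.com/products/b</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>'
printf '%s\n' "$sample_sitemap" | sitemap_paths
```

On the sample this prints /, /products/a, and /products/b, one per line. Redirect the output to sitemap_urls.txt to feed the comm comparison. For a sitemap index file, the same function lists the child sitemap paths rather than page paths.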

Parsing Logic & Field Mapping: Directive to Log Signature

Each control mechanism leaves a distinct, verifiable trace in the logs. Use this table to know exactly what "working correctly" looks like before you go hunting for a problem. The key columns are the directive, the engines that honor it, and the log signature that proves compliance.

Directive / Signal	Intended effect	Honored by	Expected log signature when working	Failure signature
`Disallow: /path`	Stop crawling the prefix	Google, Bing, Yandex	Bot hits on `/path` decay to ~0 within ~24–48h	Sustained `200`/`404` bot hits on `/path`
`Allow: /path/keep`	Carve-out inside a Disallow	Google, Bing, Yandex	Continued bot hits on `/path/keep` only	No hits, or hits across whole parent
`Crawl-delay: 10`	Min seconds between requests	Bing, Yandex (NOT Google)	Bing/Yandex inter-request gap ≥10s	Google ignores it; gap unchanged
`503 Service Unavailable` + `Retry-After`	Temporary throttle / pause	Google, Bing, Yandex	Bot backs off, retries after the interval	Crawl stops entirely if 503 persists >2 days
`429 Too Many Requests`	Rate-limit signal	Google, Bing, Yandex	Reduced request rate within hours	Treated like 5xx if returned for static URLs
Slow `$request_time`	Implicit health signal	Google (adaptive)	Crawl rate scales down as latency rises	None — silent, only visible by correlation
Sitemap `<lastmod>`	Re-crawl priority hint	Google, Bing	Recently changed URLs re-fetched sooner	Ignored if lastmod is always "now"

SEO callout — blocked is not deindexed. A Disallow prevents crawling, so Google cannot see a noindex tag on that page. Pages blocked in robots.txt can still appear in results (URL-only, no snippet) if they are linked elsewhere. To remove a URL from the index, allow crawling and serve noindex, or use the Removals tool — never rely on Disallow alone.

SEO callout — throttle choice. To slow a crawler for a few hours during a deploy or incident, return 503 with Retry-After for the affected URLs; it is honored by all major engines and is reversible. Reserve robots.txt for permanent crawl-scope decisions. Using Disallow as a temporary throttle risks the directive outliving the incident and silently starving crawl budget.
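When using the 503 approach, it helps to confirm that the throttled URLs really do send Retry-After. The following sketch parses a header dump, such as the output of curl -sI, from stdin; check_throttle_headers is a hypothetical name and the sample response is made up.

```shell
# Hypothetical helper: read HTTP response headers on stdin and report whether
# a 503/429 response carries Retry-After. Exits 1 only when it is missing.
check_throttle_headers() {
  tr -d '\r' | awk '
    NR == 1 { status = $2 }              # status line, e.g. "HTTP/1.1 503 ..."
    tolower($1) == "retry-after:" {
      line = $0; sub(/^[^:]*:[ \t]*/, "", line); retry = line
    }
    END {
      if (status != "503" && status != "429") { print "status " status ": not throttling"; exit 0 }
      if (retry == "") { print "status " status ": missing Retry-After"; exit 1 }
      print "status " status ": Retry-After " retry; exit 0
    }'
}

# Made-up response headers for illustration
printf 'HTTP/1.1 503 Service Unavailable\r\nRetry-After: 3600\r\n\r\n' | check_throttle_headers
```

Against a live site the equivalent call would be curl -sI on one of the affected URLs piped into check_throttle_headers. Only the first status line is inspected, so responses that follow redirects are not handled.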

Validation & Troubleshooting

Run these named recovery recipes when a check above misbehaves. Each maps a symptom to a root cause and a verification command.

Failure mode: disallowed paths still being crawled after Disallow.
Bot hits on a disallowed path are not decaying. The usual causes are (a) the robots.txt cache hasn't refreshed, (b) a syntax error voids the rule, or (c) the path is reachable via a different host/protocol with its own robots.txt.

# Confirm the live file actually serves the rule (200, not a soft 404 page)
curl -sI https://example.com/robots.txt | head -1
curl -s  https://example.com/robots.txt | grep -nE 'Disallow|User-agent'

Expected Output: HTTP/2 200 and the exact Disallow: line under the correct User-agent: group. If the file 404s or returns HTML, every rule is ignored and the "leak" is the expected behavior. Allow up to 48 hours after a fix before re-measuring.

Failure mode: accidental sitewide Disallow.
A deploy ships Disallow: / (often a staging file overwriting production). Crawling collapses across the whole site within a day.

# Alarm if the live robots.txt blocks the root
curl -s https://example.com/robots.txt \
  | awk 'tolower($0) ~ /^disallow:[[:space:]]*\/[[:space:]]*$/ {print "ALERT: sitewide Disallow"; f=1} END{exit f?1:0}'

Expected Output: No output and exit code 0 when safe; ALERT: sitewide Disallow when the catastrophic rule is live. Wire this into CI and into a post-deploy health check so it can never ship silently.

Production Warning: A sitewide Disallow: / does not deindex immediately, but it stops Google from seeing canonical, sitemap, and content updates. Treat it as a P1 incident — fix the file, then expect a multi-day recovery as crawl rate ramps back up.

Failure mode: 503 storms suppressing crawl.
A flapping backend returns 503/5xx intermittently. Google interprets a sustained run of 503s (beyond ~2 days) as a signal to drop the URLs, not just pause them.

grep -i 'googlebot' /var/log/nginx/access.log \
  | awk '$9 ~ /^5[0-9][0-9]$/ {c++} END{print "bot 5xx hits:", c+0}'

Explanation: Counts 5xx responses served to Googlebot. A persistent, growing count across days is the danger zone. Confirm the backend is healthy with the per-hour correlation from Pipeline step 2. The fix is server-side; do not paper over it with robots.txt.
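Because the danger is persistence across days, a per-day breakdown is more telling than a single total. The sketch below assumes the usual field positions (timestamp in $4, status in $9); fivexx_per_day and the sample lines are hypothetical.

```shell
# Hypothetical helper: per-day Googlebot request count, 5xx count, and 5xx share.
# Reads access-log lines on stdin; assumes timestamp in $4 and status in $9.
fivexx_per_day() {
  awk '
    tolower($0) ~ /googlebot/ {
      day = substr($4, 2, 11)            # "[10/Oct/2024:13:55:36" -> "10/Oct/2024"
      if (!(day in total)) order[++k] = day
      total[day]++
      if ($9 ~ /^5[0-9][0-9]$/) bad[day]++
    }
    END {
      for (i = 1; i <= k; i++) {
        d = order[i]
        printf "%s total=%d 5xx=%d share=%.1f%%\n", d, total[d], bad[d], 100 * bad[d] / total[d]
      }
    }'
}

# Made-up sample: one 503 out of four on day one, two 5xx out of two on day two
sample_log='66.249.66.1 - - [10/Oct/2024:09:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:09:01:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:09:02:01 +0000] "GET /c HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [10/Oct/2024:09:03:01 +0000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [11/Oct/2024:09:00:01 +0000] "GET /a HTTP/1.1" 503 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
66.249.66.1 - - [11/Oct/2024:09:01:01 +0000] "GET /b HTTP/1.1" 502 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"'
printf '%s\n' "$sample_log" | fivexx_per_day
```

On the sample the share climbs from 25.0% to 100.0% across two days, which is the pattern that warrants escalation before the roughly two-day window closes.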

Failure mode: crawl-delay ignored.
You set Crawl-delay: 10 to throttle Google and nothing changes. This is expected: Google does not honor Crawl-delay. Confirm which engine you are actually looking at, then throttle Google with 503 or 429 responses, ideally with Retry-After. The Search Console crawl-rate limiter was retired in early 2024, so it is no longer an option.

# Inter-request gap per engine (seconds between consecutive bot requests, within one day of logs)
for bot in googlebot bingbot yandex; do
  printf '%s median gap: ' "$bot"
  grep -i "$bot" /var/log/nginx/access.log \
    | awk '{split(substr($4,2),t,":"); print t[2]*3600+t[3]*60+t[4]}' \
    | awk 'NR>1{print $1-p} {p=$1}' \
    | sort -n | awk '{a[NR]=$1} END{if (NR) print a[int((NR+1)/2)]"s"; else print "n/a"}'
done

Expected Output: bingbot median gap: 10s, yandex median gap: 10s, but googlebot median gap: 1s — proving Google ignored the directive and must be throttled another way.

Common Mistakes

Treating Disallow as deindexing. Blocking a URL in robots.txt prevents Google from crawling it and therefore from seeing a noindex tag, so the URL can persist in the index URL-only. Fix: to remove a page, allow crawling and serve noindex or 410.
Using robots.txt for temporary throttling. A Disallow added during an incident is easy to forget, and it silently starves crawl budget for weeks. Fix: throttle with 503 + Retry-After, which is self-expiring and honored by all major engines.
Setting Crawl-delay to slow Googlebot. Google ignores Crawl-delay entirely; only Bing and Yandex honor it. Fix: use response-based signals for Google (503/429 under load). The Search Console crawl-rate limiter was retired in early 2024.
Bumping sitemap lastmod on every deploy. Constant "fresh" timestamps train crawlers to distrust the signal, so genuinely updated pages lose their re-crawl priority. Fix: update lastmod only on real content changes.
Diagnosing crawl drops without the response-time field. Without $request_time in the log you cannot tell whether a robots.txt edit or a latency spike caused a rate drop. Fix: add the field to your log format first, as in Environment Setup.

Frequently Asked Questions

Does blocking a URL in robots.txt remove it from Google's index?
No. Disallow stops crawling, not indexing. A blocked URL that is linked from elsewhere can still appear in search results without a snippet, because Google never crawled it to see your noindex. To deindex, allow crawling and serve a noindex meta tag or 410, or use the Removals tool.

Should I use a 503 response or robots.txt to throttle a crawler temporarily?
Use 503 Service Unavailable with a Retry-After header for anything temporary — a deploy, a load spike, or an incident. It is honored by Google, Bing, and Yandex, and it expires on its own. Reserve robots.txt Disallow for permanent crawl-scope decisions, since a forgotten temporary Disallow quietly wastes crawl budget.

Why does Googlebot ignore my Crawl-delay directive?
Google has never supported Crawl-delay; only Bing and Yandex honor it. To slow Googlebot, return 503/429 under load, or improve response time so the adaptive scheduler naturally paces itself. The crawl-rate limiter in Search Console was retired in early 2024. You can confirm the directive is ignored by measuring per-engine inter-request gaps in your logs.

How long after editing robots.txt should I expect the crawl pattern to change?
Google typically re-fetches robots.txt about once every 24 hours, so allow 24–48 hours before judging whether a Disallow is being honored. If disallowed paths are still hit after that window, verify the file returns HTTP 200 with the correct syntax under the right User-agent group rather than assuming the rule failed.

Related Guides

Auditing robots.txt Effectiveness with Server Logs — the deep-dive leak audit that this guide's check #1 introduces.
Measuring Crawl Rate by Hour from Server Logs — hourly bucketing and visualization for the rate-correlation work above.
Diagnosing Crawl Budget Waste — where leaked and orphaned crawl activity gets quantified against value.
Understanding HTTP Status Codes in Server Logs — the 503/429/5xx semantics this page relies on.
CLI One-Liners for Quick Audits — the awk/grep foundations behind every pipeline here.

Part of the Crawl Budget Optimization & Bot Management series.
