Measuring Crawl Rate by Hour from Server Logs
Crawl rate is the heartbeat of how Google treats your site. When you change robots.txt, ship a slow release, or get hit with a wave of 5xx errors, Googlebot responds by speeding up or backing off, and the only place that response is recorded in full is your access log. (Googlebot ignores the crawl-delay directive, so adding one should leave this series flat; other crawlers may honor it.) This guide shows how to split the hour out of each log timestamp, count verified Googlebot requests per hour and per day, normalize the log's UTC time to your reporting timezone, and emit a CSV you can chart to spot the spikes and drops that matter.
This sits inside robots.txt and crawl rate control: you cannot tell whether a directive worked without a before-and-after crawl-rate series. By the end you will have an hourly time series, a daily rollup, and a decision rule for what a sudden change means.
Diagnosis: The Hourly Signal Hiding in Timestamps
A single access-log line carries the timestamp you need, wrapped in brackets in the fourth field of the combined format:
66.249.66.1 - - [19/Jun/2026:14:07:33 +0000] "GET /products/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The hour you want is 14, buried in [19/Jun/2026:14:07:33 +0000]. The +0000 tells you the clock is UTC — almost always true for production web servers, and the reason a "midnight spike" in your local time can look like a 2 p.m. event in the raw log. The task is to extract that hour reliably, count only real Googlebot, and shift to the timezone your stakeholders read charts in.
Confirm the timestamp format before parsing anything:
awk '{print $4}' access.log | head -3
Expected Output:
[19/Jun/2026:14:07:33
[19/Jun/2026:14:07:35
[19/Jun/2026:14:08:01
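Eyeballing three lines can miss a stray malformed entry further down the file, so it may help to count the lines that do not match the expected layout. The following is a minimal sketch, assuming the combined format shown above with a +0000 offset in field 5; the sample lines and the variable names sample and bad are illustrative only.

```shell
# Sample lines for illustration only; point the awk command at your real access.log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
66.249.66.1 - - [19/Jun/2026:14:07:33 +0000] "GET /products/ HTTP/1.1" 200 5123 "-" "Googlebot/2.1"
66.249.66.2 - - [19/Jun/2026:14:07:35 +0000] "GET /about/ HTTP/1.1" 200 812 "-" "Googlebot/2.1"
66.249.66.3 - - 19/Jun/2026:14:08:01 +0200 "GET /odd/ HTTP/1.1" 200 10 "-" "Googlebot/2.1"
EOF

# Count lines whose field 4 is not [DD/Mon/YYYY:HH:MM:SS or whose offset is not +0000.
bad=$(awk '
  $4 !~ /^\[[0-9][0-9]\/[A-Z][a-z][a-z]\/[0-9][0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ ||
  $5 != "+0000]" { n++ }
  END { print n + 0 }' "$sample")
echo "lines with unexpected timestamp layout: $bad"
rm -f "$sample"
```

A non-zero count suggests the fixed character offsets used later need adjusting, or that the odd lines should be filtered out first.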
Concept: Why Verified Counts, Not Raw User-Agent Counts
If you count every line whose user-agent says "Googlebot," you will overcount. A large share of "Googlebot" traffic is spoofed by scrapers and competitors borrowing the string. A crawl-rate chart built on the raw user-agent therefore measures impostors as much as Google, and a "spike" may just be a scraper run. Filter to traffic whose source IP resolves to googlebot.com/google.com via reverse-then-forward DNS, and your series reflects Google's actual behavior. We approximate this cheaply with Google's published IP prefixes; for the full forward-confirmed method, pair this with auditing robots.txt effectiveness with server logs, which leans on the same verified-bot foundation.
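The reverse-then-forward check described above can be sketched as a pair of shell functions. This is a minimal sketch, assuming the host utility is installed and DNS is reachable; the function names is_google_host and verify_googlebot_ip are hypothetical and not part of any tool.

```shell
# Pure string check: does a PTR hostname belong to a Google crawler domain?
is_google_host() {
  case "${1%.}" in            # strip the trailing dot that DNS tools print
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Forward-confirmed reverse DNS for one IP (needs network access and the host tool).
verify_googlebot_ip() {
  ip=$1
  ptr=$(host "$ip" 2>/dev/null | awk '/domain name pointer/ { print $NF; exit }')
  [ -n "$ptr" ] && is_google_host "$ptr" || return 1
  # The forward lookup must return the original IP, or the PTR record could be forged.
  host "$ptr" 2>/dev/null | awk -v ip="$ip" '$NF == ip { ok = 1 } END { exit !ok }'
}

# Usage (commented out because it performs live DNS lookups):
# verify_googlebot_ip 66.249.66.1 && echo verified || echo "not Google"
```

Each call costs two DNS lookups, so in practice you would run it once per distinct source IP rather than once per log line.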
The payoff is interpretive. Googlebot adjusts crawl rate based on two things: how fast your server responds, and how many errors it sees. A series of 5xx or slow responses makes Google throttle down to protect your server; the hourly chart shows the crawl rate falling in the hours after the errors. A successful, fast site earns a higher steady rate. Reading the curve against your deploy and error timeline is the whole point.
Step-by-Step: Build the Hourly Crawl-Rate Series
Step 1: Isolate verified Googlebot requests. Match the user-agent, then restrict to Google's IP ranges so spoofers are excluded. Keep the source IP, the timestamp, and the status for later analysis.
grep -F 'Googlebot' access.log \
| grep -E '^(66\.249\.(6[4-9]|7[0-9]|8[0-9]|9[0-5])\.|34\.|35\.)' \
> googlebot.log
wc -l googlebot.log
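Explanation: grep -F does a fast fixed-string user-agent match; the second grep keeps the published 66.249.64.0/19 block plus any address starting with 34. or 35. Those last two alternations each match an entire /8 network, which contains ordinary cloud-hosted machines as well as Googlebot, so treat them as a coarse approximation and swap in the exact prefixes from Google's published Googlebot IP list before trusting a spike. Expected Output: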
48213 googlebot.log
Step 2: Split the hour out of the timestamp. The awk substring trick takes field 4 ([19/Jun/2026:14:07:33) and pulls the two characters at the hour position.
awk '{ ts=$4; print substr(ts, 14, 2) }' googlebot.log | head -3
Explanation: In [19/Jun/2026:14:07:33, character 14 is the 1 of 14. substr(ts,14,2) yields the hour. Expected Output:
14
14
15
Step 3: Count requests per hour (UTC). Add the date so each hour bucket is unique across days, then tally.
awk '{ split($4, a, /[\/:\[]/); print a[2]"/"a[3]"/"a[4]" "a[5]":00" }' googlebot.log \
| sort | uniq -c | head
Explanation: split($4, a, /[\/:\[]/) breaks [19/Jun/2026:14:07:33 on [, / and :, giving a[2]=19, a[3]=Jun, a[4]=2026, a[5]=14 (a[1] is the empty string before the bracket). We rebuild a DD/Mon/YYYY HH:00 bucket. Expected Output:
1881 19/Jun/2026 13:00
2043 19/Jun/2026 14:00
410 19/Jun/2026 15:00
1990 19/Jun/2026 16:00
The drop to 410 at 15:00 is the kind of signal you are hunting — read on to correlate it with errors.
Step 4: Correlate drops with 5xx and slow responses. Crawl rate falls after Google sees trouble. Count 5xx responses per hour alongside the request count.
awk '{ h=substr($4,14,2); total[h]++; if ($9 ~ /^5/) err[h]++ }
END { for (h in total) printf "%s\t%d req\t%d 5xx\n", h, total[h], err[h]+0 }' googlebot.log \
| sort
Explanation: One pass tallies total requests and 5xx count per hour. Expected Output:
13 1881 req 2 5xx
14 2043 req 61 5xx
15 410 req 0 5xx
16 1990 req 0 5xx
The story is clear: a burst of 5xx at 14:00 is followed by Google throttling to 410 requests at 15:00. To classify those error codes precisely, see understanding HTTP status codes in server logs.
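The "errors first, drop next" reading can be turned into a rough decision rule. The sketch below is one possible rule, not an established threshold: it flags an hour when requests fall below half of the previous hour and the previous hour's 5xx share was at least 1%. Both cutoffs (drop, errshare) and the simplified "hour requests errors" row shape are assumptions for illustration.

```shell
# Hourly rows in the shape "hour requests errors", using the figures shown above.
rows='13 1881 2
14 2043 61
15 410 0
16 1990 0'

# Flag an hour when requests fall below half of the previous hour
# and the previous hour's 5xx share was at least 1% (both thresholds are assumptions).
flags=$(printf '%s\n' "$rows" | awk -v drop=0.5 -v errshare=0.01 '
  NR > 1 && prev_req > 0 && $2 < prev_req * drop && prev_err / prev_req >= errshare {
    printf "%s:00 likely throttle (%d -> %d requests after %d 5xx)\n", $1, prev_req, $2, prev_err
  }
  { prev_req = $2; prev_err = $3 }')
printf '%s\n' "$flags"
```

Rows must be in chronological order for the previous-hour comparison to mean anything, and the thresholds should be tuned against your own baseline before anyone acts on a flag.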
Step 5: Roll up to a daily total. Stakeholders often want crawls-per-day as the headline metric.
awk '{ split($4, a, /[\/:\[]/); print a[2]"/"a[3]"/"a[4] }' googlebot.log \
| sort | uniq -c
Expected Output (from a googlebot.log built over two days; the Step 1 count of 48213 covered 19 June only):
46180 18/Jun/2026
48213 19/Jun/2026
Step-by-Step: Normalize Timezone and Export CSV
Step 6: Shift UTC to your reporting timezone. Logs are UTC; if you report in, say, US Eastern (UTC−4 in summer), a naive hour bucket misattributes traffic to the wrong local hour. Convert each timestamp before bucketing.
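# DST-aware but slow: one GNU date call per log line, converted via TZ
awk '{
ts=$4; sub(/^\[/,"",ts); d=substr(ts,1,11); gsub("/"," ",d);
cmd="TZ=America/New_York date -d \"" d " " substr(ts,13,8) " UTC\" +\"%Y-%m-%d %H:00\""
cmd | getline local; close(cmd); print local
}' googlebot.log | sort | uniq -c | head
# Practical version (produces the output below): fixed-offset arithmetic on the UTC hour
awk -v off=-4 '{
h=(substr($4,14,2)+off+24)%24;
printf "%02d:00\n", h
}' googlebot.log | sort | uniq -c | head
Explanation: The first command hands each timestamp to GNU date, which converts it to America/New_York with daylight saving applied, at the cost of one process per log line. The arithmetic version adds a fixed offset (-4) to the UTC hour and wraps with %24, which is fast and dependable for a single, known offset. It also discards the date, so an hour that wraps past midnight really belongs to the previous local day. For dates crossing daylight-saving boundaries, prefer a true conversion in your charting layer; the cleanest fix is to normalize log timestamp timezones in Python, where zoneinfo handles DST correctly. Safety Note: A fixed-offset shift silently breaks twice a year at the DST transition; never use it for compliance or billing series.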
Expected Output:
1881 09:00
2043 10:00
410 11:00
1990 12:00
Step 7: Emit a chartable CSV. Produce hour,requests,errors so any spreadsheet or plotting tool can render the time series.
awk 'BEGIN { print "datetime_utc,requests,server_errors" }
{ split($4,a,/[\/:\[]/); key=a[4]"-"a[3]"-"a[2]"T"a[5]":00";
req[key]++; if ($9 ~ /^5/) err[key]++ }
END { for (k in req) printf "%s,%d,%d\n", k, req[k], err[k]+0 }' googlebot.log \
| (read -r head; echo "$head"; sort) > crawl_rate.csv
head -4 crawl_rate.csv
Explanation: Builds an ISO-ish 2026-Jun-19T14:00 key, counts requests and 5xx, and writes a header-first sorted CSV. Expected Output:
datetime_utc,requests,server_errors
2026-Jun-19T13:00,1881,2
2026-Jun-19T14:00,2043,61
2026-Jun-19T15:00,410,0
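Keys that contain a month abbreviation sort alphabetically, so 2026-Jul would land before 2026-Jun in a multi-month file. One way around this is to map the abbreviation to a number and emit a true ISO 8601 key. The following is a sketch under the same combined-log assumptions as Step 7; the sample lines and the mon lookup array are illustrative.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
66.249.66.1 - - [01/Jul/2026:00:05:10 +0000] "GET /a HTTP/1.1" 200 10 "-" "Googlebot/2.1"
66.249.66.1 - - [30/Jun/2026:23:59:59 +0000] "GET /b HTTP/1.1" 503 10 "-" "Googlebot/2.1"
66.249.66.1 - - [30/Jun/2026:23:10:00 +0000] "GET /c HTTP/1.1" 200 10 "-" "Googlebot/2.1"
EOF

csv=$(awk 'BEGIN {
    # Map month abbreviations to two-digit numbers so keys sort chronologically.
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i = 1; i <= n; i++) mon[m[i]] = sprintf("%02d", i)
  }
  { split($4, a, /[\/:\[]/)                      # a[2]=DD a[3]=Mon a[4]=YYYY a[5]=HH
    key = a[4] "-" mon[a[3]] "-" a[2] "T" a[5] ":00"
    req[key]++; if ($9 ~ /^5/) err[key]++ }
  END { for (k in req) printf "%s,%d,%d\n", k, req[k], err[k] + 0 }' "$sample" | sort)
printf 'datetime_utc,requests,server_errors\n%s\n' "$csv"
rm -f "$sample"
```

With numeric months a plain sort is chronological, and most spreadsheet and plotting tools will parse the first column as a datetime without extra configuration.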
Edge-Case Handling: Spikes, Gaps, and Multi-File Logs
Gotcha 1 — A spike that is not Google. A sudden hourly jump can be a spoofer that slipped past your IP filter, or a new Google product (e.g., an inspection or AdsBot fetch) that crawls in bursts. Confirm the IP block and the exact user-agent token before declaring a real crawl-rate increase; the same verified-bot discipline used in awk and grep commands for log filtering keeps the series honest.
Gotcha 2 — Empty hours vanish. uniq -c only emits hours that had at least one hit, so a zero-crawl hour disappears and your chart connects across the gap, hiding a real throttle. Generate a full 0–23 scaffold and left-join your counts onto it so empty hours render as 0.
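The scaffold-and-join idea from Gotcha 2 can be done in a single awk pass for a one-day file. This is a minimal sketch, assuming input in the "count hour" shape that sort | uniq -c produces; the sample counts and the variable names are illustrative.

```shell
# Hourly counts as "count hour", the shape produced by sort | uniq -c; hour 15 is missing.
counts='1881 13
2043 14
1990 16'

# Emit one row per hour 00-23, using 0 where no count exists.
filled=$(printf '%s\n' "$counts" | awk '
  { seen[$2 ""] = $1 }                       # force a string key so "07" stays "07"
  END { for (h = 0; h < 24; h++) { k = sprintf("%02d", h); printf "%s,%d\n", k, seen[k] + 0 } }')
printf '%s\n' "$filled"
```

For multi-day series the scaffold would need one row per date and hour rather than 24 fixed rows.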
Gotcha 3 — Rotated and compressed logs. A full day may span access.log, access.log.1, and access.log.2.gz. Feed them all in one stream with zcat -f, which transparently handles both plain and gzip files:
zcat -f access.log access.log.1 access.log.2.gz \
| grep -F 'Googlebot' \
| grep -E '^(66\.249\.(6[4-9]|7[0-9]|8[0-9]|9[0-5])\.|34\.|35\.)' \
| awk '{print substr($4,14,2)}' | sort | uniq -c
Expected Output: a complete 24-hour distribution spanning the rotation boundary.
Verification: Sanity-Check the Series Against a Known Total
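Confirm the hourly buckets sum to the same total as a flat count. A mismatch means lines were lost somewhere in the pipeline. Matching totals cannot prove the substr offset is right, because uniq -c counts every line whatever text was extracted, so also confirm that every bucket label is a two-digit hour between 00 and 23.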
echo "sum of hourly: $(awk '{print substr($4,14,2)}' googlebot.log | sort | uniq -c | awk '{s+=$1} END{print s}')"
echo "flat total: $(wc -l < googlebot.log)"
Expected Output:
sum of hourly: 48213
flat total: 48213
Matching totals confirm every request landed in exactly one hour bucket.
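A complementary check looks at the bucket labels themselves rather than their sum. The sketch below counts extracted values that are not a valid hour; it assumes the same substr($4,14,2) extraction used throughout, and the sample lines (one deliberately missing its leading bracket) are illustrative.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
66.249.66.1 - - [19/Jun/2026:14:07:33 +0000] "GET /a HTTP/1.1" 200 10 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:23:59:59 +0000] "GET /b HTTP/1.1" 200 10 "-" "Googlebot/2.1"
66.249.66.1 - - 19/Jun/2026:15:00:00 +0000 "GET /c HTTP/1.1" 200 10 "-" "Googlebot/2.1"
EOF

# Count extracted "hours" that are not two digits in the range 00-23.
invalid=$(awk '{ h = substr($4, 14, 2)
                 if (h !~ /^([01][0-9]|2[0-3])$/) n++ }
               END { print n + 0 }' "$sample")
echo "invalid hour buckets: $invalid"
rm -f "$sample"
```

Anything above zero points to a shifted offset or a malformed line, which is the failure the sum check cannot see.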
Common Mistakes
- Hard-coding the wrong substring offset. substr($4,14,2) assumes the standard [DD/Mon/YYYY:HH layout. A custom log format, or a missing leading bracket, shifts every character and silently extracts the minute or the year digits. Always run the awk '{print $4}' | head check first and adjust the offset to your real format.
- Counting the raw user-agent as the crawl rate. Skipping the IP-range filter folds spoofed Googlebot into your numbers, so a scraper run looks like Google ramping up. Verify the source before trusting any spike.
- Bucketing UTC logs as if they were local time. Reporting a UTC hour to a stakeholder who thinks in local time misplaces every event by the offset and makes deploy correlations wrong. Normalize the timezone — ideally with a DST-aware tool — before charting.
Frequently Asked Questions
My logs are in UTC but I report in a local timezone — what is the safest way to convert?
For a quick single-offset view, add the offset to the extracted hour and wrap modulo 24, as in Step 6. That is fine for a one-day snapshot but breaks at daylight-saving transitions, where the offset changes mid-series. For any multi-day or recurring report, do the conversion with a DST-aware library — Python's zoneinfo applied per-timestamp is the reliable choice and avoids the twice-a-year off-by-one.
Why did Googlebot's crawl rate suddenly drop even though I did not change robots.txt?
The most common cause is server health. When Googlebot encounters a run of 5xx errors or unusually slow responses, it automatically throttles down to avoid overloading your origin, and that backoff shows up as falling hits in the hours after the errors. Correlate the drop with your 5xx count and response times per hour (Step 4); the dip almost always trails an error or latency spike rather than a robots.txt edit.
Should I count only HTML page crawls or all Googlebot requests in the rate?
It depends on the question. For raw crawl-budget consumption, count everything Googlebot fetches, including CSS, JS, and images, because they all draw from the same budget. For an "are my pages being recrawled" view, filter to document responses by excluding asset extensions, exactly as you would when isolating document URLs for any crawl-coverage analysis. Keep the two series separate so a render-asset spike does not masquerade as page-crawl growth.
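Keeping the two series separate can be done in one pass by classifying each request path. This is a sketch only: the extension list is an assumption to adapt to your site, and the sample lines are illustrative.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
66.249.66.1 - - [19/Jun/2026:14:07:33 +0000] "GET /products/ HTTP/1.1" 200 10 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:14:07:34 +0000] "GET /static/app.js?v=3 HTTP/1.1" 200 10 "-" "Googlebot/2.1"
66.249.66.1 - - [19/Jun/2026:14:08:01 +0000] "GET /img/logo.png HTTP/1.1" 200 10 "-" "Googlebot-Image/1.0"
66.249.66.1 - - [19/Jun/2026:15:00:00 +0000] "GET /blog/post.html HTTP/1.1" 200 10 "-" "Googlebot/2.1"
EOF

# Split each hour's requests into assets and documents by the URL path's extension.
# The extension list is an assumption; extend it to match your site.
split_counts=$(awk '{
    h = substr($4, 14, 2); path = $7; sub(/\?.*/, "", path)   # drop the query string
    if (tolower(path) ~ /\.(css|js|png|jpe?g|gif|svg|webp|ico|woff2?)$/) asset[h]++; else doc[h]++
    hours[h] = 1 }
  END { for (h in hours) printf "%s,%d,%d\n", h, doc[h] + 0, asset[h] + 0 }' "$sample" | sort)
printf 'hour_utc,documents,assets\n%s\n' "$split_counts"
rm -f "$sample"
```

Charting the documents and assets columns as separate lines makes a render-asset burst easy to tell apart from genuine page recrawling.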
Related Guides
- Auditing robots.txt Effectiveness with Server Logs: confirm a directive worked by reading crawl behavior before and after.
- Understanding HTTP Status Codes in Server Logs: interpret the 5xx and slow-response signals that drive throttling.
- Normalizing Log Timestamp Timezones in Python: DST-correct timezone conversion for multi-day crawl-rate series.
- awk and grep Commands for Log Filtering: the extraction and verified-bot filtering primitives behind every command here.
Part of the Robots.txt and Crawl Rate Control series.