Parsing Cloudflare Logs for Crawl Analysis
When Cloudflare fronts your site, the only complete record of how search engines crawl it lives in Cloudflare Logpush, not in your origin access log. Logpush emits newline-delimited JSON — one JSON object per request, including the cache HITs that the edge answered without ever touching your origin. This guide is a focused jq workflow for turning that raw Logpush stream into crawl-analysis answers: which crawlers hit which URLs, what status they saw, and how much of their crawl the edge served from cache.
The work breaks into a handful of concrete tasks: read the key Logpush fields, isolate Googlebot, compute the cache HIT ratio for bot requests, and chart the status distribution the crawler actually received. It sits inside the broader CDN log analysis for SEO cluster, which explains why the edge log is authoritative and how to restore the real client IP at the origin; here we assume Logpush is already flowing and concentrate on parsing it.
Diagnosis: Confirm You Have Logpush JSON, Not Origin Logs
Before any analysis, confirm the file is Logpush newline-delimited JSON. Each physical line must be one complete JSON object. A quick structural check distinguishes a valid Logpush batch from a truncated or mis-configured export:
zcat logpush-20260619.log.gz | head -n1 | jq -e 'type == "object"' >/dev/null \
&& echo "valid NDJSON object per line" || echo "NOT line-delimited JSON"
Expected Output:
valid NDJSON object per line
If jq reports a parse error instead, you may have an array-wrapped export or a non-JSON log type. Logpush for HTTP requests is always one object per line; re-check the dataset and output format in the Logpush job configuration before continuing.
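The check above inspects only the first line. As a minimal sketch for validating a whole batch, the hypothetical helper below (the name `ndjson_line_types` is an illustration, not part of any tool) reads each physical line as raw text and reports how it parses, so a truncated line shows up as `invalid` instead of aborting jq:

```shell
# Count how each physical line parses. "object" is what Logpush should emit;
# "array", "string", or "invalid" rows point at a damaged or post-processed file.
ndjson_line_types() {
  jq -R -r 'try (fromjson | type) catch "invalid"' | sort | uniq -c | sort -rn
}

# Usage: zcat logpush-20260619.log.gz | ndjson_line_types
```

A healthy batch should produce a single `object` row whose count equals the line count of the file.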
Concept: Why the Field Names Look Unusual
Cloudflare's Logpush field names are PascalCase and prefixed by where the value was observed — Client* for what the client sent, Edge* for what the edge did. The seven fields that carry a crawl analysis are below. Everything else in the default dataset is noise for this task.
| Logpush field | Example value | Meaning for crawl analysis |
|---|---|---|
| ClientIP | 66.249.66.1 | Real crawler IP, already de-proxied at the edge; use this for reverse-DNS verification |
| ClientRequestHost | example.com | Hostname requested; separate crawl across apex, www, and subdomains |
| ClientRequestURI | /pricing/?ref=nav | Path and query the crawler fetched; the unit of crawl-waste analysis |
| EdgeResponseStatus | 200 | The status the crawler actually received (edge-authoritative) |
| CacheCacheStatus | hit | Whether the edge served from cache (hit) or forwarded upstream (miss/dynamic) |
| ClientRequestUserAgent | Mozilla/5.0 (compatible; Googlebot/2.1; ...) | Self-reported agent; a filter hint, never proof of identity |
| EdgeStartTimestamp | 1781856862000000000 | When the edge began handling the request; note the unit caveat below |
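Since every recipe depends on these seven fields, it can help to confirm that the Logpush job actually includes them. The following is a small sketch, assuming the field names exactly as listed in the table; the function name `missing_logpush_fields` is hypothetical:

```shell
# Print any of the seven crawl-analysis fields that the first record lacks.
# No output means the job emits everything the recipes need.
missing_logpush_fields() {
  head -n1 | jq -r '
    ["ClientIP", "ClientRequestHost", "ClientRequestURI", "EdgeResponseStatus",
     "CacheCacheStatus", "ClientRequestUserAgent", "EdgeStartTimestamp"] - keys
    | .[]'
}

# Usage: zcat logpush-20260619.log.gz | missing_logpush_fields
```

Any name it prints would need to be added to the job's field list before the later steps produce meaningful results.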
The timestamp is in nanoseconds. By default, EdgeStartTimestamp is a Unix epoch value in nanoseconds (a 19-digit integer), not seconds and not milliseconds. Some Logpush jobs are configured to emit RFC 3339 strings (ending in Z) instead. You must know which form your job emits before any time bucketing, because reading a nanosecond value as if it were seconds inflates every timestamp by a factor of a billion and breaks the hourly buckets. Detect it once:
zcat logpush-20260619.log.gz | jq -r '.EdgeStartTimestamp' | head -n1
Expected Output (numeric nanosecond form):
1781856862000000000
If instead you see 2026-06-19T08:14:22Z, your job emits RFC 3339 and you can skip the /1000000000 divisions in the hourly recipe below.
Step-by-Step: From Raw Logpush to Crawl Answers
Step 1: Project the seven fields into a compact view. Reduce each request to just the crawl-relevant keys so later filters stay readable. jq -c keeps one object per line:
zcat logpush-20260619.log.gz | jq -c '{
ip: .ClientIP,
host: .ClientRequestHost,
uri: .ClientRequestURI,
status: .EdgeResponseStatus,
cache: .CacheCacheStatus,
ua: .ClientRequestUserAgent,
ts: .EdgeStartTimestamp
}' | head -n2
Expected Output:
{"ip":"66.249.66.1","host":"example.com","uri":"/pricing/","status":200,"cache":"hit","ua":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","ts":1781856862000000000}
{"ip":"157.55.39.27","host":"example.com","uri":"/blog/","status":200,"cache":"miss","ua":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)","ts":1781856863100000000}
The first row is the payoff of edge logging: a Googlebot request served from cache ("cache":"hit") that your origin log never recorded.
Step 2: Filter to Googlebot requests. Match the user-agent string with a test() regex on the literal Googlebot token; the pattern is unanchored, so it matches the token anywhere in the string. Use select() to keep only matching objects:
zcat logpush-20260619.log.gz \
| jq -c 'select(.ClientRequestUserAgent | test("Googlebot"))' \
| head -n2
Expected Output: compact JSON objects whose ClientRequestUserAgent all contain Googlebot. Counting them is a one-liner:
zcat logpush-20260619.log.gz \
| jq -r 'select(.ClientRequestUserAgent | test("Googlebot")) | .ClientIP' \
| wc -l
Expected Output:
51369
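The same filter can answer which URLs the crawler spends its requests on. This is a sketch rather than a full crawl-waste report: it assumes query strings should be collapsed so that /pricing/?ref=nav and /pricing/ count as one path, and the function name `googlebot_top_paths` is hypothetical:

```shell
# Rank the paths Googlebot requested most often, ignoring query strings.
# The optional first argument limits how many rows are shown (default 20).
googlebot_top_paths() {
  jq -r 'select((.ClientRequestUserAgent // "") | test("Googlebot"))
         | (.ClientRequestURI // "/") | split("?")[0]' \
    | sort | uniq -c | sort -rn | head -n "${1:-20}"
}

# Usage: zcat logpush-20260619.log.gz | googlebot_top_paths 10
```

Dropping the `split("?")[0]` step would instead show each parameter variant separately, which is the more useful view when hunting for parameter-driven crawl waste.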
Safety Note: The user-agent string is self-reported and trivially forged — a scraper can claim to be Googlebot in one header. Cloudflare also supplies a BotScore (and a verified-bot flag) you can read instead, but treat even those as triage, not proof. For any decision that matters, verify the real ClientIP with a reverse DNS lookup that resolves back into Google's domains. The full procedure is in verifying Googlebot with reverse DNS, part of the identifying search engine bots cluster.
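To make the reverse-then-forward check concrete, here is a minimal sketch. It assumes the `host` utility is installed and prints its usual "domain name pointer" and "has address" lines, and it uses the domains Google documents for its crawlers (googlebot.com, google.com, googleusercontent.com). Both function names are hypothetical, and the linked guide remains the fuller procedure:

```shell
# Succeeds (exit 0) when a hostname sits under a domain Google documents for
# its crawlers. The trailing dot that reverse lookups return is stripped first.
is_google_host() {
  case "${1%.}" in
    *.googlebot.com|*.google.com|*.googleusercontent.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper (needs network access): reverse-resolve the IP, check the
# domain, then forward-resolve the name and require it to map back to the IP.
verify_googlebot_ip() {
  ip=$1
  name=$(host "$ip" | awk '/domain name pointer/ {print $NF; exit}')
  [ -n "$name" ] || return 1
  is_google_host "$name" || return 1
  host "$name" | awk -v ip="$ip" '$NF == ip {ok = 1} END {exit !ok}'
}

# Usage: verify_googlebot_ip 66.249.66.1 && echo verified || echo "not verified"
```

The suffix match requires a dot before the domain, so a lookalike such as fakegooglebot.com does not pass, and the forward lookup guards against an attacker who controls the reverse DNS for their own IP range.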
Step 3: Compute the cache HIT ratio for Googlebot. This is the number origin logs cannot give you: what fraction of Googlebot's crawl the edge answered without involving your origin. Tally the CacheCacheStatus values for Googlebot requests:
zcat logpush-20260619.log.gz \
| jq -r 'select(.ClientRequestUserAgent | test("Googlebot")) | .CacheCacheStatus' \
| sort | uniq -c | sort -rn
Expected Output:
41280 hit
8133 miss
956 dynamic
612 expired
388 bypass
A hit means the origin never saw that request; the other statuses shown here (miss, dynamic, expired, bypass) all involved the origin. To turn the tally into a single ratio, let jq do the arithmetic in one invocation by slurping the input into an array with -s:
zcat logpush-20260619.log.gz \
| jq -rs '
map(select(.ClientRequestUserAgent | test("Googlebot")))
| (map(select(.CacheCacheStatus == "hit")) | length) as $hit
| length as $total
| "Googlebot cache HIT ratio: \($hit) / \($total) = \((($hit*1000/$total)|floor)/10)%"'
Expected Output:
Googlebot cache HIT ratio: 41280 / 51369 = 80.3%
Production Warning: jq -s (slurp) reads the entire file into memory to build one array. On a multi-gigabyte Logpush batch this will exhaust RAM. For large files, compute the ratio with the streaming uniq -c tally above and divide the counts by hand, or pipe through awk to accumulate — never slurp a full day of edge logs. The slurped one-liner is for spot checks on a single hour, not a month of data.
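One streaming alternative, sketched here under the same field-name assumptions, is to run jq with `-n` and fold over `inputs` with `reduce`. Only two counters are held in memory, so the file size should not matter. The function name `googlebot_hit_ratio` is hypothetical:

```shell
# Stream the log once and keep only a hit counter and a total counter.
# A missing user agent is treated as an empty string so test() does not fail.
googlebot_hit_ratio() {
  jq -rn '
    reduce (inputs | select((.ClientRequestUserAgent // "") | test("Googlebot"))) as $r
      ({hit: 0, total: 0};
       .total += 1
       | if $r.CacheCacheStatus == "hit" then .hit += 1 else . end)
    | if .total == 0 then "no Googlebot requests found"
      else "Googlebot cache HIT ratio: \(.hit) / \(.total) = \(((.hit * 1000 / .total) | floor) / 10)%"
      end'
}

# Usage: zcat logpush-20260619.log.gz | googlebot_hit_ratio
```

The percentage is truncated to one decimal place in the same way as the slurped one-liner, and the zero-total branch avoids a division error on files with no matching requests.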
Step 4: Chart the status distribution the crawler saw. Because EdgeResponseStatus is what the edge returned, this distribution catches edge-level 403, 429, and 503 responses the origin never logged. Bucket Googlebot's statuses by class:
zcat logpush-20260619.log.gz \
| jq -r 'select(.ClientRequestUserAgent | test("Googlebot"))
| (.EdgeResponseStatus / 100 | floor) as $class
| "\($class)xx \(.EdgeResponseStatus)"' \
| sort | uniq -c | sort -rn
Expected Output:
44120 2xx 200
4901 3xx 301
1188 4xx 404
902 3xx 308
214 5xx 503
44 4xx 410
Explanation: The 5xx 503 rows are worth attention — if those are edge rate-limit or "under attack" challenges rather than origin failures, Googlebot is being turned away at the edge and you would never see it in origin logs. To interpret each code, cross-reference understanding HTTP status codes in server logs.
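One way to separate edge-generated errors from origin failures is to compare the edge status with the origin status on the same record. The sketch below rests on an assumption that is not part of the seven fields above: that your job also includes OriginResponseStatus, and that it is 0 (or absent) when the edge answered without contacting the origin. Check both points against your own data before relying on it. The function name is hypothetical:

```shell
# For Googlebot requests that received a 4xx or 5xx, label each one by whether
# the origin produced a status. ASSUMPTION: OriginResponseStatus is 0 or missing
# when the edge generated the response itself.
googlebot_edge_vs_origin() {
  jq -r 'select((.ClientRequestUserAgent // "") | test("Googlebot"))
         | select(.EdgeResponseStatus >= 400)
         | (if (.OriginResponseStatus // 0) == 0
            then "edge-generated"
            else "origin-\(.OriginResponseStatus)"
            end) as $src
         | "\(.EdgeResponseStatus) \($src)"' \
    | sort | uniq -c | sort -rn
}

# Usage: zcat logpush-20260619.log.gz | googlebot_edge_vs_origin
```

A large `503 edge-generated` row would suggest the crawler is being turned away by an edge rule rather than by a failing origin.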
Step 5: Bucket the crawl by hour from the nanosecond timestamp. Convert EdgeStartTimestamp to a UTC hour. Divide by 1,000,000,000 to get seconds, then format with strftime (which operates in UTC via gmtime):
zcat logpush-20260619.log.gz \
| jq -r 'select(.ClientRequestUserAgent | test("Googlebot"))
| (.EdgeStartTimestamp / 1000000000 | gmtime | strftime("%Y-%m-%dT%H:00Z"))' \
| sort | uniq -c
Expected Output:
1820 2026-06-19T06:00Z
2110 2026-06-19T07:00Z
2640 2026-06-19T08:00Z
2480 2026-06-19T09:00Z
If your Logpush job emits RFC 3339 strings instead, replace the conversion with a slice: .EdgeStartTimestamp[0:13] + ":00Z".
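If you would rather not maintain two variants, a single filter can branch on the JSON type of the field. This is a sketch that assumes a numeric value is always in nanoseconds (a job configured for Unix seconds is not handled) and that a string value is RFC 3339 in UTC. The names `utc_hour` and `googlebot_hourly` are hypothetical:

```shell
# Bucket Googlebot requests by UTC hour for either timestamp form.
# Numbers are assumed to be nanoseconds; strings are assumed to be RFC 3339.
googlebot_hourly() {
  jq -r 'def utc_hour:
           if type == "number"
           then (. / 1000000000 | floor | gmtime | strftime("%Y-%m-%dT%H:00Z"))
           else .[0:13] + ":00Z"
           end;
         select((.ClientRequestUserAgent // "") | test("Googlebot"))
         | .EdgeStartTimestamp | utc_hour' \
    | sort | uniq -c
}

# Usage: zcat logpush-20260619.log.gz | googlebot_hourly
```

The `floor` before `gmtime` discards the sub-second remainder, which is irrelevant at hourly resolution.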
Edge-Case Handling
Mixed bot families in one regex. Google runs several crawlers (Googlebot-Image, Googlebot-Video, Storebot-Google, Google-InspectionTool). A bare test("Googlebot") catches the image and video variants (they contain the token) but misses Storebot-Google and the inspection tool. To capture the whole Google fleet, widen the regex and make it case-insensitive: test("Googlebot|Storebot-Google|Google-InspectionTool"; "i"). Keep the families separate when you want per-agent crawl budgets, since image crawl and HTML crawl draw on different resources.
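To keep the families separate in one pass, a named capture group can label each request by the token it matched. This is a sketch using only the agent tokens named above; the function name `google_family_tally` is hypothetical. The more specific tokens come first in the alternation so that Googlebot-Image is not counted as plain Googlebot:

```shell
# Tally requests per Google crawler family. Requests whose user agent matches
# none of the tokens produce no output and are therefore left out of the tally.
google_family_tally() {
  jq -r '(.ClientRequestUserAgent // "")
         | capture("(?<bot>Googlebot-Image|Googlebot-Video|Storebot-Google|Google-InspectionTool|Googlebot)"; "i")
         | .bot | ascii_downcase' \
    | sort | uniq -c | sort -rn
}

# Usage: zcat logpush-20260619.log.gz | google_family_tally
```

Labels are lowercased so that case variants of the same token land in one bucket.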
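Null or empty cache status. Some request types (early hints, certain WebSocket upgrades) log an empty CacheCacheStatus. An unguarded select(.CacheCacheStatus == "hit") is fine, but a tally will show a blank row (or a null row if the field is absent). The alternative operator alone does not fix this: (.CacheCacheStatus // "none") replaces only null and false, so an empty string passes through unchanged. Normalize both cases with (.CacheCacheStatus | if . == null or . == "" then "none" else . end) so they bucket as none rather than vanishing or skewing the ratio denominator.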
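A tally with that normalization applied might look like the sketch below. Because jq's `//` operator substitutes only for null and false, the empty string is tested explicitly; the function name `googlebot_cache_tally` is hypothetical:

```shell
# Tally Googlebot cache statuses, folding both missing (null) and empty ("")
# values into a single "none" bucket so they stay visible in the counts.
googlebot_cache_tally() {
  jq -r 'select((.ClientRequestUserAgent // "") | test("Googlebot"))
         | .CacheCacheStatus
         | if . == null or . == "" then "none" else . end' \
    | sort | uniq -c | sort -rn
}

# Usage: zcat logpush-20260619.log.gz | googlebot_cache_tally
```

Keeping the `none` rows in the tally means the ratio denominator still reflects every Googlebot request.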
Verification
Confirm your HIT-ratio logic is internally consistent: the count of non-hit Googlebot rows in the edge log should be the maximum number of Googlebot requests that could appear in your origin log for the same window. If origin shows more Googlebot requests than the edge non-hit count, either your origin is logging spoofed bots that reached it directly or the real-IP restoration is wrong.
# Edge: Googlebot requests that reached origin (everything but hit)
zcat logpush-20260619.log.gz \
| jq -r 'select((.ClientRequestUserAgent | test("Googlebot")) and .CacheCacheStatus != "hit") | .ClientIP' \
| wc -l
Expected Output: a count (e.g. 10089) that should be greater than or equal to the Googlebot lines in your origin access log for the same hour. A large origin excess points at unverified, spoofed Googlebot traffic hitting origin directly — investigate with reverse DNS verification.
Common Mistakes
- Treating EdgeStartTimestamp as seconds. It is nanoseconds by default (a 19-digit integer). Feeding it straight into strftime or a date library yields a date tens of billions of years in the future, or an outright error. Always divide by 1,000,000,000 first, or detect the RFC 3339 string form and slice instead.
- Trusting the user-agent (or even BotScore) as proof of Googlebot. Both are hints. The agent string is forgeable and BotScore is heuristic. Gate any consequential decision on a reverse-DNS check of the real ClientIP, which the edge records de-proxied.
- Slurping a full day of Logpush with jq -s. Slurp builds one in-memory array and will OOM on multi-gigabyte batches. Use streaming uniq -c tallies for full datasets and reserve -s for single-hour spot checks.
Frequently Asked Questions
Is Cloudflare Logpush always one JSON object per line?
Yes. For the HTTP requests dataset, Logpush emits newline-delimited JSON (NDJSON): each line is a complete, independent JSON object, which is exactly what jq consumes line by line and what NDJSON-aware pipelines ingest without an array wrapper. If a file parses as a single array, it was post-processed after export, not emitted that way by Logpush.
Why is EdgeStartTimestamp such a huge number?
By default it is the Unix epoch expressed in nanoseconds, so it has 19 digits. Divide by 1,000,000,000 to get seconds before any human-readable formatting. You can alternatively configure the Logpush job to emit an RFC 3339 string ending in Z, which is easier to read but a few bytes larger per line.
Cloudflare gives me a bot score — why still verify with reverse DNS?
The bot score and verified-bot flag are strong heuristics, but heuristics, not cryptographic proof. For crawl-budget decisions, security rules, or anything that gates access, confirm identity by reverse-resolving the real ClientIP to a hostname in Google's official crawler domains and checking that the hostname forward-resolves back to the same IP, as covered in the reverse-DNS guide.
Related Guides
- Parsing JSON Access Logs with jq — the full jq cookbook these Logpush recipes draw on, for any NDJSON access log.
- Structured JSON Logging for Analysis — emit your origin logs in the same JSON shape as Logpush for a unified pipeline.
- Identifying Search Engine Bots in Server Logs — classify crawlers from the real ClientIP recovered at the edge.
- Verifying Googlebot with Reverse DNS — prove a Logpush ClientIP is the real Googlebot before trusting the label.
Part of the CDN Log Analysis for SEO series.