Parsing Cloudflare Logs for Crawl Analysis

When Cloudflare fronts your site, the only complete record of how search engines crawl it lives in Cloudflare Logpush, not in your origin access log. Logpush emits newline-delimited JSON — one JSON object per request, including the cache HITs that the edge answered without ever touching your origin. This guide is a focused jq workflow for turning that raw Logpush stream into crawl-analysis answers: which crawlers hit which URLs, what status they saw, and how much of their crawl the edge served from cache.

The work breaks into a handful of concrete tasks: read the key Logpush fields, isolate Googlebot, compute the cache HIT ratio for bot requests, and chart the status distribution the crawler actually received. It sits inside the broader CDN log analysis for SEO cluster, which explains why the edge log is authoritative and how to restore the real client IP at the origin; here we assume Logpush is already flowing and concentrate on parsing it.

Diagnosis: Confirm You Have Logpush JSON, Not Origin Logs

Before any analysis, confirm the file is Logpush newline-delimited JSON: each physical line must be one complete JSON object. A quick check of the first line separates a genuine Logpush batch from an array-wrapped or non-JSON export (it samples only that line, so it will not catch a batch truncated further down):

zcat logpush-20260619.log.gz | head -n1 | jq -e 'type == "object"' >/dev/null \
  && echo "valid NDJSON object per line" || echo "NOT line-delimited JSON"

Expected Output:

valid NDJSON object per line

If jq reports a parse error instead, you may have an array-wrapped export or a non-JSON log type. Logpush for HTTP requests is always one object per line; re-check the dataset and output format in the Logpush job configuration before continuing.
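
The first-line check can be extended to the whole file. The following is a minimal sketch, assuming a hypothetical three-line sample in place of the real batch; it counts lines that parse to a JSON object against lines that do not, and the variable names are illustrative only:

```shell
# Hypothetical sample: two valid objects followed by one truncated line.
# On real data, replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientIP":"66.249.66.1","EdgeResponseStatus":200}
{"ClientIP":"157.55.39.27","EdgeResponseStatus":200}
{"ClientIP":"66.249.66.1","EdgeRespo'

# -R reads each line as a raw string; -n with `inputs` streams the lines one
# at a time, so memory stays flat however large the file is.
check=$(printf '%s\n' "$sample" | jq -Rnr '
  reduce inputs as $line ({objects: 0, invalid: 0};
    if ($line | try (fromjson | type == "object") catch false)
    then .objects += 1
    else .invalid += 1
    end)
  | "objects=\(.objects) invalid=\(.invalid)"')
echo "$check"
```

Any non-zero invalid count means the batch should be re-exported or repaired before the numbers from later steps are trusted.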

Concept: Why the Field Names Look Unusual

Cloudflare's Logpush field names are PascalCase and prefixed by where the value was observed: Client* for what the client sent, Edge* for what the edge did, and Cache* for the cache layer's verdict. The seven fields that carry a crawl analysis are below. Everything else in the default dataset is noise for this task.

| Logpush field | Example value | Meaning for crawl analysis |
| --- | --- | --- |
| ClientIP | 66.249.66.1 | Real crawler IP, already de-proxied at the edge; use this for reverse-DNS verification |
| ClientRequestHost | example.com | Hostname requested; separate crawl across apex, www, and subdomains |
| ClientRequestURI | /pricing/?ref=nav | Path and query the crawler fetched; the unit of crawl-waste analysis |
| EdgeResponseStatus | 200 | The status the crawler actually received (edge-authoritative) |
| CacheCacheStatus | hit | Whether the edge served from cache (hit) or forwarded upstream (miss/dynamic) |
| ClientRequestUserAgent | Mozilla/5.0 (compatible; Googlebot/2.1; ...) | Self-reported agent; a filter hint, never proof of identity |
| EdgeStartTimestamp | 1750319662000000000 | When the edge began handling the request (see the unit caveat below) |

The timestamp is in nanoseconds. EdgeStartTimestamp defaults to a Unix epoch value in nanoseconds (a 19-digit integer), not seconds and not milliseconds. Some Logpush jobs are configured to emit RFC 3339 strings (ending in Z) instead. You must know which your job emits before any time bucketing, because treating a nanosecond value as if it were seconds inflates every timestamp by a factor of a billion and makes the hourly buckets meaningless. Detect it once:

zcat logpush-20260619.log.gz | jq -r '.EdgeStartTimestamp' | head -n1

Expected Output (numeric nanosecond form):

1750319662000000000

If instead you see a string such as 2026-06-19T08:14:22Z, your job emits RFC 3339: skip the /1000000000 division in the hourly recipe below and slice the string instead.

Step-by-Step: From Raw Logpush to Crawl Answers

Step 1: Project the seven fields into a compact view. Reduce each request to just the crawl-relevant keys so later filters stay readable. jq -c keeps one object per line:

zcat logpush-20260619.log.gz | jq -c '{
  ip: .ClientIP,
  host: .ClientRequestHost,
  uri: .ClientRequestURI,
  status: .EdgeResponseStatus,
  cache: .CacheCacheStatus,
  ua: .ClientRequestUserAgent,
  ts: .EdgeStartTimestamp
}' | head -n2

Expected Output:

{"ip":"66.249.66.1","host":"example.com","uri":"/pricing/","status":200,"cache":"hit","ua":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","ts":1750319662000000000}
{"ip":"157.55.39.27","host":"example.com","uri":"/blog/","status":200,"cache":"miss","ua":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)","ts":1750319663100000000}

The first row is the payoff of edge logging: a Googlebot request served from cache ("cache":"hit") that your origin log never recorded.

Step 2: Filter to Googlebot requests. Match the user-agent string with a test() regex on the literal Googlebot token; the pattern is unanchored, so it matches the token anywhere in the string. Use select() to keep only matching objects:

zcat logpush-20260619.log.gz \
  | jq -c 'select(.ClientRequestUserAgent | test("Googlebot"))' \
  | head -n2

Expected Output: compact JSON objects whose ClientRequestUserAgent all contain Googlebot. Counting them is a one-liner:

zcat logpush-20260619.log.gz \
  | jq -r 'select(.ClientRequestUserAgent | test("Googlebot")) | .ClientIP' \
  | wc -l

Expected Output:

51369

Safety Note: The user-agent string is self-reported and trivially forged — a scraper can claim to be Googlebot in one header. Cloudflare also supplies a BotScore (and a verified-bot flag) you can read instead, but treat even those as triage, not proof. For any decision that matters, verify the real ClientIP with a reverse DNS lookup that resolves back into Google's domains. The full procedure is in verifying Googlebot with reverse DNS, part of the identifying search engine bots cluster.
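
Before running reverse DNS, it helps to reduce the log to the distinct IPs that claim to be Googlebot. The sketch below assumes a hypothetical four-line sample (203.0.113.9 is a documentation-range address standing in for a spoofer) and a hypothetical helper, is_google_ptr, that only checks whether a PTR hostname ends in googlebot.com or google.com. It does not perform the lookups themselves, and a suffix match still needs the forward-resolution step from the reverse-DNS guide:

```shell
# Hypothetical sample; on real data replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientIP":"66.249.66.1","ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
{"ClientIP":"66.249.66.1","ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
{"ClientIP":"203.0.113.9","ClientRequestUserAgent":"Googlebot/2.1 (claimed)"}
{"ClientIP":"157.55.39.27","ClientRequestUserAgent":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"}'

# Distinct IPs whose user-agent claims Googlebot, busiest first, as "ip count".
# The // "" guard keeps test() from failing on a missing user-agent field.
candidates=$(printf '%s\n' "$sample" \
  | jq -r 'select((.ClientRequestUserAgent // "") | test("Googlebot")) | .ClientIP' \
  | sort | uniq -c | sort -rn | awk '{print $2, $1}')
echo "$candidates"

# Hypothetical helper: does a PTR hostname sit under a Google crawler domain?
# ${1%.} strips the trailing dot that DNS tools often print.
is_google_ptr() {
  case "${1%.}" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}
```

Each IP in the candidate list is then a single reverse lookup, rather than one lookup per log line.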

Step 3: Compute the cache HIT ratio for Googlebot. This is the number origin logs cannot give you: what fraction of Googlebot's crawl the edge answered without involving your origin. Tally the CacheCacheStatus values for Googlebot requests:

zcat logpush-20260619.log.gz \
  | jq -r 'select(.ClientRequestUserAgent | test("Googlebot")) | .CacheCacheStatus' \
  | sort | uniq -c | sort -rn

Expected Output:

  41280 hit
   8133 miss
    956 dynamic
    612 expired
    388 bypass

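A hit means the origin never saw that request; the other statuses in this tally (miss, dynamic, expired, bypass) all went to origin. If your data also contains statuses such as stale or revalidated, the line is less clean and they need separate handling. To turn the tally into a single ratio, let jq do the arithmetic in one command by slurping the input (-s) into an array: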

zcat logpush-20260619.log.gz \
  | jq -rs '
      map(select(.ClientRequestUserAgent | test("Googlebot")))
      | (map(select(.CacheCacheStatus == "hit")) | length) as $hit
      | length as $total
      | "Googlebot cache HIT ratio: \($hit) / \($total) = \((($hit*1000/$total)|floor)/10)%"'

Expected Output:

Googlebot cache HIT ratio: 41280 / 51369 = 80.3%

Production Warning: jq -s (slurp) reads the entire file into memory to build one array. On a multi-gigabyte Logpush batch this will exhaust RAM. For large files, compute the ratio with the streaming uniq -c tally above and divide the counts by hand, or pipe through awk to accumulate — never slurp a full day of edge logs. The slurped one-liner is for spot checks on a single hour, not a month of data.
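
For batches too large to slurp, the same ratio can be accumulated in constant memory. This is a minimal sketch, assuming a hypothetical five-line sample; it uses jq -n with reduce over inputs, so only the two running counters are held in memory:

```shell
# Hypothetical sample: four Googlebot requests (three hits, one miss) and one bingbot hit.
# On real data, replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"hit"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"hit"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"miss"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"hit"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)","CacheCacheStatus":"hit"}'

# -n stops jq from reading input implicitly; `inputs` then yields one record
# at a time, and reduce folds each Googlebot record into two counters.
ratio=$(printf '%s\n' "$sample" | jq -nr '
  reduce (inputs | select((.ClientRequestUserAgent // "") | test("Googlebot"))) as $r
    ({hit: 0, total: 0};
     .total += 1
     | if $r.CacheCacheStatus == "hit" then .hit += 1 else . end)
  | if .total == 0
    then "no Googlebot requests found"
    else "Googlebot cache HIT ratio: \(.hit) / \(.total) = \(((.hit * 1000 / .total) | floor) / 10)%"
    end')
echo "$ratio"
```

The explicit zero-total branch avoids a division error on files that contain no Googlebot traffic at all.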

Step 4: Chart the status distribution the crawler saw. Because EdgeResponseStatus is what the edge returned, this distribution catches edge-level 403, 429, and 503 responses the origin never logged. Bucket Googlebot's statuses by class:

zcat logpush-20260619.log.gz \
  | jq -r 'select(.ClientRequestUserAgent | test("Googlebot"))
           | (.EdgeResponseStatus / 100 | floor) as $class
           | "\($class)xx \(.EdgeResponseStatus)"' \
  | sort | uniq -c | sort -rn

Expected Output:

  44120 2xx 200
   4901 3xx 301
   1188 4xx 404
    902 3xx 308
    214 5xx 503
     44 4xx 410

Explanation: The 5xx 503 rows are worth attention — if those are edge rate-limit or "under attack" challenges rather than origin failures, Googlebot is being turned away at the edge and you would never see it in origin logs. To interpret each code, cross-reference understanding HTTP status codes in server logs.
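
One way to separate edge-generated errors from origin failures is sketched below. It assumes your Logpush job also includes the OriginResponseStatus field, and that a value of 0 (or a missing value) means no request was sent upstream; confirm both against the Cloudflare field reference for your dataset. The sample records and the edge-generated / from-origin labels are illustrative:

```shell
# Hypothetical sample: three Googlebot 503s (two with no upstream request) and one 200.
# On real data, replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","EdgeResponseStatus":503,"OriginResponseStatus":0}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","EdgeResponseStatus":503,"OriginResponseStatus":503}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","EdgeResponseStatus":503,"OriginResponseStatus":0}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","EdgeResponseStatus":200,"OriginResponseStatus":200}'

# Keep Googlebot 5xx responses and label each by whether origin was contacted
# (assumption: OriginResponseStatus of 0 or null means it was not).
split=$(printf '%s\n' "$sample" | jq -r '
  select((.ClientRequestUserAgent // "") | test("Googlebot"))
  | select(.EdgeResponseStatus >= 500)
  | (if (.OriginResponseStatus // 0) == 0 then "edge-generated" else "from-origin" end) as $src
  | "\(.EdgeResponseStatus) \($src)"' \
  | LC_ALL=C sort | uniq -c | awk '{print $1, $2, $3}')
echo "$split"
```

A large edge-generated share suggests reviewing edge rules rather than origin capacity.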

Step 5: Bucket the crawl by hour from the nanosecond timestamp. Convert EdgeStartTimestamp to a UTC hour. Divide by 1,000,000,000 to get seconds, then format with strftime (which operates in UTC via gmtime):

zcat logpush-20260619.log.gz \
  | jq -r 'select(.ClientRequestUserAgent | test("Googlebot"))
           | (.EdgeStartTimestamp / 1000000000 | gmtime | strftime("%Y-%m-%dT%H:00Z"))' \
  | sort | uniq -c

Expected Output:

   1820 2026-06-19T06:00Z
   2110 2026-06-19T07:00Z
   2640 2026-06-19T08:00Z
   2480 2026-06-19T09:00Z

If your Logpush job emits RFC 3339 strings instead, replace the conversion with a slice: .EdgeStartTimestamp[0:13] + ":00Z".
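
If you handle exports from more than one job, a single filter can accept either timestamp form. This is a minimal sketch, assuming a hypothetical sample that mixes both forms purely for illustration (a real job emits only one); it branches on the jq type of the field:

```shell
# Hypothetical sample: two nanosecond integers in the 08:00 UTC hour and one
# RFC 3339 string in the 09:00 UTC hour.
sample='{"EdgeStartTimestamp":1781856862000000000}
{"EdgeStartTimestamp":1781856863100000000}
{"EdgeStartTimestamp":"2026-06-19T09:02:10Z"}'

# Numbers: nanoseconds -> whole seconds -> UTC hour label.
# Strings: keep "YYYY-MM-DDTHH" (13 characters) and append ":00Z".
buckets=$(printf '%s\n' "$sample" | jq -r '
  .EdgeStartTimestamp
  | if type == "number"
    then (. / 1000000000 | floor | gmtime | strftime("%Y-%m-%dT%H:00Z"))
    else .[0:13] + ":00Z"
    end' \
  | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
echo "$buckets"
```

The floor before gmtime drops the fractional second so the conversion does not depend on how a given jq version handles non-integer epochs.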

Edge-Case Handling

Mixed bot families in one regex. Google runs several crawlers besides the main Googlebot (Googlebot-Image, Googlebot-Video, Storebot-Google, Google-InspectionTool). A bare test("Googlebot") catches the image and video variants, which contain the token, but misses Storebot-Google and the inspection tool. To capture the whole Google fleet, widen the regex and make it case-insensitive: test("Googlebot|Storebot-Google|Google-InspectionTool"; "i"). Keep the families separate when you want per-agent crawl budgets, since image crawl and HTML crawl compete for different resources.
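
A per-family tally can be sketched with match() instead of test(), so the matched token itself becomes the bucket name. The sample user-agent strings below are hypothetical stand-ins, and the more specific tokens are listed before the bare Googlebot token so that the image and video variants are not absorbed into it:

```shell
# Hypothetical sample: two Googlebot, one Googlebot-Image, one Storebot-Google, one bingbot.
# On real data, replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
{"ClientRequestUserAgent":"Googlebot-Image/1.0"}
{"ClientRequestUserAgent":"Mozilla/5.0 (X11; Linux x86_64; Storebot-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"}'

# match() emits nothing when the pattern is absent, so non-Google agents drop out.
# .string is the matched token; lowercasing merges case variants into one bucket.
families=$(printf '%s\n' "$sample" | jq -r '
  (.ClientRequestUserAgent // "")
  | match("Googlebot-Image|Googlebot-Video|Storebot-Google|Google-InspectionTool|Googlebot"; "i")
  | .string | ascii_downcase' \
  | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
echo "$families"
```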

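Null or empty cache status. Some request types (early hints, certain WebSocket upgrades) log an empty CacheCacheStatus. An unguarded select(.CacheCacheStatus == "hit") is fine, but a tally will show a blank row, or a literal null row if the key is missing. The // operator alone is not enough, because it replaces only null and false and leaves an empty string untouched. Normalize both cases with (.CacheCacheStatus | if . == null or . == "" then "none" else . end) so they bucket as none rather than cluttering the tally with blank rows.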
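
As a minimal sketch of that normalization, assuming a hypothetical five-line sample with one empty status and one record that lacks the key entirely: jq's // operator only substitutes for null or false, so the empty string is handled with an explicit comparison.

```shell
# Hypothetical sample: two hits, one miss, one empty string, one missing key.
sample='{"CacheCacheStatus":"hit"}
{"CacheCacheStatus":""}
{"ClientIP":"66.249.66.1"}
{"CacheCacheStatus":"miss"}
{"CacheCacheStatus":"hit"}'

# Map both null (missing key) and "" to the label "none" before tallying.
tally=$(printf '%s\n' "$sample" | jq -r '
  .CacheCacheStatus
  | if . == null or . == "" then "none" else . end' \
  | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')
echo "$tally"
```

With the label in place, the none bucket is visible in the denominator and can be included or excluded deliberately.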

Verification

Confirm your HIT-ratio logic is internally consistent: the count of non-hit Googlebot rows in the edge log is the maximum number of Googlebot requests that could appear in your origin log for the same window. If origin shows more Googlebot requests than the edge non-hit count, either spoofed bots are reaching origin directly without passing through Cloudflare, or your real-IP restoration is wrong.

# Edge: Googlebot requests that reached origin (everything but hit)
zcat logpush-20260619.log.gz \
  | jq -r 'select((.ClientRequestUserAgent | test("Googlebot")) and .CacheCacheStatus != "hit") | .ClientIP' \
  | wc -l

Expected Output: a count (e.g. 10089) that should be greater than or equal to the number of Googlebot lines in your origin access log for the same window. A large origin excess points at unverified, spoofed Googlebot traffic hitting origin directly; investigate with reverse DNS verification.
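
The comparison itself can be scripted. The sketch below assumes a hypothetical four-line edge sample and a hypothetical origin_googlebot count that you would obtain separately from your origin access log for the same window; the verdict strings are illustrative:

```shell
# Hypothetical edge sample: Googlebot hit, miss, and dynamic, plus one bingbot miss.
# On real data, replace the printf with: zcat logpush-20260619.log.gz
sample='{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"hit"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"miss"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","CacheCacheStatus":"dynamic"}
{"ClientRequestUserAgent":"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)","CacheCacheStatus":"miss"}'

# Count Googlebot requests that were not cache hits, streaming one record at a time.
edge_nonhit=$(printf '%s\n' "$sample" | jq -n '
  reduce (inputs
          | select((.ClientRequestUserAgent // "") | test("Googlebot"))
          | select(.CacheCacheStatus != "hit")) as $r (0; . + 1)')

# Hypothetical figure: Googlebot lines counted in the origin access log.
origin_googlebot=3

if [ "$origin_googlebot" -le "$edge_nonhit" ]; then
  verdict="consistent: origin $origin_googlebot <= edge non-hit $edge_nonhit"
else
  verdict="origin excess of $((origin_googlebot - edge_nonhit)): verify those IPs with reverse DNS"
fi
echo "$verdict"
```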

Common Mistakes

  • Treating EdgeStartTimestamp as seconds. It is nanoseconds by default (a 19-digit integer). Feeding it straight into strftime or a date library yields a date tens of billions of years in the future, or an outright range error. Always divide by 1,000,000,000 first, or detect the RFC 3339 string form and slice instead.
  • Trusting the user-agent (or even BotScore) as proof of Googlebot. Both are hints. The agent string is forgeable and BotScore is heuristic. Gate any consequential decision on a reverse-DNS check of the real ClientIP, which the edge records de-proxied.
  • Slurping a full day of Logpush with jq -s. Slurp builds one in-memory array and will OOM on multi-gigabyte batches. Use streaming uniq -c tallies for full datasets and reserve -s for single-hour spot checks.

Frequently Asked Questions

Is Cloudflare Logpush always one JSON object per line?
Yes. For the HTTP requests dataset, Logpush emits newline-delimited JSON (NDJSON): each line is a complete, independent JSON object, which is exactly what jq consumes line by line and what NDJSON-aware pipelines ingest without an array wrapper. If a file parses as a single array, it was post-processed after export, not emitted that way by Logpush.

Why is EdgeStartTimestamp such a huge number?
By default it is the Unix epoch expressed in nanoseconds, so it has 19 digits. Divide by 1,000,000,000 to get seconds before any human-readable formatting. You can alternatively configure the Logpush job to emit an RFC 3339 string ending in Z, which is easier to read but a few bytes larger per line.

Cloudflare gives me a bot score — why still verify with reverse DNS?
The bot score and verified-bot flag are strong heuristics, but heuristics, not cryptographic proof. For crawl-budget decisions, security rules, or anything that gates access, confirm identity by reverse-resolving the real ClientIP and checking it forward-resolves back into the official Googlebot domains, as covered in the reverse-DNS guide.

Part of the CDN Log Analysis for SEO series.