CDN Log Analysis for SEO (Cloudflare & Fastly)

When a CDN sits in front of your origin, your origin access logs stop telling the truth about how search engines crawl your site. The edge serves cached responses without ever touching the origin, and every request that does reach the origin arrives wearing the CDN's IP address instead of the crawler's. Analyze origin logs alone and you will undercount Googlebot, miscount status codes, and attribute every hit to a Cloudflare or Fastly edge node.

This guide shows how to reconstruct the true crawl picture across two log surfaces: the origin (Nginx/Apache) and the CDN edge (Cloudflare Logpush, Fastly real-time logs). You will restore the real client IP at the origin, parse edge-log JSON schemas, segment crawler traffic by cache status, and reconcile the two views so neither hides what the other sees.

Restore the real client IP at the origin so logs show crawlers, not edge nodes
Parse Cloudflare Logpush and Fastly edge fields to recover cache-masked crawl hits
Segment requests by edge response status and cache status for accurate crawl accounting
Reconcile edge versus origin counts to eliminate blind spots in bot analysis

This guide sits under Server Log Fundamentals & Compliance, which establishes the baseline log architecture you should understand before adding a CDN layer.

How a CDN Splits Your Crawl Picture in Two

A request from a search engine crawler does not travel in a straight line to your origin. It hits the nearest CDN edge node first. If the edge holds a fresh cached copy (a cache HIT), it answers immediately and the origin never sees that request. Only on a cache MISS does the edge forward the request upstream — and when it does, the origin sees the edge node's IP, not the crawler's. The diagram below shows where each request becomes visible and where the real client IP survives.

The practical consequence: edge logs are the only complete record of crawl behavior, because they capture both cache HITs and MISSes with the real client IP intact. Origin logs see only the MISS-forwarded subset, and only with the correct client IP if you configure IP restoration. Treat the edge log as the source of truth for crawl analysis and the origin log as a supplementary view of cache-miss load.

Prerequisites

Administrative access to your origin web server (Nginx with ngx_http_realip_module, or Apache with mod_remoteip).
The list of trusted CDN IP ranges for your provider (Cloudflare publishes IPv4/IPv6 ranges; Fastly publishes an equivalent list via its public API).
For edge logs: a Cloudflare Enterprise or Logpush-enabled zone, or a Fastly real-time log streaming endpoint (S3, GCS, or an HTTP collector).
jq installed for inspecting JSON edge logs, and a working knowledge of your Apache and Nginx log formats so you can compare origin and edge field semantics.

Environment Setup: Restoring the Real Client IP

The origin's first job is to stop logging CDN edge IPs. Both Cloudflare and Fastly forward the original client address in HTTP headers: CF-Connecting-IP (Cloudflare), True-Client-IP (Cloudflare Enterprise), Fastly-Client-IP (Fastly), and the standard X-Forwarded-For (XFF) chain. The origin web server must be told to trust those headers, but only when the connecting peer is a known CDN address.

Nginx real_ip configuration. The realip module rewrites $remote_addr from a trusted header. Restrict the trusted set to your CDN's published ranges:

# /etc/nginx/conf.d/cloudflare-realip.conf
# Trust only Cloudflare edge ranges (abbreviated; load the full published list)
set_real_ip_from 173.245.48.0/20;
set_real_ip_from 103.21.244.0/22;
set_real_ip_from 2400:cb00::/32;

real_ip_header CF-Connecting-IP;
real_ip_recursive on;

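Explanation: set_real_ip_from whitelists the edge ranges; real_ip_header names the header to trust. real_ip_recursive on tells Nginx to skip trusted addresses and use the last (rightmost) address in the header that is not in a trusted range. That matters for a multi-hop X-Forwarded-For chain and has no practical effect on the single-valued CF-Connecting-IP. With this in place, $remote_addr in your log format now holds the crawler's IP.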

Apache mod_remoteip configuration. Apache uses an equivalent directive set:

# /etc/apache2/conf-available/cloudflare-realip.conf
RemoteIPHeader CF-Connecting-IP
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
RemoteIPTrustedProxy 2400:cb00::/32

Then log %a (the corrected remote IP) instead of %h in your LogFormat.

Verification. After reloading, send a request through the CDN and confirm the origin logs the real client IP, not an edge node:

nginx -t && systemctl reload nginx
# From an external host, request an uncached (cache-bypassing) URL, then:
tail -n1 /var/log/nginx/access.log | awk '{print $1}'

Expected Output: the leftmost field is your external test host's public IP (e.g. 203.0.113.45), not a Cloudflare range like 173.245.x.x. If you still see edge IPs, the header name or trusted ranges are wrong.

Safety Note: Never trust X-Forwarded-For from arbitrary source ranges. If set_real_ip_from includes 0.0.0.0/0 or you trust XFF without restricting the proxy list, any client can spoof their IP by sending a forged header, defeating bot verification and poisoning rate-limit and security rules. Trust only the published CDN ranges, and prefer the provider-specific header (CF-Connecting-IP, True-Client-IP) over raw XFF, because the CDN overwrites those headers and they cannot be spoofed end to end.
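The trust rule above can be sketched in a few lines of Python, which is handy when re-deriving client IPs from raw logs offline. This is a minimal illustration, not how Nginx or Apache implement it: the function names and the abbreviated range list are assumptions, and in practice you would load the full published CDN list.

```python
import ipaddress

# Abbreviated, illustrative subset of Cloudflare ranges (assumption: load the
# full published list in real use).
TRUSTED_PROXIES = [
    ipaddress.ip_network("173.245.48.0/20"),
    ipaddress.ip_network("103.21.244.0/22"),
    ipaddress.ip_network("2400:cb00::/32"),
]


def is_trusted_peer(peer_ip: str) -> bool:
    """Return True when the connecting peer falls inside a trusted CDN range."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in TRUSTED_PROXIES if net.version == addr.version)


def resolve_client_ip(peer_ip: str, headers: dict) -> str:
    """Honor CF-Connecting-IP only when the peer is a trusted edge node."""
    if not is_trusted_peer(peer_ip):
        return peer_ip  # header ignored: an untrusted peer may have forged it
    candidate = headers.get("CF-Connecting-IP", "").strip()
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return peer_ip  # header missing or malformed, fall back to the peer


print(resolve_client_ip("173.245.48.10", {"CF-Connecting-IP": "66.249.66.1"}))
print(resolve_client_ip("198.51.100.7", {"CF-Connecting-IP": "66.249.66.1"}))
```

The first call comes from a trusted edge range, so the header value is used; the second comes from an arbitrary peer, so the forged header is discarded and the peer address is kept.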

Edge log fields. Origin restoration fixes only the MISS-forwarded subset. To see the full crawl, enable edge logging. Cloudflare Logpush emits newline-delimited JSON with fields such as ClientIP, ClientRequestHost, EdgeResponseStatus, and CacheCacheStatus. Fastly's real-time logs are templated, so you define the schema yourself, typically mirroring the same concepts (client_ip, host, status, fastly_info.state).

Pipeline Configuration: From Edge Logs to Crawl Segments

With both surfaces emitting data, build a pipeline that restores IPs, parses the edge JSON, and segments by status and cache result. Work through these steps in order.

Step 1: Configure real client IP restore on every origin node. Push the realip config to all origin servers via your configuration manager, and pin the CDN range list to an automated refresh so new edge ranges are trusted without manual edits.

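# Refresh Cloudflare IPv4 ranges and rebuild the realip include.
# -f makes curl fail on HTTP errors; the temp file is installed only if non-empty.
tmp=$(mktemp)
curl -fsS https://www.cloudflare.com/ips-v4 \
  | sed 's/^/set_real_ip_from /; s/$/;/' > "$tmp"
[ -s "$tmp" ] && install -m 644 "$tmp" /etc/nginx/conf.d/cf-ranges.conf
rm -f "$tmp"
nginx -t && systemctl reload nginx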

Expected Output: nginx: configuration file /etc/nginx/nginx.conf test is successful, followed by a clean reload. The generated file lists each Cloudflare range as a set_real_ip_from directive.

Production Warning: Run the range refresh on a schedule and validate with nginx -t before reloading. A malformed or empty download will produce a broken include; gate the reload on a successful config test so a failed fetch never takes the site offline.
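One way to catch a malformed or empty download before it reaches the include file is to parse every entry as a CIDR block first. The sketch below assumes the downloaded list is passed in as a string; build_realip_include is a hypothetical helper name, not part of any CDN tooling.

```python
import ipaddress


def build_realip_include(raw_list: str) -> str:
    """Turn a newline-separated CIDR list into set_real_ip_from directives.

    Raises ValueError on an empty list or any malformed entry, so a bad
    download can be rejected before it replaces the live include file.
    """
    entries = [line.strip() for line in raw_list.splitlines() if line.strip()]
    if not entries:
        raise ValueError("range list is empty")
    directives = []
    for entry in entries:
        network = ipaddress.ip_network(entry)  # raises ValueError if malformed
        directives.append(f"set_real_ip_from {network};")
    return "\n".join(directives) + "\n"


print(build_realip_include("173.245.48.0/20\n103.21.244.0/22\n"), end="")
```

An HTML error page or a truncated response fails the ip_network parse, so the caller can keep the previous include file and skip the reload.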

Step 2: Parse Cloudflare Logpush JSON fields. Each Logpush line is a JSON object. Extract the crawl-relevant fields with jq to confirm the schema before wiring it into an ingestion pipeline:

zcat logpush-20260619.log.gz | jq -c '{
  ip: .ClientIP,
  host: .ClientRequestHost,
  ua: .ClientRequestUserAgent,
  status: .EdgeResponseStatus,
  cache: .CacheCacheStatus,
  ts: .EdgeStartTimestamp
}' | head -n3

Expected Output: compact JSON objects, one per request, such as:

{"ip":"66.249.66.1","host":"example.com","ua":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)","status":200,"cache":"hit","ts":"2026-06-19T08:14:22Z"}

This Googlebot HIT never appeared in your origin logs — the edge served it from cache. That is exactly the data you were missing. Structured JSON makes this trivial to query; if you also want your origin logs in the same shape, see structured JSON logging for analysis.
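When the schema check moves from the shell into an ingestion job, the same projection can be written in Python. This is a sketch under the assumption that each line is one JSON object with the Logpush field names shown above; project_logpush and the short key names are illustrative.

```python
import json
from typing import Iterable, Iterator

# Short name -> Logpush field name, mirroring the jq projection.
FIELD_MAP = {
    "ip": "ClientIP",
    "host": "ClientRequestHost",
    "ua": "ClientRequestUserAgent",
    "status": "EdgeResponseStatus",
    "cache": "CacheCacheStatus",
    "ts": "EdgeStartTimestamp",
}


def project_logpush(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one projected dict per valid JSON line, skipping blanks and junk."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or corrupt line
        if isinstance(record, dict):
            yield {short: record.get(full) for short, full in FIELD_MAP.items()}
```

To read a compressed batch, pass an open text handle, for example `with gzip.open("logpush-20260619.log.gz", "rt") as fh: rows = list(project_logpush(fh))`. Missing fields come back as None rather than raising.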

Production Warning: Logpush batches can arrive minutes late and occasionally out of order. Do not treat a single batch as a complete time window; aggregate over a closed interval (for example, the previous full hour) before computing crawl-rate metrics.
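The closed-interval rule can be sketched as an hourly bucketing function that drops the hour still in progress. It assumes RFC 3339 UTC timestamps like the one shown above; hourly_counts is a hypothetical name, and a real pipeline may want an extra grace delay on top of this to absorb late batches.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_counts(timestamps, now):
    """Count requests per UTC hour, keeping only hours that have fully elapsed."""
    current_hour = now.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    counts = Counter()
    for ts in timestamps:
        # Swap the trailing "Z" for an explicit offset so older Pythons can parse it.
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        bucket = parsed.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        if bucket < current_hour:  # skip the open, still-filling hour
            counts[bucket] += 1
    return counts
```

Because arrival order does not matter to a Counter, out-of-order batches are handled for free; only lateness beyond the closed hour needs separate treatment.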

Step 3: Segment by EdgeResponseStatus and CacheCacheStatus. The two fields together tell you what the crawler experienced and whether the origin was involved. Aggregate them to find cache coverage per status class:

zcat logpush-20260619.log.gz \
  | jq -r 'select((.ClientRequestUserAgent // "") | test("Googlebot")) | "\(.EdgeResponseStatus) \(.CacheCacheStatus)"' \
  | sort | uniq -c | sort -rn

Expected Output: counts of each status/cache combination for Googlebot, e.g.:

  41280 200 hit
   9133 200 miss
    612 301 dynamic
    344 404 miss

Explanation: 200 hit rows are crawl hits the origin never logged. The 200 miss, 404 miss, and 301 dynamic rows all reached the origin and should reconcile with origin log counts. dynamic means the edge treated the response as uncacheable and always forwarded it.
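To turn those counts into a cache-coverage figure per status class, a small aggregation is enough. The sketch below takes (count, status, cache) tuples like the uniq -c output above; cache_coverage_by_class is an illustrative name, and the figures are the sample numbers from this guide, not measurements.

```python
from collections import defaultdict


def cache_coverage_by_class(rows):
    """Return the share of requests served as cache hits, keyed by status class."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, status, cache in rows:
        status_class = f"{int(status) // 100}xx"
        totals[status_class] += count
        if cache == "hit":
            hits[status_class] += count
    return {cls: hits[cls] / totals[cls] for cls in totals}


sample = [(41280, 200, "hit"), (9133, 200, "miss"), (612, 301, "dynamic"), (344, 404, "miss")]
for cls, share in sorted(cache_coverage_by_class(sample).items()):
    print(f"{cls}: {share:.1%} served from edge cache")
```

With the sample data, roughly 82% of Googlebot's 2xx requests were answered at the edge, which is the portion an origin-only analysis would miss.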

Safety Note: When filtering by user agent, remember the agent string is self-reported and trivially forged. The 66.249.x IP here is the real client IP from the edge, which you can verify by reverse DNS; never gate analysis on the user-agent string alone. See detecting fake Googlebot traffic in access logs for the verification procedure.
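A forward-confirmed reverse DNS check can be sketched as follows. The hostname suffixes are an assumption to confirm against Google's current crawler documentation, and verify_googlebot is a hypothetical helper; the lookups are injectable so the logic can be tested without network access.

```python
import socket

# Assumed suffixes for Googlebot reverse DNS names; confirm before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def verify_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (
        lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    )
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the claimed hostname must map back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Calling verify_googlebot("66.249.66.1") with the default lookups performs live DNS queries, so cache the results per IP when processing large log batches.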

Parsing Logic & Field Mapping

Edge logs and origin logs name the same concepts differently. The table below maps the core Cloudflare Logpush fields to their meaning and SEO use. (Fastly equivalents are noted where the names diverge.)

Edge field (Cloudflare)	Fastly equivalent	Meaning	SEO use
`ClientIP`	`client_ip`	Real client address at the edge, already de-proxied	Verify Googlebot via reverse DNS; build crawler IP segments without origin realip config
`ClientRequestHost`	`req.http.host`	Hostname the client requested	Separate crawl by hostname/subdomain; spot crawl on staging or legacy hosts
`ClientRequestUserAgent`	`req.http.User-Agent`	Raw user-agent string sent by the client	Classify bot vs human; segment Googlebot, Bingbot, AI crawlers
`EdgeResponseStatus`	`status`	HTTP status the edge returned to the client	True status the crawler saw; catches edge-level 403/503 the origin never logged
`CacheCacheStatus`	`fastly_info.state`	Cache result: `hit`, `miss`, `expired`, `dynamic`, `bypass`, `unknown`	Distinguish origin-touching MISS hits from edge-served HITs; measure cache coverage of crawled URLs
`ClientRequestURI`	`req.url`	Path and query string requested	Identify crawl waste on parameterized URLs; map crawl frequency per path
`EdgeStartTimestamp`	`time_start`	When the edge began handling the request (UTC)	Build crawl-rate-by-hour metrics; reconcile with origin timestamps after timezone normalization

The two status fields deserve emphasis. EdgeResponseStatus is the status the crawler actually received, which can differ from the origin's status when the edge applies its own rules: a WAF challenge, a rate-limit 429, or a stale 200 served from cache during an origin outage. For an accurate picture of what search engines see, the edge status is authoritative. To interpret the status codes themselves, cross-reference understanding HTTP status codes in server logs under the broader log field interpretation and decoding cluster.

CacheCacheStatus is the field that explains why origin and edge counts disagree. Values map roughly as follows: hit (served from edge cache, origin untouched), miss (not cached, forwarded to origin), expired (cached but stale, revalidated against origin), dynamic (deemed uncacheable, always forwarded), bypass (cache explicitly skipped), and unknown (status could not be determined). Only miss, expired, dynamic, and bypass should appear in origin logs; hit rows are edge-exclusive.
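That rule gives a direct reconciliation check: count the edge rows whose cache status implies an origin fetch, then compare per URL against the origin log. The sketch below assumes both inputs are already filtered to the same crawler and time window; reconcile and ORIGIN_TOUCHING are illustrative names.

```python
from collections import Counter

# Cache statuses that, per the mapping above, imply the origin was contacted.
ORIGIN_TOUCHING = {"miss", "expired", "dynamic", "bypass"}


def reconcile(edge_rows, origin_uris):
    """Return {uri: (expected_at_origin, observed_at_origin)} for mismatched URLs.

    edge_rows: iterable of (uri, cache_status) tuples from the edge log.
    origin_uris: iterable of URIs seen in the origin log.
    """
    expected = Counter(uri for uri, cache in edge_rows if cache in ORIGIN_TOUCHING)
    observed = Counter(origin_uris)
    return {
        uri: (expected[uri], observed[uri])
        for uri in sorted(set(expected) | set(observed))
        if expected[uri] != observed[uri]
    }


edge = [("/pricing/", "hit"), ("/pricing/", "miss"), ("/blog/", "dynamic"), ("/blog/", "hit")]
print(reconcile(edge, ["/pricing/"]))
```

An empty result means the two surfaces agree on the forwarded subset; a non-empty one points at URLs where requests were dropped, retried, or logged on only one side.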

Validation & Troubleshooting

Each failure mode below corrupts crawl analysis in a specific, recognizable way. Use the named recipe to confirm and recover.

Failure mode: all origin requests show the CDN IP. Symptom: awk '{print $1}' access.log | sort | uniq -c returns a tiny set of CDN ranges instead of diverse client IPs.

awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

Expected (broken) Output: a handful of 173.245.x / 162.158.x rows accounting for nearly all traffic. Fix: the realip module is not active or the header is wrong. Confirm the module is loaded (nginx -V 2>&1 | grep -o realip) and that real_ip_header matches the header your CDN actually sends. After fixing, the same command should show many distinct client IPs.

Failure mode: spoofed X-Forwarded-For. Symptom: implausible client IPs in logs (private ranges, known crawler IPs from non-crawler request patterns) because XFF is trusted from untrusted peers.

# Look for RFC1918 or reserved addresses leaking through as client IPs
awk '{print $1}' /var/log/nginx/access.log \
  | grep -E '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.|127\.)' | sort | uniq -c

Expected (healthy) Output: empty, or only your own internal monitoring hosts. Fix: tighten set_real_ip_from to published CDN ranges only and switch to the provider header (CF-Connecting-IP) instead of raw XFF. The CDN overwrites that header, so it cannot be forged by the original client.

Failure mode: cache hides the crawl. Symptom: origin logs show Googlebot fetching a page only a few times a week, yet search results reflect recent changes to it, because the edge serves most crawls from cache and the origin never sees them.

# Compare edge HIT vs origin-visible MISS for one URL
zcat logpush-*.log.gz \
  | jq -r 'select(.ClientRequestURI=="/pricing/" and ((.ClientRequestUserAgent // "") | test("Googlebot"))) | .CacheCacheStatus' \
  | sort | uniq -c

Expected Output: a high hit count alongside a small miss count, proving the origin log undercounts crawl frequency for that URL. Fix: always derive crawl-rate metrics from edge logs, not origin logs, for any cacheable path.

Failure mode: timezone and UTC mismatch in edge logs. Symptom: edge and origin crawl-rate-by-hour charts are offset by a fixed number of hours, so the same crawl spike appears at two different times.

# Edge timestamps are UTC (Zulu); origin is often local time
zcat logpush-*.log.gz | jq -r '.EdgeStartTimestamp' | head -n1
tail -n1 /var/log/nginx/access.log | grep -oE '\[[^]]+\]'

Expected Output: the edge value ends in Z (e.g. 2026-06-19T08:14:22Z), while the origin shows a local offset (e.g. [19/Jun/2026:10:14:22 +0200]). Fix: normalize both to UTC before joining. Cloudflare Logpush and Fastly emit UTC by default; convert origin timestamps to UTC during ingestion so the two surfaces line up on a shared clock.
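The normalization step can be sketched with the standard library alone. It assumes the default Nginx/Apache bracketed time format with a numeric offset, as in the example above; origin_ts_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def origin_ts_to_utc(bracketed: str) -> str:
    """Convert '[19/Jun/2026:10:14:22 +0200]' to an RFC 3339 UTC string."""
    # %b parses English month abbreviations under the default C locale.
    local = datetime.strptime(bracketed.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(origin_ts_to_utc("[19/Jun/2026:10:14:22 +0200]"))  # 2026-06-19T08:14:22Z
```

After conversion the origin value matches the edge timestamp format, so the two datasets can be joined or bucketed on the same clock.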

Common Mistakes

Analyzing only origin logs behind a CDN. Root cause: assuming the origin sees every request. The edge serves cache HITs without touching the origin, so origin logs systematically undercount crawl frequency for cacheable URLs. Fix: treat edge logs as the source of truth for crawl analysis.
Trusting X-Forwarded-For from any source. Root cause: a permissive set_real_ip_from (or trusting XFF without a proxy whitelist) lets clients forge their IP. Fix: trust only published CDN ranges and prefer CF-Connecting-IP / True-Client-IP, which the edge overwrites.
Reconciling edge status against origin status as if they must match. Root cause: ignoring that the edge can return 403, 429, or 503 on its own. Fix: use EdgeResponseStatus for what the crawler saw, and expect divergence on WAF challenges and stale-while-revalidate hits.
Joining edge and origin logs without normalizing timezones. Root cause: edge logs are UTC while origin logs are often local time. Fix: convert everything to UTC during ingestion before computing any time-bucketed metric.
Filtering crawlers by user-agent string alone. Root cause: the user agent is self-reported and easily spoofed. Fix: verify the real ClientIP from the edge log via reverse DNS before trusting any Googlebot label.

Frequently Asked Questions

Why does my origin log show Cloudflare or Fastly IPs instead of the visitor's address?
Because the CDN terminates the client connection at the edge and opens a new connection to your origin from an edge node. Until you configure the realip module (Nginx) or mod_remoteip (Apache) to trust the CDN ranges and read CF-Connecting-IP or True-Client-IP, every origin log line records the edge node's IP.

Do I still need origin logs if I have Cloudflare Logpush?
Edge logs are the complete crawl record, so for crawl-frequency and bot analysis they are sufficient and more accurate. Origin logs remain useful for measuring true origin load (the cache-miss subset), debugging application errors the edge passes through, and cross-checking that the edge and origin agree on the MISS-forwarded requests.

How do I tell which crawler hits were served from cache versus forwarded to my origin?
Read the CacheCacheStatus field in the edge log. A value of hit means the edge answered from cache and the origin never saw the request; miss, expired, dynamic, and bypass all mean the request reached the origin. Segmenting Googlebot traffic by this field reveals exactly how much crawl your origin logs are missing.

Are Cloudflare and Fastly edge log timestamps in local time or UTC?
Both emit timestamps in UTC (Cloudflare uses an ISO 8601 Zulu format ending in Z). Origin web servers usually log in the server's local time. Normalize both to UTC before joining the two datasets, or crawl-rate charts will be offset by your timezone difference.

Related Guides

Parsing Cloudflare Logs for Crawl Analysis: the step-by-step jq and pipeline workflow for Logpush crawl data.
Apache vs Nginx Log Formats: how origin field semantics compare so you can align edge and origin schemas.
Structured JSON Logging for Analysis: emit origin logs in the same JSON shape as edge logs for a unified pipeline.
Identifying Search Engine Bots in Server Logs: classify and verify crawlers from the real client IP recovered at the edge.
Detecting Fake Googlebot Traffic: verify edge ClientIP via reverse DNS before trusting any user-agent label.

Part of the Server Log Fundamentals & Compliance series.
