Detecting Fake Googlebot Traffic in Access Logs

Scrapers, vulnerability scanners, and content thieves routinely set their User-Agent to Googlebot, betting that you will wave them through. The cost is real: spoofed crawler traffic skews your crawl-budget analytics, hides inside your "good bot" allowlists, and hammers expensive endpoints while you assume Google is just being thorough. This guide finds requests that claim the Googlebot user-agent but fail forward-confirmed reverse DNS, quantifies how much of your traffic is fake, isolates the offending IPs and the paths they target, and rate-limits or blocks them in Nginx without ever touching the real crawler.

If you have not already built the verification primitive, read verifying Googlebot with reverse DNS lookup first — this page reuses that FCrDNS check as its core filter and is part of the broader work of identifying search engine bots in server logs.

The Symptom: Googlebot Volume That Doesn't Add Up

Your logs show tens of thousands of Googlebot hits, but Search Console crawl stats report a fraction of that, and a suspicious share of the "crawler" requests land on login pages, xmlrpc.php, or parameterized search URLs that Google would never prioritize. Those are the fingerprints of impersonation.

Count the unique IPs claiming to be Googlebot, using the same field conventions as awk and grep commands for log filtering:

grep -i "googlebot" access.log | awk '{print $1}' | sort -u | wc -l

Expected Output:

214

Real Googlebot crawls from a limited set of Google-owned IP ranges, which Google publishes. If you see hundreds of distinct IPs scattered across unrelated hosting networks, most are fake. Confirm by looking at where they go:

grep -i "googlebot" access.log | awk '{print $1, $9, $7}' \
  | grep -E " (40[0-9]|50[0-9]) " | sort | uniq -c | sort -nr | head

Expected Output:

   612 45.143.200.18 401 /wp-login.php
   388 185.220.101.7 403 /xmlrpc.php
   201 23.95.97.59 404 /.env
    97 91.219.236.4 403 /administrator/

Genuine Googlebot does not brute-force /wp-login.php. These error responses — read more in understanding HTTP status codes in server logs — are a strong secondary signal, but the definitive test is still FCrDNS.

Concept: Why the User-Agent Lies and DNS Doesn't

The user-agent is a client-supplied string with zero authentication; anyone can send Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). What an impersonator cannot forge is control of Google's reverse-DNS zone. A real Googlebot IP reverse-resolves to a hostname under googlebot.com or google.com, and that hostname forward-resolves back to the same IP. Fake bots fail that round-trip: either the PTR record points somewhere else, or a forged PTR names a Google host whose forward lookup does not return the attacker's address. So "fake Googlebot" is defined precisely as: a request whose user-agent contains Googlebot and whose source IP fails forward-confirmed reverse DNS. Everything below operationalizes that definition at log scale.
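
The round-trip described above can be sketched in shell. This is only an illustration, not the verify-bot.sh script from the verification guide: the function names is_google_crawler_host and fcrdns_check are hypothetical, the sketch assumes dig is installed, and it accepts hostnames under googlebot.com and google.com.

```shell
#!/usr/bin/env bash
# Illustrative FCrDNS sketch. Function names are hypothetical; this is not
# the verify-bot.sh script referenced elsewhere in this guide.

# Return 0 if a PTR hostname sits under a Google crawler domain.
is_google_crawler_host() {
  local host
  # Drop the trailing dot that dig prints, then lowercase for matching.
  host=$(printf '%s' "${1%.}" | tr '[:upper:]' '[:lower:]')
  case "$host" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Return 0 only if the IP passes the full round trip:
# reverse lookup -> Google domain -> forward lookup returns the same IP.
fcrdns_check() {
  local ip="$1" host
  host=$(dig +short -x "$ip" | head -n 1)
  [ -n "$host" ] || return 1                 # no PTR record at all
  is_google_crawler_host "$host" || return 1 # PTR points outside Google
  { dig +short "$host" A; dig +short "$host" AAAA; } | grep -Fxq "$ip"
}
```

Matching on the hostname suffix alone is not enough, because an attacker who controls the PTR record for their own IP can make it say anything; the forward lookup in the last line is what makes the check hard to forge.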

Step-by-Step: Find, Quantify, and Block

Step 1: Extract the unique claimed-Googlebot IPs.
Reduce the log to the distinct IPs you need to verify, so you run DNS once per IP rather than once per request.

grep -i "googlebot" access.log | awk '{print $1}' | sort -u > claimed_gbot_ips.txt
wc -l claimed_gbot_ips.txt

Expected Output:

214 claimed_gbot_ips.txt

Step 2: Verify each IP and split into real vs. fake.
Reuse the verify-bot.sh FCrDNS script from the verification guide. Loop over the unique IPs and partition them by exit status.

: > real_gbot.txt; : > fake_gbot.txt
while read -r ip; do
  if ./verify-bot.sh "$ip" >/dev/null 2>&1; then
    echo "$ip" >> real_gbot.txt
  else
    echo "$ip" >> fake_gbot.txt
  fi
done < claimed_gbot_ips.txt
echo "real=$(wc -l < real_gbot.txt) fake=$(wc -l < fake_gbot.txt)"

Expected Output:

real=9 fake=205

Nine IPs are genuine Googlebot; 205 are impostors. The user-agent claimed all 214 were Google.

Step 3: Quantify the fake traffic share.
Translate the IP split into request volume so you can size the problem. Count log lines attributable to fake IPs.

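# Total requests claiming to be Googlebot.
total=$(grep -ic "googlebot" access.log)
# Of those, requests whose client IP (field 1) exactly matches a fake IP.
fake=$(grep -i "googlebot" access.log | awk '{print $1}' \
       | grep -Fxf fake_gbot.txt | wc -l)
awk -v t="$total" -v f="$fake" 'BEGIN{ if (t == 0) { print "no Googlebot hits"; exit }
  printf "fake %d of %d Googlebot hits = %.1f%%\n", f, t, 100*f/t }'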

Expected Output:

fake 48217 of 51904 Googlebot hits = 92.9%

Ninety-three percent of "Googlebot" traffic is spoofed. That is the number to act on.
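
If you rerun this calculation regularly, it may be convenient to wrap it in a function that reads the log once. The following is a minimal sketch under the same assumptions as the commands above (client IP in field 1, one IP per line in the fake list); fake_share is a hypothetical name, not part of any existing tool.

```shell
#!/usr/bin/env bash
# fake_share LOG FAKE_LIST
# Hypothetical helper: prints the share of claimed-Googlebot requests
# that come from IPs listed in FAKE_LIST.
fake_share() {
  local log="$1" fake_list="$2"
  grep -i "googlebot" "$log" | awk -v list="$fake_list" '
    # Load the fake IPs into a lookup table before reading the log.
    BEGIN { while ((getline ip < list) > 0) fake[ip] = 1 }
    # Count every claimed-Googlebot line, and separately those from fake IPs.
    { total++; if ($1 in fake) hits++ }
    END {
      if (total == 0) { print "no Googlebot hits"; exit }
      printf "fake %d of %d Googlebot hits = %.1f%%\n", hits, total, 100 * hits / total
    }'
}
```

Called as fake_share access.log fake_gbot.txt, it prints a line in the same format as the Step 3 output, which makes day-over-day comparison straightforward.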

Step 4: Profile the fake bots' targets and request rate.
Before blocking, understand what they hit and how fast, so your rate limit is calibrated and you can spot attack patterns. List the top paths fake bots request:

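# Match on the client IP field only, so an IP that merely appears in a
# URL or referrer does not count as a hit.
grep -i "googlebot" access.log \
  | awk 'NR==FNR {fake[$1]; next} ($1 in fake) {print $7}' fake_gbot.txt - \
  | sort | uniq -c | sort -nr | head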

Expected Output:

  21044 /wp-login.php
  11890 /xmlrpc.php
   8002 /?s=
   3771 /.env

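And the per-minute request rate for the busiest impostor, chosen by hit count rather than by position in the list:

# Pick the fake IP with the most claimed-Googlebot requests.
top=$(grep -i "googlebot" access.log | awk '{print $1}' \
      | grep -Fxf fake_gbot.txt | sort | uniq -c | sort -nr | awk 'NR==1{print $2}')
# Bucket that IP's requests by minute and show the three busiest minutes.
grep -i "googlebot" access.log | awk -v ip="$top" '$1 == ip' \
  | awk -F'[:[]' '{print $2":"$3":"$4}' | uniq -c | sort -nr | head -3

Expected Output:

   742 19/Jun/2026:14:07
   711 19/Jun/2026:14:09
   698 19/Jun/2026:14:08

Roughly 700 requests per minute from one IP, far above any legitimate crawler, confirms this is abusive and safe to throttle hard.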
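
A single impostor is a useful spot check, but calibrating a rate limit is easier with the peak per-minute rate of every fake IP. The sketch below is one way to get that in a single pass; peak_rates is a hypothetical helper, and it assumes the combined log format, where field 4 holds the timestamp with a leading bracket.

```shell
#!/usr/bin/env bash
# peak_rates LOG FAKE_LIST
# Hypothetical helper: prints "<peak requests in any one minute> <ip>"
# for each fake IP, busiest first.
peak_rates() {
  local log="$1" fake_list="$2"
  grep -i "googlebot" "$log" | awk -v list="$fake_list" '
    BEGIN { while ((getline ip < list) > 0) fake[ip] = 1 }
    ($1 in fake) {
      # "[19/Jun/2026:14:07:33" -> "19/Jun/2026:14:07"
      minute = substr($4, 2, 17)
      n = ++count[$1 SUBSEP minute]
      if (n > peak[$1]) peak[$1] = n
    }
    END { for (ip in peak) print peak[ip], ip }' | sort -nr
}
```

IPs whose peak sits near or below a plausible crawler rate are the borderline cases where a rate limit is a safer response than a hard block.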

Step 5: Block or rate-limit in Nginx with a verified-crawler safety valve.
Build a deny map from the fake-IP list and apply a strict limit_req zone. Generate the map file from your verdicts:

{ echo 'geo $fake_gbot {'; echo '    default 0;'; \
  awk '{print "    "$1" 1;"}' fake_gbot.txt; echo '}'; } > /etc/nginx/conf.d/fake_gbot.conf
head -3 /etc/nginx/conf.d/fake_gbot.conf

Expected Output:

geo $fake_gbot {
    default 0;
    45.143.200.18 1;

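Then reference it in the server config. The rate-limit zone is keyed so that only flagged IPs are counted; everyone else is unaffected:

# http {} context
# Key is the client address for flagged IPs and empty otherwise;
# nginx does not rate-limit requests whose key is empty.
map $fake_gbot $limit_key {
    0 "";
    1 $binary_remote_addr;
}
limit_req_zone $limit_key zone=fakebots:10m rate=10r/m;

# server {} or location {} context
limit_req zone=fakebots burst=5 nodelay;
# Hard block: return runs in the rewrite phase, before limit_req, so flagged
# IPs get 403 immediately. Remove this line to throttle instead of block.
if ($fake_gbot) { return 403; }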

Validate and reload:

nginx -t && systemctl reload nginx

Expected Output:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Production Warning: Never block by user-agent alone, and never block an IP that passes FCrDNS. Genuine Googlebot must stay on the allow path or you risk deindexing. Build deny lists exclusively from IPs that failed forward-confirmed reverse DNS (the fake_gbot.txt set), refresh them on a schedule because attacker IPs rotate, and prefer rate-limiting over hard 403 for borderline cases so a misclassification degrades gracefully instead of dropping a real crawler.
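
One way to enforce that rule mechanically is to filter the deny list while generating the geo file. The sketch below is an assumption-laden variant of the one-liner from Step 5: build_geo_map is a hypothetical function, it handles IPv4 addresses only, and it drops malformed lines, duplicates, and any IP that also appears in the verified list.

```shell
#!/usr/bin/env bash
# build_geo_map FAKE_LIST REAL_LIST
# Hypothetical helper: prints an nginx geo block for the fake IPs,
# skipping anything that is not a plain IPv4 address, any duplicate,
# and any IP present in REAL_LIST (the verified-crawler safety valve).
build_geo_map() {
  local fake_list="$1" real_list="$2"
  echo 'geo $fake_gbot {'
  echo '    default 0;'
  awk -v real="$real_list" '
    BEGIN { while ((getline ip < real) > 0) verified[ip] = 1 }
    /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && !($1 in verified) && !seen[$1]++ {
      print "    " $1 " 1;"
    }' "$fake_list"
  echo '}'
}
```

Redirect its output to the same path used in Step 5, then run nginx -t before reloading so a bad entry never reaches the live config.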

Edge Cases

Traffic behind a CDN or proxy. If Nginx sits behind Cloudflare or a load balancer, $remote_addr is the proxy IP, not the client's. FCrDNS will fail for every request because you are resolving the proxy. Restore the true client IP from X-Forwarded-For / CF-Connecting-IP (via set_real_ip_from and real_ip_header) before extracting $1 from logs or building deny maps, otherwise you will mislabel real Googlebot as fake.

Cached or stale DNS verdicts. Attacker IPs rotate, and a previously-fake IP may later be reassigned to a legitimate network (or vice versa). Treat the fake_gbot.txt deny list as time-boxed: regenerate it from fresh logs daily and expire entries older than a few days rather than maintaining a permanent blocklist that accretes stale, possibly-reassigned addresses.
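
Expiry needs a record of when each IP was flagged, which fake_gbot.txt alone does not carry. A minimal sketch, assuming you keep a separate ledger file with one "<epoch_seconds> <ip>" pair per line (the ledger format and the expire_entries name are both hypothetical):

```shell
#!/usr/bin/env bash
# expire_entries LEDGER MAX_DAYS [NOW_EPOCH]
# Hypothetical helper: prints only the ledger lines flagged within the
# last MAX_DAYS days. NOW_EPOCH defaults to the current time.
expire_entries() {
  local ledger="$1" max_days="$2" now="${3:-$(date +%s)}"
  # Field 1 is the epoch second at which the IP was flagged.
  awk -v now="$now" -v max="$max_days" '(now - $1) <= max * 86400' "$ledger"
}
```

Piping the result through awk '{print $2}' yields a fresh IP list to feed into the deny-map generation, so entries age out unless the IP keeps failing verification in new logs.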

Verification

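Confirm the block is working: after reload, the fake IPs should receive 403 and disappear from successful crawler stats. If you throttle instead of hard-blocking, expect 503 for requests over the limit, since that is the default limit_req rejection status. Pick a known impostor and check its status codes in a log that starts after the reload (for example, the first log after rotation); older lines still carry the pre-block 401s and 404s: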

top=$(head -1 fake_gbot.txt)
grep -Fw "$top" access.log | awk '{print $9}' | sort | uniq -c

Expected Output:

   2014 403

The impostor now gets only 403. Re-run the fake-share calculation from Step 3 on tomorrow's log and confirm the percentage drops sharply, while real_gbot.txt IPs continue returning 200/304 untouched.
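
To check both halves of that statement at once, you can break status codes down by verdict. This is a sketch under the same log-format assumptions as the rest of the guide; status_by_verdict is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# status_by_verdict LOG REAL_LIST FAKE_LIST
# Hypothetical helper: prints "<verdict> <status> <count>" for every
# claimed-Googlebot request whose IP appears in either list.
status_by_verdict() {
  local log="$1" real_list="$2" fake_list="$3"
  grep -i "googlebot" "$log" | awk -v real="$real_list" -v fake="$fake_list" '
    BEGIN {
      while ((getline ip < real) > 0) verdict[ip] = "real"
      while ((getline ip < fake) > 0) verdict[ip] = "fake"
    }
    # Field 9 is the HTTP status in the combined log format.
    ($1 in verdict) { count[verdict[$1] " " $9]++ }
    END { for (k in count) print k, count[k] }' | sort
}
```

After a successful rollout, the fake rows should show only the blocking status, and the real rows should show only the 200 and 304 responses you saw before the change.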

Common Mistakes

  • Blocking by user-agent string. Adding if ($http_user_agent ~ Googlebot) { return 403; } blocks the real Googlebot too and can deindex your site. The user-agent is the thing being spoofed; never key access decisions on it. Key on FCrDNS verdict instead.
  • Forgetting the real client IP behind a CDN. Building deny maps from proxy IPs blocks all traffic or none and labels genuine crawlers as fake. Configure real_ip_header so $remote_addr is the actual client before you verify or block.
  • Treating the deny list as permanent. Spoofer IP pools rotate constantly. A static blocklist grows stale, misses new attackers, and may eventually block a reassigned legitimate host. Regenerate from fresh logs on a schedule and expire old entries.

Frequently Asked Questions

How much of my "Googlebot" traffic is typically fake?
It varies widely by site, but on exposed WordPress and e-commerce sites it is common for the majority of self-declared Googlebot requests to fail FCrDNS — often 80% or more, concentrated on login, XML-RPC, and .env probes. Always measure your own share with the Step 3 calculation rather than assuming; the number drives how aggressive your rate limit should be.

Can I just block by user-agent to stop fake Googlebot?
No. The user-agent is exactly what the impostors forge, so a user-agent rule blocks the real Googlebot alongside them and risks deindexing. The only safe basis for blocking is the forward-confirmed reverse DNS verdict: deny IPs that fail FCrDNS, and explicitly keep verified IPs on the allow path.

Will rate-limiting fake bots affect my crawl budget or rankings?
Not if you scope the limit to FCrDNS-failed IPs only. Real Googlebot passes verification and bypasses the limit, so its crawl access is unchanged. Removing spoofed load actually helps — it frees server capacity and de-noises the crawl-budget metrics you compute from logs, giving you a truer picture of genuine crawler behavior.

Part of the Identifying Search Engine Bots in Server Logs series.