Identifying Search Engine Bots in Server Logs

Your access logs do not tell you the truth about who is crawling your site. Any client can claim to be Googlebot by setting a single HTTP header, and much of the "Googlebot" traffic in a typical log can turn out to be spoofed: scrapers, vulnerability scanners, or competitors mining your content. This page shows you how to separate legitimate search engine crawlers from impostors, using user-agent pattern matching to find candidates and reverse plus forward DNS verification against the operator's own infrastructure to confirm identity.

The goal is a repeatable classification workflow that you can trust before you make any decision based on bot traffic, whether that decision is rate limiting, crawl budget reporting, or firewall blocking. We will isolate crawler requests from the raw log, build a verification script that confirms ownership through a reverse-then-forward DNS round-trip, map the official user-agent tokens to their crawlers and purposes, and harden the whole pipeline against the failure modes that silently corrupt bot reports. This guide sits inside the broader Crawl Budget Optimization & Bot Management discipline, and pairs naturally with the terminal techniques in our CLI One-Liners for Quick Audits guide.

Prerequisites

Before running the verification workflow, confirm the following are in place:

  • Read access to raw access logs (/var/log/nginx/access.log or /var/log/apache2/access.log), including rotated .gz archives.
  • The user-agent field is logged. The Nginx combined format and Apache combined format both capture it; a bare common format does not. Verify before you start.
  • The real client IP is logged, not your load balancer or CDN edge IP. If you sit behind Cloudflare or Fastly, you must log CF-Connecting-IP / X-Forwarded-For or every verification will fail.
  • dig or host installed (dnsutils / bind-utils package) for DNS lookups. awk, grep, and sort are assumed present.
  • Outbound DNS permitted from the host running the script. Reverse DNS verification needs to resolve PTR and A records against public resolvers.
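The tool requirements in this list can be checked up front. The following is a minimal sketch, not part of the original workflow; the helper name check_tools is hypothetical, and it relies only on the shell built-in command -v.

```shell
# check_tools: report any required command that is missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns non-zero
# if at least one tool is absent.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Text tools are required; either dig or host is enough for the DNS lookups.
check_tools awk grep sort || echo "install the missing tools before continuing"
check_tools dig >/dev/null || check_tools host >/dev/null || echo "install dnsutils or bind-utils"
```

This checks only that the binaries exist; it does not confirm that outbound DNS is permitted from the host.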

Establishing the Log Fields You Actually Need

Bot identification depends on exactly two fields: the client IP and the user-agent string. If either is wrong, every downstream conclusion is wrong. Start by confirming both are present and that the IP is the real visitor, not an intermediary.

Step 1: Confirm the User-Agent and Client IP Are Captured
Inspect a line and locate the fields. In the standard combined format the client IP is field $1 and the user-agent is the final quoted string.

head -n 1 /var/log/nginx/access.log

Expected Output: A line ending in a quoted user-agent, with a routable client IP at the start:

66.249.66.1 - - [19/Jun/2026:08:14:52 +0000] "GET /products/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
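To confirm both fields programmatically rather than by eye, one option is to split each line on the double-quote character, which makes the user-agent the sixth field in the combined format. This is a sketch under that format assumption; the function name extract_ip_ua is hypothetical.

```shell
# extract_ip_ua: print "client_ip<TAB>user_agent" for each combined-format line.
# Splitting on the double quote gives:
#   $1 = ip, ident, user, timestamp   $2 = request   $3 = status and bytes
#   $4 = referrer                     $5 = separator $6 = user-agent
# Lines with fewer quoted fields (for example the bare common format, which
# has no user-agent) produce no output.
extract_ip_ua() {
  awk -F'"' 'NF >= 7 { split($1, pre, " "); printf "%s\t%s\n", pre[1], $6 }' "$@"
}

# Example with the sample line shown above, read from standard input.
sample_line='66.249.66.1 - - [19/Jun/2026:08:14:52 +0000] "GET /products/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
printf '%s\n' "$sample_line" | extract_ip_ua
```

If this prints nothing for your log, the user-agent is probably not being captured and the log format needs fixing first. A request line that itself contains a literal double quote would shift the fields, so treat this as a quick check rather than a full parser.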

Step 2: Verify the IP Is the Real Client, Not Your CDN
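If field $1 shows a small set of repeating private or CDN-owned addresses, you are logging the proxy, not the visitor. Count the distinct leading IPs, then list the busiest ones, to detect this.

awk '{print $1}' /var/log/nginx/access.log | sort -u | wc -l
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head

Explanation: The first command prints the number of distinct client IPs; the second ranks the ten busiest with their request counts. A healthy public-facing log on a busy site shows thousands of distinct client IPs. If you see only a handful (for example 10.0.0.x or Cloudflare ranges), reconfigure the log format to record the forwarded client IP before proceeding.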

Production Warning: Never run rate limiting or firewall blocks off a log where $1 is the CDN edge. You will either block your own CDN (taking the site down) or block nobody. Fix the log format first.
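A quick way to spot the proxy-logging problem is to measure how many lines start with a private address. The sketch below assumes IPv4 and checks only the RFC 1918 and loopback ranges; the function name private_share is hypothetical, and CDN-owned public ranges would still need to be compared against the CDN's own published list.

```shell
# private_share: report how many log lines have a private or loopback IPv4
# address in the first field (10/8, 127/8, 192.168/16, 172.16/12).
private_share() {
  awk '
    { total++ }
    $1 ~ /^(10\.|127\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)/ { private++ }
    END {
      if (total) printf "%d of %d lines start with a private address\n", private, total
    }
  ' "$@"
}

# Example on inline data; pass a log path as an argument to scan a real file.
printf '%s\n' '10.0.0.5 - -' '66.249.66.1 - -' | private_share
```

A result close to 100 percent suggests that field $1 holds a load balancer or proxy address rather than the visitor.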

Building the Verification Pipeline

User-agent matching narrows the field; DNS verification proves identity. The pipeline runs in two numbered stages: segment candidate crawler requests with grep/awk, then verify each unique IP with a reverse-then-forward DNS script.

Step 1: Segment Candidate Crawler Requests
Pull every request whose user-agent claims to be a major crawler, then reduce to unique IPs. Case-insensitive matching (-iE) catches capitalization variants, and the same regex discipline from our awk and grep commands for log filtering reference applies here.

grep -iE 'googlebot|bingbot|googleother|yandex|duckduckbot|applebot' /var/log/nginx/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -nr > /tmp/candidate_bot_ips.txt
head /tmp/candidate_bot_ips.txt

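Explanation: Filters lines that contain a crawler token, isolates the client IP ($1), and ranks IPs by request volume. Because grep searches the whole line, a URL or referrer that contains a token also lands here; that is acceptable at this stage, since the output is your candidate set: claims, not confirmations.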
Expected Output: A ranked list of IPs that claim to be crawlers:

  4821 66.249.66.1
  1190 40.77.167.50
   903 66.249.66.4
    77 185.220.101.34

Production Warning: Do not act on this list yet. Every IP here is unverified. The last entry above is a Tor exit node masquerading as a crawler — the exact case verification exists to catch.
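If you would rather match the token against the user-agent field only, so that URLs and referrers containing a crawler name are ignored, a variant along these lines may help. It is a sketch that assumes the combined format and reuses the same token list; the function name candidate_ips is hypothetical.

```shell
# candidate_ips: rank client IPs whose user-agent field (field 6 when the
# line is split on double quotes) contains a crawler token. Matching is
# case-insensitive because the field is lowercased before the test.
candidate_ips() {
  awk -F'"' '
    tolower($6) ~ /googlebot|bingbot|googleother|yandex|duckduckbot|applebot/ {
      split($1, pre, " ")   # first whitespace-separated token is the client IP
      print pre[1]
    }
  ' "$@" | sort | uniq -c | sort -nr
}

# Usage (path as used elsewhere in this guide):
#   candidate_ips /var/log/nginx/access.log > /tmp/candidate_bot_ips.txt
```

The output has the same "count IP" shape as the grep pipeline, so it can feed the verification step unchanged.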

Step 2: Verify Each Candidate IP with Reverse + Forward DNS
The authoritative test, published by Google, Bing, and Yandex, is: run a reverse DNS (PTR) lookup on the IP, confirm the hostname ends in the operator's official domain, then run a forward DNS (A/AAAA) lookup on that hostname and confirm it resolves back to the original IP. Both directions must agree. Save this as verify_bots.sh:

#!/usr/bin/env bash
# verify_bots.sh: confirm crawler IPs via reverse + forward DNS
# Usage: awk '{print $2}' /tmp/candidate_bot_ips.txt | ./verify_bots.sh

# Official PTR suffixes for legitimate crawlers. The pattern is anchored so a
# look-alike such as googlebot.com.evil.example cannot match.
LEGIT='(^|\.)(googlebot\.com|google\.com|search\.msn\.com|crawl\.yandex\.(net|com|ru)|applebot\.apple\.com)$'

while read -r ip; do
  [ -z "$ip" ] && continue
  # Reverse lookup (PTR) with a 3s timeout so dead IPs cannot hang the run.
  # Keep the first PTR only and strip its trailing dot.
  host=$(dig +short +time=3 +tries=1 -x "$ip" | head -n1 | sed 's/\.$//')
  if [ -z "$host" ]; then
    echo "FAKE   $ip  (no PTR record)"
    continue
  fi
  if ! echo "$host" | grep -qE "$LEGIT"; then
    echo "FAKE   $ip  ($host is not an official crawler domain)"
    continue
  fi
  # Forward lookup: ask for A (IPv4) or AAAA (IPv6) to match the address
  # family, then require that one returned address equals the original IP.
  case "$ip" in *:*) rtype=AAAA ;; *) rtype=A ;; esac
  fwd=$(dig +short +time=3 +tries=1 "$host" "$rtype")
  if echo "$fwd" | grep -qxF "$ip"; then
    echo "VERIFIED $ip  ($host)"
  else
    echo "FAKE   $ip  ($host does not forward-resolve to $ip)"
  fi
done

Run it against the candidate IPs:

awk '{print $2}' /tmp/candidate_bot_ips.txt | bash verify_bots.sh

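Explanation: For each IP, the script does a PTR lookup, rejects hostnames that do not end in an official crawler domain, then forward-resolves the hostname and confirms it returns the original IP. Only IPs that pass both checks are labeled VERIFIED. DuckDuckBot is the exception: it is verified against a published IP list rather than a PTR domain, so this script reports its addresses as FAKE and they need a separate check against that list. The full reverse-DNS rationale is covered in depth in our guide to verifying Googlebot with reverse DNS lookup.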
Expected Output:

VERIFIED 66.249.66.1  (crawl-66-249-66-1.googlebot.com)
VERIFIED 40.77.167.50  (msnbot-40-77-167-50.search.msn.com)
VERIFIED 66.249.66.4  (crawl-66-249-66-4.googlebot.com)
FAKE   185.220.101.34  (no PTR record)
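The verdict lines are easy to post-process. As a sketch, the two hypothetical helpers below tally the verdicts and pull out just the verified addresses for later reporting; they assume the output format shown above, with the verdict in the first column and the IP in the second.

```shell
# summarize_verdicts: count VERIFIED and FAKE lines from verify_bots.sh output.
summarize_verdicts() {
  awk '
    $1 == "VERIFIED" || $1 == "FAKE" { count[$1]++ }
    END { printf "VERIFIED=%d FAKE=%d\n", count["VERIFIED"], count["FAKE"] }
  ' "$@"
}

# verified_ips: print only the IP column of VERIFIED lines, one per line.
verified_ips() {
  awk '$1 == "VERIFIED" { print $2 }' "$@"
}

# Usage, assuming the script output was saved to a file of your choosing:
#   summarize_verdicts /tmp/verdicts.txt
#   verified_ips /tmp/verdicts.txt > /tmp/verified_ips.txt
```

The file names in the usage comment are placeholders, not paths the original workflow creates.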

Production Warning: Reverse DNS verification is the operator-endorsed test, but it issues two live DNS queries per IP. Deduplicate to unique IPs first (already done above) and rate-limit large batches; firing thousands of dig calls in a tight loop can trip your resolver's query limits or get you throttled. Cache VERIFIED results for 24 hours rather than re-resolving every run.
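The 24-hour cache mentioned here can be as simple as a flat file of timestamped entries. The sketch below is one possible shape, not the article's implementation: the cache path, the line layout ("epoch_seconds ip hostname"), and the function names are all assumptions made for illustration.

```shell
# Hypothetical cache file: one "epoch_seconds ip hostname" line per verified IP.
CACHE_FILE="${CACHE_FILE:-/tmp/verified_bot_cache.txt}"
CACHE_TTL=86400   # 24 hours, in seconds

# cache_lookup IP: print the cached hostname and succeed if a fresh entry
# exists; fail (and print nothing) if there is no entry or it has expired.
cache_lookup() {
  [ -f "$CACHE_FILE" ] || return 1
  awk -v ip="$1" -v now="$(date +%s)" -v ttl="$CACHE_TTL" '
    $2 == ip && now - $1 < ttl { host = $3; found = 1 }
    END { if (found) print host; exit !found }
  ' "$CACHE_FILE"
}

# cache_store IP HOSTNAME: append a timestamped entry for a verified IP.
cache_store() {
  printf '%s %s %s\n' "$(date +%s)" "$1" "$2" >> "$CACHE_FILE"
}

# Intended use inside a verification loop:
#   if host=$(cache_lookup "$ip"); then echo "VERIFIED $ip  ($host, cached)"; fi
```

Only VERIFIED results are stored, so a transient DNS failure is never remembered as a negative verdict. The file grows without bound in this sketch; a real deployment would prune expired lines.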

The Verification Decision Flow

The diagram below is the exact logic encoded in verify_bots.sh. A user-agent claim is never enough on its own; only a clean round-trip through reverse and forward DNS yields a VERIFIED verdict.

[Figure: Bot verification decision flow. Flowchart: if the user-agent matches a crawler token, run reverse DNS, check the hostname domain, run forward DNS, and confirm it resolves back to the original IP before marking the request VERIFIED; any failure (no PTR, wrong domain, IP mismatch) marks it FAKE. A user-agent that matches no crawler token is treated as human or other traffic.]
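The same decision flow can be written as a pure function that takes the DNS answers as arguments, which makes the logic testable without touching the network. This is an illustrative sketch: the function name classify_request and its argument order are assumptions, and the token and domain patterns mirror the ones used elsewhere in this guide.

```shell
# classify_request UA IP PTR_HOST FORWARD_IPS
#   UA           user-agent string from the log line
#   IP           client IP from the log line
#   PTR_HOST     result of the reverse lookup ("" if none)
#   FORWARD_IPS  newline-separated result of the forward lookup
# Prints OTHER, VERIFIED, or FAKE with the reason.
classify_request() {
  ua=$1 ip=$2 ptr=$3 fwd=$4
  tokens='googlebot|bingbot|googleother|yandex|duckduckbot|applebot'
  # Anchored so the hostname must END in an official domain.
  official='(^|\.)(googlebot\.com|google\.com|search\.msn\.com|crawl\.yandex\.(net|com|ru)|applebot\.apple\.com)$'

  # 1. No crawler token: not a crawler claim at all.
  printf '%s\n' "$ua" | grep -qiE "$tokens" || { echo "OTHER"; return; }
  # 2. Reverse DNS must return a hostname.
  [ -n "$ptr" ] || { echo "FAKE (no PTR)"; return; }
  # 3. The hostname must sit on an official crawler domain.
  printf '%s\n' "$ptr" | grep -qE "$official" || { echo "FAKE (wrong domain)"; return; }
  # 4. Forward DNS must include the original IP as an exact line.
  printf '%s\n' "$fwd" | grep -qxF "$ip" || { echo "FAKE (IP mismatch)"; return; }
  echo "VERIFIED"
}

classify_request "Googlebot/2.1" 66.249.66.1 "crawl-66-249-66-1.googlebot.com" "66.249.66.1"
```

Keeping the DNS calls outside the function means each branch of the flow can be exercised with fixed inputs.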

Parsing Logic & Crawler Classification

Once you can verify identity, the user-agent token still matters: it tells you which crawler is visiting and why, which drives crawl budget decisions. The table below maps the official tokens you will encounter to their operator and purpose. Match on the token substring in the first column, then confirm with DNS.

User-Agent token                 | Crawler                        | Operator   | Purpose                                     | Verifies against
Googlebot                        | Googlebot (Smartphone/Desktop) | Google     | Primary web index crawl                     | *.googlebot.com
Googlebot-Image                  | Googlebot Image                | Google     | Google Images indexing                      | *.googlebot.com
Googlebot-News / Googlebot-Video | Googlebot News/Video           | Google     | Vertical indexing                           | *.googlebot.com
GoogleOther                      | GoogleOther                    | Google     | Non-search fetches (research, product teams) | *.googlebot.com
Google-InspectionTool            | Search Console inspector       | Google     | URL Inspection / Rich Results tests         | *.google.com
Storebot-Google                  | Google StoreBot                | Google     | Shopping / product crawl                    | *.googlebot.com
bingbot                          | Bingbot                        | Microsoft  | Bing web index crawl                        | *.search.msn.com
BingPreview                      | Bing Preview                   | Microsoft  | Page snapshot rendering                     | *.search.msn.com
YandexBot                        | YandexBot                      | Yandex     | Yandex web index crawl                      | *.crawl.yandex.net
YandexImages                     | Yandex Images                  | Yandex     | Yandex image index                          | *.crawl.yandex.net
DuckDuckBot                      | DuckDuckBot                    | DuckDuckGo | DuckDuckGo index                            | Published IP list
Applebot                         | Applebot                       | Apple      | Siri / Spotlight suggestions                | *.applebot.apple.com
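When you turn this table into code, order matters: the more specific tokens have to be tested before the general ones, otherwise Googlebot-Image would be counted as plain Googlebot. A minimal sketch, with the hypothetical function name crawler_token, covering only the tokens listed above:

```shell
# crawler_token UA: print the most specific crawler token found in a
# user-agent string, or "other" if none of the listed tokens is present.
# Matching is case-insensitive; specific tokens come before general ones.
crawler_token() {
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *googlebot-image*)       echo "Googlebot-Image" ;;
    *googlebot-news*)        echo "Googlebot-News" ;;
    *googlebot-video*)       echo "Googlebot-Video" ;;
    *storebot-google*)       echo "Storebot-Google" ;;
    *google-inspectiontool*) echo "Google-InspectionTool" ;;
    *googleother*)           echo "GoogleOther" ;;
    *googlebot*)             echo "Googlebot" ;;
    *bingpreview*)           echo "BingPreview" ;;
    *bingbot*)               echo "bingbot" ;;
    *yandeximages*)          echo "YandexImages" ;;
    *yandexbot*)             echo "YandexBot" ;;
    *duckduckbot*)           echo "DuckDuckBot" ;;
    *applebot*)              echo "Applebot" ;;
    *)                       echo "other" ;;
  esac
}

crawler_token "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

The label is still only a claim until the source IP passes verification.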

SEO callout — GoogleOther vs Googlebot. GoogleOther traffic does not feed the search index. If a large fraction of your "Google" crawl budget is GoogleOther hitting low-value paths, that is wasted server load you can deprioritize without harming rankings. Always split these tokens apart in reporting rather than collapsing them into one "Google" bucket.

SEO callout — pair tokens with status codes. Classifying the crawler is only half the picture; what the server returned to it determines indexing outcomes. Cross-reference verified crawler hits with response codes using our reference on understanding HTTP status codes in server logs — a verified Googlebot receiving a wall of 5xx or soft 404 responses is a crawl budget emergency.

SEO callout — verified-only reporting. Every crawl report you present to stakeholders should be built from the VERIFIED set, not the raw user-agent match. Reporting unverified counts inflates "Googlebot activity" with scraper noise and leads to wrong capacity and content decisions.
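The status-code and verified-only callouts combine naturally into one report: hits per status code, restricted to verified IPs. The sketch below assumes a file with one verified IP per line (a hypothetical artifact you would produce from the verification output) and the combined log format, where the status code is whitespace field 9.

```shell
# verified_report VERIFIED_IPS_FILE LOG_FILE
# Print "hits ip status" for every verified IP, ignoring all other clients.
verified_report() {
  awk '
    FILENAME == ARGV[1] { verified[$1] = 1; next }   # first file: verified IPs
    ($1 in verified)    { hits[$1 " " $9]++ }        # $9 = status code
    END { for (k in hits) print hits[k], k }
  ' "$1" "$2" | sort -k2,2 -k3,3
}

# Usage (the verified-IP path is a placeholder):
#   verified_report /tmp/verified_ips.txt /var/log/nginx/access.log
```

Requests from unverified addresses never enter the counts, so scraper traffic that merely claims a crawler user-agent cannot inflate the report.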

Validation & Troubleshooting

DNS-based verification is robust but has well-known failure modes. Each one below produces a specific wrong answer, so learn to recognize the symptom and apply the named fix.

Failure mode: spoofed user-agent passes the regex but fails DNS.
Symptom: verify_bots.sh labels a high-volume IP FAKE (no PTR record) or FAKE (... not an official crawler domain). This is the system working correctly — a scraper set User-Agent: Googlebot but cannot forge Google's reverse DNS.
Fix: Treat the regex output as candidates only and gate every action on the VERIFIED verdict. For a deeper treatment of distinguishing the impostors, see detecting fake Googlebot traffic in access logs.

Failure mode: CDN-masked client IP.
Symptom: Every candidate IP reverse-resolves to your CDN's domain (for example *.cloudflare.com), and nothing is ever VERIFIED.
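Fix: You are verifying the proxy, not the crawler. Re-extract the real client from the forwarded header instead of field $1. With a combined-plus format that appends "$http_x_forwarded_for" as a final quoted field, pull the first IP in that list. Split on the double quote rather than on whitespace: the header is a comma-and-space separated list, so the last whitespace field would be the final proxy hop, not the client. The leftmost entry is supplied by the client, so prefer a header your CDN sets itself (such as CF-Connecting-IP) when it is logged.

awk -F'"' '{split($(NF-1), xff, /, */); print xff[1]}' /var/log/nginx/access.log | sort -u | head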

Expected Output: Routable public client IPs (for example 66.249.66.1) rather than CDN edge addresses. Feed these into the verification script.

Failure mode: IPv6 crawler addresses.
Symptom: Verified crawlers are missing entirely, and the candidate list shows long colon-delimited addresses such as 2001:4860:4801:.... Google and others increasingly crawl over IPv6.
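Fix: The reverse/forward logic is the same for IPv6. dig -x builds the ip6.arpa PTR query for you, but the forward lookup must ask for AAAA records, because a plain dig on the hostname requests only A records and will never return the IPv6 address. Also confirm your regex and field extraction are not silently dropping colons. Test a single address directly: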

dig +short -x 2001:4860:4801:0000:0000:0000:0000:0066

Expected Output: A *.googlebot.com hostname, proving IPv6 verification works the same way; if you got nothing, your candidate extraction truncated the address.
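For the forward half of the check, the record type has to follow the address family. A small sketch of that choice, using the hypothetical helper name record_type:

```shell
# record_type IP: print the DNS record type the forward lookup should request.
# Any colon in the address means IPv6, which needs AAAA; otherwise use A.
record_type() {
  case "$1" in
    *:*) echo "AAAA" ;;
    *)   echo "A" ;;
  esac
}

# Intended use in a verification loop (not executed here):
#   dig +short +time=3 +tries=1 "$host" "$(record_type "$ip")"
record_type 66.249.66.1
record_type 2001:4860:4801::66
```

The final comparison is a string match, so the address in the log and the address returned by dig need to be in the same textual form; dig prints IPv6 addresses compressed, and an address logged fully expanded would need normalizing first.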

Failure mode: rDNS timeouts and resolver throttling.
Symptom: Long batch runs slow to a crawl or start returning empty host results that get mislabeled FAKE (no PTR record), even for IPs you know are real.
Fix: The +time=3 +tries=1 flags already cap each lookup, but on large sets also deduplicate IPs first (done in Step 1), cache VERIFIED results for 24 hours, and run a local caching resolver (unbound or dnsmasq) so repeated queries do not leave the host. Re-run any no PTR record IPs once before treating them as fake — transient resolver failures are not proof of spoofing.
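The "re-run once before treating as fake" advice can be wrapped in a small retry helper. This is a sketch with a hypothetical name, retry_nonempty; it treats empty output as a failed attempt and pauses one second between tries, which is an arbitrary choice.

```shell
# retry_nonempty ATTEMPTS COMMAND [ARGS...]
# Run COMMAND until it prints something, up to ATTEMPTS times.
# Prints the first non-empty output and succeeds; fails if every try is empty.
retry_nonempty() {
  attempts=$1; shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    out=$("$@" 2>/dev/null) || true
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    n=$((n + 1))
    # Pause only if another attempt remains.
    if [ "$n" -le "$attempts" ]; then sleep 1; fi
  done
  return 1
}

# Intended use for the PTR lookup (not executed here):
#   host=$(retry_nonempty 2 dig +short +time=3 +tries=1 -x "$ip")
```

An IP is then labeled "no PTR record" only after both attempts come back empty, which reduces false FAKE verdicts caused by a momentarily overloaded resolver.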

Common Mistakes

  • Trusting the user-agent string outright. The HTTP User-Agent header is attacker-controlled. Any client can set Googlebot. Treating a regex match as proof of identity is the single most common error — it inflates crawl stats and invites scrapers to bypass your protections by impersonating crawlers.
  • Skipping the forward-DNS step. Reverse DNS alone is forgeable on networks where an attacker controls the PTR record. The round-trip (PTR then A, resolving back to the same IP) is what makes verification authoritative. Stopping after the reverse lookup is a real-world bypass.
  • Verifying the CDN IP instead of the client. Behind Cloudflare or Fastly, field $1 is the edge, not the visitor. Verification will fail universally and you will conclude "no real crawlers," when in fact you are reading the wrong field. Log and parse the forwarded client IP.
  • Hardcoding only Google's domains. Bingbot verifies against search.msn.com and Yandex against crawl.yandex.net. A verification script that only knows googlebot.com will mislabel every legitimate Bing and Yandex crawl as fake.
  • Re-resolving every IP on every run. Firing live DNS queries for thousands of repeated IPs each cycle is slow and gets you throttled. Deduplicate and cache verified verdicts; identity does not change minute to minute.

Frequently Asked Questions

Why can't I just trust the Googlebot user-agent string in my logs?
Because the User-Agent header is set entirely by the client and can be any value. Scrapers, scanners, and competitors routinely send User-Agent: Googlebot to slip past rate limits and access crawler-only content. The only way to confirm a request truly came from Google is to verify the source IP with a reverse DNS lookup that resolves to a googlebot.com host, followed by a forward DNS lookup that resolves that host back to the same IP.

What is the difference between reverse DNS and forward DNS in bot verification?
Reverse DNS (a PTR lookup) takes the IP and returns a hostname — for a real Googlebot, something ending in .googlebot.com. Forward DNS (an A or AAAA lookup) takes that hostname and returns its IP. Verification requires both to agree: the hostname must be on an official domain and forward-resolve back to the original IP. One direction alone can be spoofed; the round-trip cannot, because the attacker does not control the operator's authoritative DNS.

Should I block traffic that fails verification?
Not blindly. A failed verification means "not a confirmed crawler," which includes legitimate human users, your own monitoring, and tools that happen to carry a crawler-like user-agent. Use the verified set to report and whitelist real crawlers, and apply blocking only to clients that both claim a crawler identity and fail DNS verification — those are unambiguous impostors. Always verify against the real client IP, never the CDN edge.

How often do search engines publish IP ranges, and should I use them instead of DNS?
Google, Bing, and others publish JSON lists of their crawler IP ranges, which you can match against as a faster first pass. They are authoritative but change over time, so you must refresh them on a schedule. Reverse/forward DNS verification needs no maintained list and self-updates as operators rotate IPs, which is why it remains the recommended primary method. Many teams combine both: a published-range check for speed, with DNS verification as the source of truth.
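A published-range first pass comes down to a CIDR membership test. The following IPv4-only sketch uses bash integer arithmetic; the function names are hypothetical, and the range in the example is illustrative rather than taken from any operator's current list, which you would load and refresh yourself.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# ip_in_cidr IP NETWORK/PREFIX: succeed if IP falls inside the IPv4 range.
ip_in_cidr() {
  net=${2%/*}; bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Illustrative range only; replace with entries from the published list.
if ip_in_cidr 66.249.66.1 66.249.64.0/19; then
  echo "in range: still confirm with DNS"
else
  echo "not in range"
fi
```

A range match is a fast pre-filter; under the combined approach described above, the DNS round-trip remains the final check.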

Part of the Crawl Budget Optimization & Bot Management series.