Identifying Search Engine Bots in Server Logs
Your access logs do not tell you the truth about who is crawling your site. Any client can claim to be Googlebot by setting a single HTTP header, and the "Googlebot" traffic in a typical log often includes spoofed scrapers, vulnerability scanners, and competitors mining your content. This page shows you how to separate legitimate search engine crawlers from impostors using both user-agent pattern matching and the operator-endorsed method that actually proves identity: reverse plus forward DNS verification against the operator's own infrastructure.
The goal is a repeatable classification workflow that you can trust before you make any decision based on bot traffic, whether that decision is rate limiting, crawl budget reporting, or firewall blocking. We will isolate crawler requests from the raw log, build a verification script that confirms ownership through DNS, map the official user-agent tokens to their crawlers and purposes, and harden the whole pipeline against the failure modes that silently corrupt bot reports. This guide sits inside the broader Crawl Budget Optimization & Bot Management discipline, and pairs naturally with the terminal techniques in our CLI One-Liners for Quick Audits guide.
Prerequisites
Before running the verification workflow, confirm the following are in place:
- Read access to raw access logs (/var/log/nginx/access.log or /var/log/apache2/access.log), including rotated .gz archives.
- The user-agent field is logged. The Nginx combined format and Apache combined format both capture it; a bare common format does not. Verify before you start.
- The real client IP is logged, not your load balancer or CDN edge IP. If you sit behind Cloudflare or Fastly, you must log CF-Connecting-IP / X-Forwarded-For or every verification will fail.
- dig or host installed (dnsutils / bind-utils package) for DNS lookups. awk, grep, and sort are assumed present.
- Outbound DNS permitted from the host running the script. Reverse DNS verification needs to resolve PTR and A records against public resolvers.
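The tool checks in this list can be run as a quick preflight. The sketch below is a minimal helper, assuming a POSIX shell; the function name missing_tools is hypothetical and not part of any package. It prints each named command that is not on PATH and prints nothing when all of them are present.

```shell
# Hypothetical preflight helper: print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (prints nothing when everything is installed):
#   missing_tools dig awk grep sort
```

An empty result only shows that the binaries exist; it does not confirm that outbound DNS is permitted or that the logs are readable.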
Establishing the Log Fields You Actually Need
Bot identification depends on exactly two fields: the client IP and the user-agent string. If either is wrong, every downstream conclusion is wrong. Start by confirming both are present and that the IP is the real visitor, not an intermediary.
Step 1: Confirm the User-Agent and Client IP Are Captured
Inspect a line and locate the fields. In the standard combined format the client IP is field $1 and the user-agent is the final quoted string.
head -n 1 /var/log/nginx/access.log
Expected Output: A line ending in a quoted user-agent, with a routable client IP at the start:
66.249.66.1 - - [19/Jun/2026:08:14:52 +0000] "GET /products/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Step 2: Verify the IP Is the Real Client, Not Your CDN
If field $1 shows a small set of repeating private or CDN-owned addresses, you are logging the proxy, not the visitor. Count distinct leading IPs to detect this.
awk '{print $1}' /var/log/nginx/access.log | sort -u | wc -l
awk '{print $1}' /var/log/nginx/access.log | sort -u | head
Explanation: The first command counts distinct client IPs and the second lists a sample of them. A healthy public-facing log shows thousands of distinct client IPs. If you see only a handful (for example 10.0.0.x or Cloudflare ranges), reconfigure the log format to record the forwarded client IP before proceeding.
Production Warning: Never run rate limiting or firewall blocks off a log where $1 is the CDN edge. You will either block your own CDN (taking the site down) or block nobody. Fix the log format first.
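The proxy check in this step can be condensed into one summary line. The sketch below is a rough illustration, assuming combined-format lines on stdin; ip_summary is a hypothetical name, and the private-range pattern covers only the RFC 1918 IPv4 blocks, not CDN-owned public ranges.

```shell
# Hypothetical helper: summarise field $1 of a combined-format log read from stdin.
ip_summary() {
  awk '
    { seen[$1]++; total++ }
    # RFC 1918 private IPv4 ranges: 10/8, 192.168/16, 172.16/12
    $1 ~ /^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)/ { private++ }
    END {
      n = 0
      for (ip in seen) n++
      printf "distinct=%d private_hits=%d total=%d\n", n, private, total
    }
  '
}

# Usage:
#   ip_summary < /var/log/nginx/access.log
```

A low distinct count, or a private_hits value close to total, suggests that field $1 holds the proxy address rather than the visitor.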
Building the Verification Pipeline
User-agent matching narrows the field; DNS verification proves identity. The pipeline runs in two numbered stages: segment candidate crawler requests with grep/awk, then verify each unique IP with a reverse-then-forward DNS script.
Step 1: Segment Candidate Crawler Requests
Pull every request whose user-agent claims to be a major crawler, then reduce to unique IPs. Case-insensitive matching (-iE) catches capitalization variants, and the same regex discipline from our awk and grep commands for log filtering reference applies here.
grep -iE 'googlebot|bingbot|googleother|yandex|duckduckbot|applebot' /var/log/nginx/access.log \
| awk '{print $1}' | sort | uniq -c | sort -nr > /tmp/candidate_bot_ips.txt
head /tmp/candidate_bot_ips.txt
Explanation: Filters lines that contain a crawler token, isolates the client IP ($1), and ranks IPs by request volume. Because grep matches anywhere on the line, a token that appears in the URL or referrer also counts. The output is your candidate set: claims, not confirmations.
Expected Output: A ranked list of IPs that claim to be crawlers:
4821 66.249.66.1
1190 40.77.167.50
903 66.249.66.4
77 185.220.101.34
Production Warning: Do not act on this list yet. Every IP here is unverified. The last entry above is a Tor exit node masquerading as a crawler — the exact case verification exists to catch.
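If tokens in URLs or referrers pollute the candidate list, the match can be restricted to the user-agent field. The sketch below assumes the standard combined format, where splitting on double quotes puts the user-agent in the sixth field; claimed_crawler_ips is a hypothetical name, and the token list is the one from the grep command above.

```shell
# Hypothetical helper: rank IPs whose user-agent FIELD (not the whole line)
# claims a crawler token. Reads combined-format lines from stdin.
claimed_crawler_ips() {
  awk -F'"' '
    # With a double quote as separator: $2 = request, $4 = referrer, $6 = user-agent
    tolower($6) ~ /googlebot|bingbot|googleother|yandex|duckduckbot|applebot/ {
      split($1, head, " ")   # head[1] is the client IP
      print head[1]
    }
  ' | sort | uniq -c | sort -nr
}

# Usage:
#   claimed_crawler_ips < /var/log/nginx/access.log > /tmp/candidate_bot_ips.txt
```

The result is still a list of claims; it only removes lines where the token appeared outside the user-agent.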
Step 2: Verify Each Candidate IP with Reverse + Forward DNS
The authoritative test, published by Google, Bing, and Yandex, is: run a reverse DNS (PTR) lookup on the IP, confirm the hostname ends in the operator's official domain, then run a forward DNS (A/AAAA) lookup on that hostname and confirm it resolves back to the original IP. Both directions must agree. Save this as verify_bots.sh:
#!/usr/bin/env bash
# verify_bots.sh: confirm crawler IPs via reverse + forward DNS
# Usage: awk '{print $2}' /tmp/candidate_bot_ips.txt | ./verify_bots.sh

# Official PTR suffixes for legitimate crawlers.
# Anchored so the hostname must END in one of them; an unanchored match
# would accept a host such as googlebot.com.attacker.example.
LEGIT='(^|\.)(googlebot\.com|google\.com|search\.msn\.com|crawl\.yandex\.(net|com|ru)|applebot\.apple\.com)$'

while read -r ip; do
  [ -z "$ip" ] && continue

  # Reverse lookup (PTR) with a 3s timeout so dead IPs cannot hang the run
  host=$(dig +short +time=3 +tries=1 -x "$ip" | head -n1 | sed 's/\.$//')
  if [ -z "$host" ]; then
    echo "FAKE $ip (no PTR record)"
    continue
  fi

  if ! echo "$host" | grep -qiE "$LEGIT"; then
    echo "FAKE $ip ($host is not an official crawler domain)"
    continue
  fi

  # Forward lookup: query A and AAAA so IPv6 crawlers verify too,
  # and accept the IP if it appears anywhere in the answer set
  fwd=$(dig +short +time=3 +tries=1 "$host" A; dig +short +time=3 +tries=1 "$host" AAAA)
  if echo "$fwd" | grep -qxF -- "$ip"; then
    echo "VERIFIED $ip ($host)"
  else
    echo "FAKE $ip ($host forward-resolves to ${fwd//$'\n'/, }, not $ip)"
  fi
done
Run it against the candidate IPs:
awk '{print $2}' /tmp/candidate_bot_ips.txt | bash verify_bots.sh
Explanation: For each IP, the script does a PTR lookup, rejects hostnames that do not end in an official crawler domain, then forward-resolves the hostname and confirms it returns the original IP. Only IPs that pass both checks are labeled VERIFIED. DuckDuckBot is verified against a published IP list rather than a DNS suffix, so this script labels its IPs FAKE; check those against the list instead. The full reverse-DNS rationale is covered in depth in our guide to verifying Googlebot with reverse DNS lookup.
Expected Output:
VERIFIED 66.249.66.1 (crawl-66-249-66-1.googlebot.com)
VERIFIED 40.77.167.50 (msnbot-40-77-167-50.search.msn.com)
VERIFIED 66.249.66.4 (crawl-66-249-66-4.googlebot.com)
FAKE 185.220.101.34 (no PTR record)
Production Warning: Reverse DNS verification is the operator-endorsed test, but it issues two live DNS queries per IP. Deduplicate to unique IPs first (already done above) and rate-limit large batches; firing thousands of dig calls in a tight loop can trip your resolver's query limits or get you throttled. Cache VERIFIED results for 24 hours rather than re-resolving every run.
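The 24-hour caching advice can be sketched with a flat file. This is a minimal illustration under stated assumptions: the cache path, the three-column "epoch ip verdict" format, and the function names cache_get and cache_put are all hypothetical, and nothing here handles concurrent writers or prunes old entries.

```shell
# Hypothetical flat-file verdict cache. Each line: "<epoch-seconds> <ip> <verdict>".
CACHE_FILE="${CACHE_FILE:-/tmp/bot_verdicts.cache}"   # assumed path
CACHE_TTL=86400                                       # 24 hours in seconds

# Print the most recent unexpired verdict for an IP; return 1 on a miss.
cache_get() {
  [ -f "$CACHE_FILE" ] || return 1
  awk -v ip="$1" -v now="$(date +%s)" -v ttl="$CACHE_TTL" '
    $2 == ip && now - $1 < ttl { verdict = $3 }
    END { if (verdict == "") exit 1; print verdict }
  ' "$CACHE_FILE"
}

# Append a verdict stamped with the current time.
cache_put() {
  printf '%s %s %s\n' "$(date +%s)" "$1" "$2" >> "$CACHE_FILE"
}

# Usage inside a verification loop (sketch):
#   cache_get "$ip" || { ...run the DNS checks...; cache_put "$ip" VERIFIED; }
```

Caching only VERIFIED verdicts, as the warning suggests, means a transient DNS failure is retried on the next run instead of being remembered as FAKE.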
The Verification Decision Flow
The diagram below is the exact logic encoded in verify_bots.sh. A user-agent claim is never enough on its own; only a clean round-trip through reverse and forward DNS yields a VERIFIED verdict.
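In text form, the same decision flow can be sketched as a pure function that takes lookup results as arguments and makes no DNS calls, which makes the logic easy to test in isolation. The name classify and its argument order are hypothetical; the suffix list is the one used by verify_bots.sh, anchored here so the hostname must end in an official domain.

```shell
# Official crawler suffixes, anchored to the end of the hostname.
LEGIT_SUFFIX='(^|\.)(googlebot\.com|google\.com|search\.msn\.com|crawl\.yandex\.(net|com|ru)|applebot\.apple\.com)$'

# Hypothetical helper: classify IP PTR_HOST FORWARD_IPS
# FORWARD_IPS is the newline-separated result of the forward lookup.
classify() {
  c_ip=$1; c_host=$2; c_fwd=$3
  if [ -z "$c_host" ]; then
    echo "FAKE (no PTR record)"
  elif ! printf '%s\n' "$c_host" | grep -qiE "$LEGIT_SUFFIX"; then
    echo "FAKE (not an official crawler domain)"
  elif printf '%s\n' "$c_fwd" | grep -qxF -- "$c_ip"; then
    echo "VERIFIED"
  else
    echo "FAKE (forward lookup does not return the IP)"
  fi
}

# Usage:
#   classify 66.249.66.1 crawl-66-249-66-1.googlebot.com 66.249.66.1
```

Each branch corresponds to one exit in the flow: a missing PTR, a hostname on the wrong domain, or a forward lookup that does not include the original IP.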
Parsing Logic & Crawler Classification
Once you can verify identity, the user-agent token still matters: it tells you which crawler is visiting and why, which drives crawl budget decisions. The table below maps the official tokens you will encounter to their operator and purpose. Match on the token substring in the first column, then confirm with DNS.
| User-Agent token | Crawler | Operator | Purpose | Verifies against |
|---|---|---|---|---|
| Googlebot | Googlebot (Smartphone/Desktop) | Google | Primary web index crawl | *.googlebot.com |
| Googlebot-Image | Googlebot Image | Google | Google Images indexing | *.googlebot.com |
| Googlebot-News / Googlebot-Video | Googlebot News/Video | Google | Vertical indexing | *.googlebot.com |
| GoogleOther | GoogleOther | Google | Non-search fetches (research, product teams) | *.googlebot.com |
| Google-InspectionTool | Search Console inspector | Google | URL Inspection / Rich Results tests | *.google.com |
| Storebot-Google | Google StoreBot | Google | Shopping / product crawl | *.googlebot.com |
| bingbot | Bingbot | Microsoft | Bing web index crawl | *.search.msn.com |
| BingPreview | Bing Preview | Microsoft | Page snapshot rendering | *.search.msn.com |
| YandexBot | YandexBot | Yandex | Yandex web index crawl | *.crawl.yandex.net |
| YandexImages | Yandex Images | Yandex | Yandex image index | *.crawl.yandex.net |
| DuckDuckBot | DuckDuckBot | DuckDuckGo | DuckDuckGo index | Published IP list |
| Applebot | Applebot | Apple | Siri / Spotlight suggestions | *.applebot.apple.com |
SEO callout — GoogleOther vs Googlebot. GoogleOther traffic does not feed the search index. If a large fraction of your "Google" crawl budget is GoogleOther hitting low-value paths, that is wasted server load you can deprioritize without harming rankings. Always split these tokens apart in reporting rather than collapsing them into one "Google" bucket.
SEO callout — pair tokens with status codes. Classifying the crawler is only half the picture; what the server returned to it determines indexing outcomes. Cross-reference verified crawler hits with response codes using our reference on understanding HTTP status codes in server logs — a verified Googlebot receiving a wall of 5xx or soft 404 responses is a crawl budget emergency.
SEO callout — verified-only reporting. Every crawl report you present to stakeholders should be built from the VERIFIED set, not the raw user-agent match. Reporting unverified counts inflates "Googlebot activity" with scraper noise and leads to wrong capacity and content decisions.
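The advice to split tokens and report only verified traffic can be combined in one pass. The sketch below is illustrative, assuming combined-format lines on stdin and a file with one verified IP per line; token_report is a hypothetical name, and the token branches cover only a few rows of the table.

```shell
# Hypothetical helper: count hits per crawler token, verified IPs only.
# Usage: token_report VERIFIED_IPS_FILE < access.log
token_report() {
  awk -F'"' -v ips="$1" '
    BEGIN { while ((getline line < ips) > 0) ok[line] = 1 }
    {
      split($1, head, " ")
      if (!(head[1] in ok)) next          # skip unverified clients
      ua = tolower($6)
      # Test the more specific tokens before the generic googlebot match.
      if      (ua ~ /googleother/)     t = "GoogleOther"
      else if (ua ~ /googlebot-image/) t = "Googlebot-Image"
      else if (ua ~ /googlebot/)       t = "Googlebot"
      else if (ua ~ /bingbot/)         t = "bingbot"
      else                             t = "other"
      n[t]++
    }
    END { for (t in n) print n[t], t }
  ' | LC_ALL=C sort -k2
}
```

The verified-IP file could be produced from the script output with something like `grep '^VERIFIED' | awk '{print $2}'`.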
Validation & Troubleshooting
DNS-based verification is robust but has well-known failure modes. Each one below produces a specific wrong answer, so learn to recognize the symptom and apply the named fix.
Failure mode: spoofed user-agent passes the regex but fails DNS.
Symptom: verify_bots.sh labels a high-volume IP FAKE (no PTR record) or FAKE (... not an official crawler domain). This is the system working correctly — a scraper set User-Agent: Googlebot but cannot forge Google's reverse DNS.
Fix: Treat the regex output as candidates only and gate every action on the VERIFIED verdict. For a deeper treatment of distinguishing the impostors, see detecting fake Googlebot traffic in access logs.
Failure mode: CDN-masked client IP.
Symptom: Every candidate IP reverse-resolves to your CDN's domain (for example *.cloudflare.com), and nothing is ever VERIFIED.
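Fix: You are verifying the proxy, not the crawler. Re-extract the real client from the forwarded header instead of field $1. With a combined-plus format that appends $http_x_forwarded_for, and assuming the field is logged in double quotes as the last item on the line, pull the first IP in that list:
awk -F'"' '{print $(NF-1)}' /var/log/nginx/access.log | awk -F',' '{print $1}' | sort -u | head
Expected Output: Routable public client IPs (for example 66.249.66.1) rather than CDN edge addresses. Feed these into the verification script. Because a client can send its own X-Forwarded-For value, trust the first entry only when your CDN overwrites or sanitizes that header; where the CDN supplies a dedicated header such as CF-Connecting-IP, log that instead.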
Failure mode: IPv6 crawler addresses.
Symptom: Verified crawlers are missing entirely, and the candidate list shows long colon-delimited addresses such as 2001:4860:4801:.... Google and others increasingly crawl over IPv6.
Fix: The reverse/forward logic is identical for IPv6: dig -x builds the ip6.arpa PTR query transparently, and the forward step must query AAAA rather than A. Confirm your regex and field extraction are not silently dropping or truncating colon-delimited addresses. Test a single address directly:
dig +short -x 2001:4860:4801:0000:0000:0000:0000:0066
Expected Output: For a genuine Googlebot address, a *.googlebot.com hostname, which shows that IPv6 verification works the same way. If an address copied from your candidate list returns nothing, check whether your extraction truncated it.
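One quick way to see whether IPv6 addresses survive extraction is to tally the candidate list by address family. The sketch below is a rough shape check only, not an address validator; ip_family is a hypothetical name.

```shell
# Hypothetical helper: rough address-family check by shape, not full validation.
ip_family() {
  case "$1" in
    *:*)     echo ipv6 ;;
    *.*.*.*) echo ipv4 ;;
    *)       echo unknown ;;
  esac
}

# Usage: tally the candidate list by family
#   awk '{print $2}' /tmp/candidate_bot_ips.txt \
#     | while read -r a; do ip_family "$a"; done | sort | uniq -c
```

A pile of unknown entries (for example a bare 2001) suggests the extraction split the address on its colons.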
Failure mode: rDNS timeouts and resolver throttling.
Symptom: Long batch runs slow to a crawl or start returning empty host results that get mislabeled FAKE (no PTR record), even for IPs you know are real.
Fix: The +time=3 +tries=1 flags already cap each lookup, but on large sets also deduplicate IPs first (done in Step 1), cache VERIFIED results for 24 hours, and run a local caching resolver (unbound or dnsmasq) so repeated queries do not leave the host. Re-run any no PTR record IPs once before treating them as fake — transient resolver failures are not proof of spoofing.
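The advice to re-run empty PTR results once can be sketched as a small retry wrapper. This is an illustration under stated assumptions: lookup_ptr and ptr_with_retry are hypothetical names, RETRY_DELAY is an assumed tunable, and the dig flags are the ones already used in verify_bots.sh.

```shell
RETRY_DELAY="${RETRY_DELAY:-1}"   # seconds to wait between attempts (assumed default)

# Single PTR lookup, same flags as verify_bots.sh.
lookup_ptr() {
  dig +short +time=3 +tries=1 -x "$1" | head -n1 | sed 's/\.$//'
}

# Hypothetical wrapper: ptr_with_retry IP [ATTEMPTS]
# Prints the hostname and returns 0, or returns 1 after all attempts come back empty.
ptr_with_retry() {
  r_ip=$1; r_left=${2:-2}
  while [ "$r_left" -gt 0 ]; do
    r_host=$(lookup_ptr "$r_ip")
    if [ -n "$r_host" ]; then
      printf '%s\n' "$r_host"
      return 0
    fi
    r_left=$((r_left - 1))
    [ "$r_left" -gt 0 ] && sleep "$RETRY_DELAY"
  done
  return 1
}
```

Two attempts matches the "once before treating them as fake" guidance; a higher count mostly adds latency for addresses that truly have no PTR record.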
Common Mistakes
- Trusting the user-agent string outright. The HTTP User-Agent header is attacker-controlled. Any client can set Googlebot. Treating a regex match as proof of identity is the single most common error: it inflates crawl stats and invites scrapers to bypass your protections by impersonating crawlers.
- Skipping the forward-DNS step. Reverse DNS alone is forgeable on networks where an attacker controls the PTR record. The round-trip (PTR then A, resolving back to the same IP) is what makes verification authoritative. Stopping after the reverse lookup is a real-world bypass.
- Verifying the CDN IP instead of the client. Behind Cloudflare or Fastly, field $1 is the edge, not the visitor. Verification will fail universally and you will conclude "no real crawlers," when in fact you are reading the wrong field. Log and parse the forwarded client IP.
- Hardcoding only Google's domains. Bingbot verifies against search.msn.com and Yandex against crawl.yandex.net. A verification script that only knows googlebot.com will mislabel every legitimate Bing and Yandex crawl as fake.
- Re-resolving every IP on every run. Firing live DNS queries for thousands of repeated IPs each cycle is slow and gets you throttled. Deduplicate and cache verified verdicts; identity does not change minute to minute.
Frequently Asked Questions
Why can't I just trust the Googlebot user-agent string in my logs?
Because the User-Agent header is set entirely by the client and can be any value. Scrapers, scanners, and competitors routinely send User-Agent: Googlebot to slip past rate limits and access crawler-only content. The only way to confirm a request truly came from Google is to verify the source IP with a reverse DNS lookup that resolves to a googlebot.com host, followed by a forward DNS lookup that resolves that host back to the same IP.
What is the difference between reverse DNS and forward DNS in bot verification?
Reverse DNS (a PTR lookup) takes the IP and returns a hostname — for a real Googlebot, something ending in .googlebot.com. Forward DNS (an A or AAAA lookup) takes that hostname and returns its IP. Verification requires both to agree: the hostname must be on an official domain and forward-resolve back to the original IP. One direction alone can be spoofed; the round-trip cannot, because the attacker does not control the operator's authoritative DNS.
Should I block traffic that fails verification?
Not blindly. A failed verification means "not a confirmed crawler," which includes legitimate human users, your own monitoring, and tools that happen to carry a crawler-like user-agent. Use the verified set to report and whitelist real crawlers, and apply blocking only to clients that both claim a crawler identity and fail DNS verification — those are unambiguous impostors. Always verify against the real client IP, never the CDN edge.
How often do search engines publish IP ranges, and should I use them instead of DNS?
Google, Bing, and others publish JSON lists of their crawler IP ranges, which you can match against as a faster first pass. They are authoritative but change over time, so you must refresh them on a schedule. Reverse/forward DNS verification needs no maintained list and self-updates as operators rotate IPs, which is why it remains the recommended primary method. Many teams combine both: a published-range check for speed, with DNS verification as the source of truth.
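A published-range first pass comes down to a CIDR membership test. The sketch below is a minimal IPv4-only illustration; ip_in_cidr is a hypothetical name, and the range in the usage line is a placeholder, so take real ranges from the operator's published list.

```shell
# Hypothetical helper: ip_in_cidr IP CIDR  (IPv4 only)
# Returns 0 when IP falls inside CIDR, 1 otherwise.
ip_in_cidr() {
  awk -v ip="$1" -v cidr="$2" '
    # Convert a dotted quad to a single integer.
    function to_int(a,   p) {
      split(a, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    BEGIN {
      split(cidr, c, "/")
      size = 2 ^ (32 - c[2])   # number of addresses in the block
      # Two addresses share a block when they fall in the same size-aligned bucket.
      exit !(int(to_int(ip) / size) == int(to_int(c[1]) / size))
    }'
}

# Usage (placeholder range, not taken from any published list):
#   ip_in_cidr 66.249.66.1 66.249.64.0/19 && echo "in range"
```

An address that misses every published range can still be passed to the DNS check, which remains the source of truth.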
Related Guides
- Verifying Googlebot with Reverse DNS Lookup — the full reverse/forward DNS procedure, step by step.
- Detecting Fake Googlebot Traffic in Access Logs — isolating and acting on spoofed crawler traffic.
- awk and grep Commands for Log Filtering — the pattern-matching foundation for segmenting crawler requests.
- CLI One-Liners for Quick Audits — fast terminal diagnostics for crawl and error triage.
- Understanding HTTP Status Codes in Server Logs — pair verified crawler hits with the responses they received.
Part of the Crawl Budget Optimization & Bot Management series.