Extracting the Top Bot User Agents from Logs

Knowing which bots consume the most requests is the fastest way to spot crawl-budget drain and abusive scrapers. But the User-Agent string is the hardest field to extract cleanly from a CLI: it contains spaces, so it spans multiple awk fields, and naive print $12 returns only a fragment. This guide shows the correct quote-aware extraction, then a sed/awk classifier that buckets thousands of distinct UA strings into a handful of named crawlers so you can see, at a glance, who is hammering your server. It sits inside the broader CLI One-Liners for Quick Audits cluster.

By the end you will extract the full User-Agent reliably regardless of internal spaces, rank raw UA strings by request volume, collapse them into buckets (Googlebot, Bingbot, AhrefsBot, GPTBot, and more), and flag aggressive non-search crawlers eating capacity. Everything runs against a standard combined-format access log.

Diagnosis: Why $12 Returns Garbage

In the combined log format the User-Agent is the final quoted field. A sample line:

17.241.75.10 - - [18/Jun/2026:09:22:10 +0000] "GET /pricing HTTP/1.1" 200 4821 "https://example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Because awk's default field separator is whitespace, Mozilla/5.0, (compatible;, and Googlebot/2.1; each land in a different numbered field. Printing $12 gives you only the first token:

awk '{print $12}' access.log | head -3

Expected Output (broken):

"Mozilla/5.0
"Mozilla/5.0
"Mozilla/5.0

That is useless: nearly every browser, and most bots, start with Mozilla/5.0. You must capture the whole quoted string instead of a single field. The same quoting hazard underlies many of the patterns in awk and grep commands for log filtering.

Concept: Treat the Quote as the Delimiter

The robust fix is to change awk's field separator to the double-quote character. A combined log line has three quoted segments: the request line, the referrer, and the user-agent. Split on ", the user-agent becomes the 6th field ($6): field 1 is everything before the first quote, 2 is the request line, 3 is the gap (status and bytes), 4 is the referrer, 5 is the gap, and 6 is the user-agent. This is far more reliable than counting whitespace-delimited fields, where the UA spreads across however many fields its internal spaces create.
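To see the split concretely, the sketch below dumps every quote-delimited field of the sample line with its index. The variable names line and fields are illustrative only.

```shell
# Sample combined-format line (the same one shown in the Diagnosis section).
line='17.241.75.10 - - [18/Jun/2026:09:22:10 +0000] "GET /pricing HTTP/1.1" 200 4821 "https://example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

# Split on the double quote and print each field with its index.
fields=$(printf '%s\n' "$line" |
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }')

printf '%s\n' "$fields"
```

Field 6 holds the whole user-agent, and the closing quote leaves an empty seventh field, so a well-formed combined line reports NF as 7.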

Step-by-Step: Extract and Rank the User-Agent

Step 1: Extract the full User-Agent with -F'"'
Use the double-quote as the field separator and print the sixth field.

awk -F'"' '{print $6}' access.log | head -3

Explanation: -F'"' splits each line on double-quotes, so $6 is the complete user-agent string including its internal spaces — no fragmentation.
Expected Output:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)

Step 2: Rank raw User-Agent strings by request volume
Count and sort the extracted strings to see the heaviest consumers.

awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: Collapses identical UA strings into counts and ranks them descending. The top of this list is where your request volume actually goes.
Expected Output:

  48210 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  31044 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
  12880 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
   9002 GPTBot/1.0 (+https://openai.com/gptbot)

Step 3: Restrict to bots only (skip real browser noise)
Human browser strings often outnumber bots and bury them. Filter to lines whose UA looks like a crawler before ranking.

awk -F'"' 'tolower($6) ~ /bot|crawler|spider|slurp/ {print $6}' access.log | sort | uniq -c | sort -nr | head -20

Explanation: tolower($6) normalizes case, and the regex keeps only user-agents advertising themselves as automated. This drops the bulk of human traffic and surfaces the crawler ranking directly.
Expected Output: The same shape as Step 2, but human browser entries are removed.

Step-by-Step: Bucket UA Strings into Named Crawlers

Raw strings carry version numbers and URLs that fragment the count — Googlebot/2.1 and Googlebot/2.2 rank separately even though both are Googlebot. A classifier collapses every variant into one named bucket.

Step 4: Build a sed/awk classifier
Map each UA to a canonical bucket name, then count the buckets.

awk -F'"' '{print $6}' access.log | \
sed -E '
  s/.*[Gg]ooglebot.*/Googlebot/;
  s/.*[Bb]ingbot.*/Bingbot/;
  s/.*AhrefsBot.*/AhrefsBot/;
  s/.*SemrushBot.*/SemrushBot/;
  s/.*GPTBot.*/GPTBot/;
  s/.*ClaudeBot.*/ClaudeBot/;
  s/.*[Bb]ytespider.*/Bytespider/;
  s/.*[Yy]andex.*/YandexBot/;
  t done
  s/.*[Bb]ot.*|.*[Cc]rawler.*|.*[Ss]pider.*/Other-Bot/;
  t done
  s/.*/Human-or-Unknown/
  :done
' | sort | uniq -c | sort -nr

Explanation: Each s/// rule rewrites any matching UA to a single bucket label. The first t done fires if any of the named rules above it made a substitution, and jumps past the catch-all rules so they cannot re-rewrite the label. Anything not matched as a named bot falls through to Other-Bot (if it self-identifies) or Human-or-Unknown. The final sort | uniq -c | sort -nr gives a clean per-crawler ranking.
Expected Output:

  91254 Human-or-Unknown
  48210 Googlebot
  31044 AhrefsBot
  12880 Bingbot
   9002 GPTBot
   4120 Other-Bot
   2204 SemrushBot

The classification logic — match, assign, stop — is shown below.

Figure: User-Agent classifier decision flow. A raw User-Agent (awk -F'"' $6) is tested against the named bot rules in order; the first match assigns a bucket and stops (t done). Otherwise, a UA that self-identifies as a bot becomes Other-Bot, and everything else falls through to Human-or-Unknown.
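The same first-match-wins logic can also be sketched as a single awk program, where a chain of early returns makes the stop-on-first-match behavior explicit. This is an alternative sketch rather than the sed pipeline from Step 4: it matches case-insensitively, covers only a few buckets, and the function name classify and the four sample log lines are hypothetical.

```shell
# Build a tiny sample log (hypothetical lines, for illustration only).
log=$(mktemp)
cat > "$log" <<'EOF'
1.1.1.1 - - [18/Jun/2026:09:22:10 +0000] "GET / HTTP/1.1" 200 10 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2.2.2.2 - - [18/Jun/2026:09:22:11 +0000] "GET /a HTTP/1.1" 200 10 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
3.3.3.3 - - [18/Jun/2026:09:22:12 +0000] "GET /b HTTP/1.1" 200 10 "-" "ExampleCrawler/1.0"
4.4.4.4 - - [18/Jun/2026:09:22:13 +0000] "GET /c HTTP/1.1" 200 10 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"
EOF

buckets=$(awk -F'"' '
  # Return the bucket for one UA; the first matching test wins.
  function classify(ua,   l) {
    l = tolower(ua)
    if (l ~ /googlebot/) return "Googlebot"
    if (l ~ /bingbot/)   return "Bingbot"
    if (l ~ /ahrefsbot/) return "AhrefsBot"
    if (l ~ /gptbot/)    return "GPTBot"
    if (l ~ /bot|crawler|spider/) return "Other-Bot"
    return "Human-or-Unknown"
  }
  { count[classify($6)]++ }
  END { for (b in count) printf "%d %s\n", count[b], b }
' "$log" | sort -k1,1nr -k2,2)

printf '%s\n' "$buckets"
rm -f "$log"
```

Keeping the counts in an awk array avoids the sort | uniq -c pass over every line; only the handful of bucket totals is sorted at the end.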

Step 5: Flag aggressive non-search crawlers
Search engines (Googlebot, Bingbot) earn their crawl budget. SEO tools and AI scrapers (AhrefsBot, SemrushBot, GPTBot, Bytespider) often do not — and some hit harder than the search engines you actually want. Isolate the non-search buckets and rank them.

awk -F'"' 'tolower($6) ~ /ahrefs|semrush|mj12|dotbot|bytespider|gptbot|ccbot/ {print $6}' access.log | \
sed -E 's/.*[Aa]hrefs.*/AhrefsBot/; s/.*[Ss]emrush.*/SemrushBot/; s/.*GPTBot.*/GPTBot/; s/.*[Bb]ytespider.*/Bytespider/; s/.*CCBot.*/CCBot/; s/.*MJ12.*/MJ12bot/; s/.*[Dd]ot[Bb]ot.*/DotBot/' | \
sort | uniq -c | sort -nr

Explanation: The awk filter keeps only lines whose UA contains a known non-search crawler token, the sed collapses each to a bucket, and the closing sort | uniq -c | sort -nr ranks them. A single tool appearing near the top of your overall volume is a strong candidate for a robots.txt Crawl-delay (for crawlers that honor the directive) or an outright block.
Expected Output:

  31044 AhrefsBot
   9002 GPTBot
   2204 SemrushBot
    880 Bytespider

Production Warning: Before blocking anything, confirm the bucket's share of total requests and check the status codes those bots receive. Blocking a crawler that drives referral or backlink value, or rate-limiting one that respects Crawl-delay, can do more harm than the traffic it removes. Measure first, then act.
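One way to take that measurement in a single pass is sketched below: it reports each bucket's share of all requests and the status codes that bucket received. It assumes a standard combined log, where splitting on the quote leaves the status code as the first token of $3. The two-bot list and the sample lines are hypothetical.

```shell
# Tiny sample log (hypothetical lines, for illustration only).
log=$(mktemp)
cat > "$log" <<'EOF'
1.1.1.1 - - [18/Jun/2026:09:22:10 +0000] "GET / HTTP/1.1" 200 10 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
1.1.1.1 - - [18/Jun/2026:09:22:11 +0000] "GET /old HTTP/1.1" 404 10 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
2.2.2.2 - - [18/Jun/2026:09:22:12 +0000] "GET /a HTTP/1.1" 200 10 "-" "GPTBot/1.0 (+https://openai.com/gptbot)"
3.3.3.3 - - [18/Jun/2026:09:22:13 +0000] "GET /b HTTP/1.1" 200 10 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"
EOF

report=$(awk -F'"' '
  {
    total++                                   # every request counts toward the denominator
    ua = tolower($6)
    if (ua ~ /ahrefsbot/)   { b = "AhrefsBot" }
    else if (ua ~ /gptbot/) { b = "GPTBot" }
    else                    { next }          # not one of the buckets under review
    split($3, gap, " ")                       # gap[1] = status, gap[2] = bytes
    hits[b]++
    codes[b " " gap[1]]++
  }
  END {
    for (b in hits)  printf "%s share=%.1f%%\n", b, 100 * hits[b] / total
    for (k in codes) printf "%s count=%d\n", k, codes[k]   # k is "bucket status"
  }
' "$log" | sort)

printf '%s\n' "$report"
rm -f "$log"
```

A bucket with a large share that mostly receives 404 or 5xx responses is wasting capacity in a different way from one that fetches live pages, which may change whether throttling or fixing the URLs is the better response.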

Edge Cases

Edge Case: non-standard log format shifts $6. If your Nginx log_format reorders or adds quoted fields (for example an X-Forwarded-For or host field wrapped in quotes), the user-agent may not be $6. Confirm with awk -F'"' '{print NF": "$6}' access.log | head. A standard combined line reports NF as 7 (six quotes, plus the empty field after the closing one); if NF differs or $6 is not the UA, count the quotes and adjust the field index.
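To check quote counts across the whole file rather than the first few lines, NF can be tallied per line. The sketch below uses hypothetical sample lines, the third carrying an extra quoted field at the end.

```shell
# Tiny sample log (hypothetical lines, for illustration only).
log=$(mktemp)
cat > "$log" <<'EOF'
1.1.1.1 - - [18/Jun/2026:09:22:10 +0000] "GET / HTTP/1.1" 200 10 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2.2.2.2 - - [18/Jun/2026:09:22:11 +0000] "GET /a HTTP/1.1" 200 10 "-" "GPTBot/1.0 (+https://openai.com/gptbot)"
3.3.3.3 - - [18/Jun/2026:09:22:12 +0000] "GET /b HTTP/1.1" 200 10 "-" "curl/8.5.0" "10.0.0.1"
EOF

# Tally how many quote-delimited fields each line has.
nf_counts=$(awk -F'"' '{ n[NF]++ } END { for (k in n) printf "NF=%d lines=%d\n", k, n[k] }' "$log" | sort)

printf '%s\n' "$nf_counts"
rm -f "$log"
```

Any NF value other than the dominant one marks lines worth inspecting before trusting the field index.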

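Edge Case: empty or "-" user-agents. Many crawlers and probes send - or an empty UA. These land in Human-or-Unknown and can be large. To track them separately, add the rule s/^-?$/Empty-UA/ to the named-rule block, above the first t done, so the branch carries the label past the fall-through; placed after that t done it would need its own t done, or the final s/.*/Human-or-Unknown/ would overwrite it. A spike of empty UAs often signals scripted abuse worth correlating against your top 404 URLs.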
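A sketch of where such a rule can sit in a cut-down version of the Step 4 classifier. The pattern ^-?$ is used so that both a literal - and a truly empty UA are caught, and the rule is placed ahead of the first t done so the branch protects its label from the catch-alls. The three input strings are illustrative.

```shell
# Three UA strings as they would come out of awk -F'"' '{print $6}':
# a literal dash, an empty string, and a real crawler.
out=$(printf '%s\n' '-' '' 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' |
sed -E '
  s/^-?$/Empty-UA/
  s/.*[Gg]ooglebot.*/Googlebot/
  t done
  s/.*[Bb]ot.*|.*[Cc]rawler.*|.*[Ss]pider.*/Other-Bot/
  t done
  s/.*/Human-or-Unknown/
  :done
' | sort | uniq -c | sort -nr)

printf '%s\n' "$out"
```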

Verification: Sanity-Check the Buckets

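Confirm the bucket totals add up to your total line count. Because awk and sed each emit exactly one line per input line, the two numbers should always agree; a mismatch means lines were dropped or split somewhere in the pipeline (for example by a stray d command or sed -n), or the log grew between the two commands. Note that this check does not catch misclassification: a double-firing rule or a wrong field index moves lines between buckets without changing the total, so spot-check a few known UA strings as well.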

TOTAL=$(wc -l < access.log)
BUCKETS=$(awk -F'"' '{print $6}' access.log | sed -E -e 's/.*[Gg]ooglebot.*/Googlebot/' -e 's/.*[Bb]ingbot.*/Bingbot/' -e t -e 's/.*/Other/' | sort | uniq -c | awk '{s+=$1} END{print s}')
echo "lines=$TOTAL bucketed=$BUCKETS"

Expected Output: lines=200916 bucketed=200916. The two numbers must match exactly; a mismatch means lines were dropped or split in the pipeline, or the log changed between the two commands.

For verifying that "Googlebot" and "Bingbot" buckets are genuinely the search engines they claim — not spoofers inflating your counts — pair this audit with identifying search engine bots in server logs and, for the specific impersonation case, detecting fake Googlebot traffic in access logs.

Common Mistakes

  • Printing $12 for the user-agent: Whitespace-delimited fields fragment the UA on internal spaces, so $12 is only "Mozilla/5.0. Always split on the quote with -F'"' and take $6.
  • Letting sed rules cascade: Without a t (branch-on-substitution) between the named rules and the catch-alls, a line already rewritten to Googlebot is re-rewritten by a later catch-all, corrupting counts. Branch to the end as soon as a named rule has matched.
  • Ranking raw strings instead of buckets: Version numbers in the UA split one crawler across many entries, understating its true share. Classify into named buckets before you decide who to throttle.
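The cascade mistake is easy to reproduce on a single string. The sketch below runs the same two rules with and without the t branch; the variable names are illustrative.

```shell
ua='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# Without a branch, the catch-all overwrites the label set by the first rule.
without_t=$(printf '%s\n' "$ua" |
  sed -E -e 's/.*[Gg]ooglebot.*/Googlebot/' -e 's/.*/Human-or-Unknown/')

# With t, sed skips the rest of the script once a substitution has succeeded.
with_t=$(printf '%s\n' "$ua" |
  sed -E -e 's/.*[Gg]ooglebot.*/Googlebot/' -e t -e 's/.*/Human-or-Unknown/')

printf 'without t: %s\nwith t:    %s\n' "$without_t" "$with_t"
```

The first pipeline labels the Googlebot line Human-or-Unknown; the second keeps Googlebot.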

Frequently Asked Questions

Why is the User-Agent field $6 when I split on the quote, but $12 when I split on spaces?
Splitting on whitespace puts every space-separated token of the UA in its own numbered field, so the count is unstable. Splitting on the double-quote (-F'"') treats each quoted segment as one field; in a combined log the user-agent is the sixth quote-delimited field, which is stable regardless of how many spaces the UA contains.

Can I trust the user-agent to identify a bot?
No — the UA string is self-reported and trivially spoofed. Use it to rank and bucket traffic, but verify any bot you intend to block or trust (especially "Googlebot") with reverse-and-forward DNS before acting, as covered in the bot-identification guides.

How do I decide which non-search crawler to block?
Rank the non-search buckets (Step 5), check each one's share of total requests and the status codes it receives, then weigh the traffic cost against any referral or backlink value. Prefer a robots.txt Crawl-delay for cooperative bots and reserve hard blocks for crawlers that ignore it or send empty user-agents.
