Querying Crawl Data with LogQL

Once nginx access logs are flowing into Grafana Loki, the index gives you almost nothing on its own — the analysis happens in LogQL. This page is a recipe book for crawl analysis on Loki-stored logs: how to select the right streams, filter lines for a specific bot, parse status and url out of the raw line at query time, and then turn matched lines into the metrics SEO teams actually report on — Googlebot request rate, status-code distribution, the most-crawled URLs, and crawl volume per hour. It assumes you already have a working aggregator from the parent Grafana Loki for SEO log aggregation setup, with bot, status, and method as low-cardinality labels and path/remote_addr kept out of the index.

The single rule that shapes every query below: parse at query time, not as labels. The URL is the highest-value SEO dimension and the highest-cardinality field; it must never become a label. LogQL lets you recover it from the log line with | json, | logfmt, or | pattern exactly when a query needs it, with zero cardinality cost. Get comfortable with that boundary and the rest is composition.

Anatomy of a LogQL Query

Every LogQL query has two stages. First, a stream selector in {} matches label sets against the index — this is the only part that touches the index, so it must be cheap and specific. Second, an optional pipeline of line filters, parsers, and label-formatters runs against the raw lines of the matched streams. Wrap the whole thing in a metric function (rate, count_over_time, sum, topk) and a log query becomes a metric query you can chart.

{job="nginx", bot="Googlebot"} |= "GET" | json | status="404"

Explanation: {job="nginx", bot="Googlebot"} is the index-time selector; |= "GET" is a fast substring line filter; | json parses the JSON line into label-like fields; | status="404" is a label filter that keeps only lines whose status is 404. The selector narrows millions of lines to one bot's streams before any line-level work happens, which is what keeps the query fast.

Diagnosis — confirm your labels first. Before writing recipes, verify which labels exist, or every selector will silently match nothing.

logcli --addr=http://localhost:3100 labels
logcli --addr=http://localhost:3100 series '{job="nginx"}'

Expected Output: the first lists label names (bot, host, job, method, status); the second lists concrete streams like {bot="Googlebot", host="web-01", job="nginx", method="GET", status="200"}. If bot is missing, fix the Promtail pipeline before continuing.

Selecting Streams and Filtering Lines

Start every crawl query by narrowing to the smallest set of streams that can contain your answer, then refine on the line. Label matchers support = (exact), != (not), =~ (regex match), and !~ (regex not-match).
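As a rough mental model of the four matcher operators, the Python sketch below applies each one to a few hypothetical label sets. It assumes that regex matchers in a stream selector are anchored to the whole label value and that a missing label behaves like an empty string; the helper name matches and the sample streams are illustrative and not part of Loki.

```python
import re

def matches(labels, name, op, value):
    """Model one stream-selector matcher against a label set."""
    actual = labels.get(name, "")  # assumption: a missing label acts like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # anchored on both ends
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")

# Hypothetical streams, for illustration only.
streams = [
    {"job": "nginx", "bot": "Googlebot"},
    {"job": "nginx", "bot": "Googlebot-Image"},
    {"job": "nginx", "bot": "bingbot"},
]

selected = [s["bot"] for s in streams if matches(s, "bot", "=~", "Googlebot|bingbot")]
print(selected)  # ['Googlebot', 'bingbot']
```

Under that anchoring assumption, bot=~"Googlebot" does not pick up a Googlebot-Image stream; you would need bot=~"Googlebot.*" to include it.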

Step 1: Select a single bot's streams.
Use an exact match on the bot label to hit only Googlebot streams in the index.

{job="nginx", bot="Googlebot"}

Expected Output: In Grafana's Explore Logs view, a stream of raw Googlebot access-log lines, each annotated with its label set. This is the cheapest possible crawl selector because bot is a label.

Step 2: Match several crawlers with a regex matcher.
To compare crawlers, widen the selector with =~ rather than running three queries.

{job="nginx", bot=~"Googlebot|bingbot|YandexBot"}

Expected Output: interleaved lines from all three crawlers. The Grafana log panel shows a per-stream color key so you can eyeball volume differences before quantifying them.

Step 3: Add a line filter for a specific signature.
When a value is in the line but not a label — such as the literal Googlebot user-agent token, or a URL prefix — use |= (contains), != (not contains), |~ (regex), or !~. Line filters run in sequence and short-circuit, so put the most selective one first.

{job="nginx"} |= "Googlebot" |= "/blog/" != "Googlebot-Image"

Explanation: This finds desktop/smartphone Googlebot hits on /blog/ URLs while excluding the image crawler, all without any of those values being labels. |= is a raw substring scan over chunk contents — fast, but only run after the {} selector has already shrunk the candidate set.
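The filter chain above can be read as a short-circuiting boolean test on each line. The Python sketch below models that reading on four hypothetical, shortened log lines; it is a mental model of the filter logic, not how Loki scans chunks.

```python
def keep(line):
    """Model of: |= "Googlebot" |= "/blog/" != "Googlebot-Image"."""
    # "and" short-circuits, so a line that fails an early test skips the later ones.
    return "Googlebot" in line and "/blog/" in line and "Googlebot-Image" not in line

# Hypothetical, shortened access-log lines.
lines = [
    '"GET /blog/post-1 HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '"GET /blog/hero.png HTTP/1.1" 200 "Googlebot-Image/1.0"',
    '"GET /pricing HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '"GET /blog/post-1 HTTP/1.1" 200 "Mozilla/5.0 (compatible; bingbot/2.0)"',
]

kept = [line for line in lines if keep(line)]
print(len(kept))  # 1
```

Because the test is a substring match over the whole line, a /blog/ that appears only in the referrer would also pass; parse the URL when that distinction matters.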

Parsing status and url at Query Time

Line filters find lines; parsers turn a line into fields you can group and aggregate by. LogQL ships four parsers: three you will use constantly, plus regexp as a fallback for irregular lines. Pick by log format.

Parser       Best for
| json       JSON access logs
| logfmt     key=value logs
| pattern    fixed combined format
| regexp     irregular/legacy lines

Step 1: Parse JSON logs into status and url.
If nginx emits structured JSON logging, the json parser is the cleanest path. Map only the JSON keys you need to keep the pipeline fast.

{job="nginx", bot="Googlebot"} | json status="status", url="request_uri"

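Expected Output: each matched line now carries the query-time field url. Because status is already an indexed label in this setup, Loki keeps that label as status and exposes the parsed copy as status_extracted. In Explore, expand a log line to see the parsed fields listed under Fields, distinct from the indexed Labels.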

Step 2: Parse the combined text format with pattern.
For default nginx combined logs, the pattern parser is faster and far more readable than regex. Each <name> captures a field; <_> skips a section.

{job="nginx", bot="Googlebot"}
  | pattern `<_> - - <_> "<method> <url> <_>" <status> <size> <_>`

Expected Output: url and size become available as parsed fields; method and status, which already exist as labels in this setup, surface as method_extracted and status_extracted. Verify by adding | status="404": the line count should drop to only the 404 responses.
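To see which part of a combined line each placeholder consumes, the sketch below translates the pattern expression into a Python regex and runs it on one hypothetical log line. This is only a rough model under the assumption that each capture takes the shortest text before the next literal; Loki's pattern parser is not necessarily implemented this way, and the helper names are invented for illustration.

```python
import re

CAPTURE = re.compile(r"<(\w+)>")

def pattern_to_regex(pattern):
    """Translate a pattern expression into a Python regex (rough model only)."""
    parts, pos = [], 0
    for m in CAPTURE.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text between captures
        name = m.group(1)
        # <_> skips a section; <name> captures it as a named field.
        parts.append(".*?" if name == "_" else f"(?P<{name}>.*?)")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + "$")

def parse(line, pattern):
    m = pattern_to_regex(pattern).match(line)
    return m.groupdict() if m else {}

PATTERN = '<_> - - <_> "<method> <url> <_>" <status> <size> <_>'
# Hypothetical combined-format line.
line = ('66.249.66.1 - - [12/Mar/2024:06:25:24 +0000] "GET /blog/post-1 HTTP/1.1" '
        '404 153 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

print(parse(line, PATTERN))
# {'method': 'GET', 'url': '/blog/post-1', 'status': '404', 'size': '153'}
```

A line that lacks the literal " - - " separator yields no fields in this model, which is a useful reminder to check one raw line before trusting a pattern.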

Step 3: Reshape fields with label_format.
label_format rewrites or derives query-time fields, which is invaluable for bucketing. Here we collapse a parsed status into a status class (2xx, 3xx, 4xx, 5xx) for cleaner aggregation.

{job="nginx", bot="Googlebot"} | json status="status"
  | label_format class=`{{ printf "%.1sxx" .status }}`

Explanation: printf "%.1sxx" takes the first character of status and appends xx, so 404 becomes 4xx. Grouping on class instead of raw status gives a four-bucket chart, ideal when you only care about the response-class mix, not each individual code documented in understanding HTTP status codes in server logs.
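The same truncate-and-append trick can be tried outside Loki. Python's %-formatting honours the same .1 precision on %s, so the sketch below reproduces the bucketing on a hypothetical list of status values; it illustrates the format string only and does not call Loki.

```python
from collections import Counter

def status_class(status):
    """Keep the first character of the status and append "xx", like printf "%.1sxx"."""
    return "%.1sxx" % status

# Hypothetical parsed status values.
statuses = ["200", "200", "301", "404", "200", "503", "404"]

mix = Counter(status_class(s) for s in statuses)
print(dict(mix))  # {'2xx': 3, '3xx': 1, '4xx': 2, '5xx': 1}
```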

Cardinality caution: these parsers and label_format operate purely at query time — the fields they create are never written to Loki's index, so there is no cardinality penalty regardless of how many distinct URLs flow through. This is precisely why URL analysis is safe in LogQL but catastrophic as a label. Always recover high-cardinality fields here, in the pipeline, not in the Promtail labels stage.
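A quick back-of-the-envelope calculation shows why the boundary matters. The sketch below multiplies distinct values per label to get an upper bound on stream count; every number in it is a hypothetical assumption, and the real count is the number of label combinations actually observed, which is lower than the bound but still at least one stream per distinct URL.

```python
from math import prod

def max_streams(distinct_values):
    """Upper bound on stream count: the product of distinct values per label."""
    return prod(distinct_values.values())

# Hypothetical distinct-value counts for the low-cardinality labels.
labels = {"job": 1, "host": 3, "bot": 5, "method": 4, "status": 10}
print(max_streams(labels))  # 600

# The same label set after (wrongly) promoting url to a label.
labels_with_url = {**labels, "url": 200_000}
print(max_streams(labels_with_url))  # 120000000
```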

Turning Lines into Crawl Metrics

A log query becomes a metric query when you wrap it in a range function over a [duration]. These four cover the bulk of crawl reporting.

Recipe 1: Googlebot request rate with rate().
rate() returns per-second request frequency averaged over the range — the canonical "is Googlebot crawling harder than usual?" panel.

rate({job="nginx", bot="Googlebot"}[5m])

Expected Grafana panel: a single time-series line in a Time series panel showing Googlebot requests per second, each point smoothed over a trailing 5-minute window. A sudden step up often coincides with a sitemap submission or a content launch; a step down can signal a crawl-budget throttle. The same signal derived with shell tools appears in measuring crawl rate by hour from server logs.
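The arithmetic behind each point is simple: the number of matching lines in the window divided by the window length in seconds. The sketch below works two hypothetical cases to show how a per-second rate relates to a raw count.

```python
def rate_per_second(line_count, range_seconds):
    """Per-second rate: lines in the window divided by the window length."""
    return line_count / range_seconds

# 1000 hits in a one-hour window read as roughly 0.28 requests per second.
print(round(rate_per_second(1000, 3600), 2))  # 0.28

# 1500 hits in a [5m] window are 5 requests per second.
print(rate_per_second(1500, 300))  # 5.0
```

Multiply a rate by its range in seconds to recover the count, or use count_over_time directly when totals are what you need.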

Recipe 2: Status distribution with sum by (status).
Aggregate counts over a window and group by the parsed (or labeled) status to get the status-code mix.

sum by (status) (
  count_over_time({job="nginx", bot="Googlebot"}[1h])
)

Expected Grafana panel: in a Bar gauge or stacked Time series, one series per status code — overwhelmingly 200, a thin 301/304 band, and a small 404 slice. Because status is already a label here, no parser is needed and the query is index-fast. A widening 404 band is your earliest crawl-budget-waste signal.

Recipe 3: Most-crawled URLs with topk().
This is the flagship query: rank the URLs Googlebot hits most, with url recovered at query time so it never bloated the index.

topk(10,
  sum by (url) (
    count_over_time(
      {job="nginx", bot="Googlebot"} | json url="request_uri" [6h]
    )
  )
)

Expected Grafana panel: a Table of the ten most-crawled URLs over six hours with hit counts, highest first. Run it as an instant query so topk ranks the whole window once rather than once per step. If low-value URLs (faceted-search parameters, paginated archives) dominate the top ten, that is crawl budget being spent in the wrong place. Add status="404" to the selector to rank the dead URLs Googlebot crawls most; swapping it in for bot="Googlebot" would instead rank 404s from every client.
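In plain terms, the nested query counts lines per URL and keeps the k largest counts. The sketch below shows that logic on a hypothetical list of parsed URLs; it models the aggregation only, not Loki's execution.

```python
from collections import Counter

def top_urls(urls, k):
    """Model of topk(k, sum by (url) (count_over_time(...)))."""
    return Counter(urls).most_common(k)

# Hypothetical url values parsed from Googlebot lines in the window.
urls = ["/blog/a", "/search?color=red", "/blog/a", "/", "/search?color=red", "/blog/a"]

print(top_urls(urls, 2))  # [('/blog/a', 3), ('/search?color=red', 2)]
```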

Recipe 4: Crawl volume per hour with count_over_time().
For an hourly crawl histogram, use count_over_time with a [1h] range and set the panel's query interval to 1h as well, so each bar covers one non-overlapping hour.

sum(count_over_time({job="nginx", bot="Googlebot"}[1h]))

Expected Grafana panel: a Bar chart with one bar per hour showing total Googlebot requests. Set the panel's interval to 1h and the range to 24h or 7d to expose diurnal crawl patterns — most crawlers ramp during your low-traffic hours, and a flat-then-zero line usually means an ingestion gap, not a crawl pause.
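What each bar represents can be modeled as calendar-hour bucketing. The sketch below truncates hypothetical ISO 8601 timestamps to the hour and counts them; it approximates what the panel shows when the step equals the range, and it is not Loki's evaluation logic.

```python
from collections import Counter
from datetime import datetime

def hits_per_hour(timestamps):
    """Count timestamps per calendar hour, keyed by the start of the hour."""
    buckets = Counter(
        datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0).isoformat()
        for ts in timestamps
    )
    return dict(sorted(buckets.items()))

# Hypothetical request times for Googlebot hits.
timestamps = [
    "2024-03-12T06:05:00+00:00",
    "2024-03-12T06:40:10+00:00",
    "2024-03-12T07:15:30+00:00",
]

print(hits_per_hour(timestamps))
# {'2024-03-12T06:00:00+00:00': 2, '2024-03-12T07:00:00+00:00': 1}
```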

Edge Cases and Gotchas

Gotcha 1: empty results from a parser type mismatch.
Running | json against a combined text line (or | logfmt against JSON) parses nothing, so every downstream field filter silently matches zero lines and the panel reads "No data." Confirm the format first by inspecting one raw line in Explore, then choose json, logfmt, or pattern to match. When in doubt, add | __error__ != "" to surface parse errors instead of hiding them.
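The mismatch is easy to reproduce outside Loki. The sketch below feeds a hypothetical combined-format line and a hypothetical JSON line to a JSON parser and records a failure under an __error__ key, mirroring how a failed parse is tagged rather than dropped. The value JSONParserErr is the error name used here on the assumption that it matches what your Loki version reports.

```python
import json

def parse_json_stage(line):
    """Model of | json: return parsed fields, or an __error__ marker on failure."""
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return {"__error__": "JSONParserErr"}  # assumed error label value
    if not isinstance(fields, dict):
        return {"__error__": "JSONParserErr"}
    return {key: str(value) for key, value in fields.items()}

# Hypothetical lines in the two formats.
combined = '66.249.66.1 - - [12/Mar/2024:06:25:24 +0000] "GET /blog/post-1 HTTP/1.1" 404 153'
structured = '{"status": 404, "request_uri": "/blog/post-1"}'

print(parse_json_stage(combined))    # {'__error__': 'JSONParserErr'}
print(parse_json_stage(structured))  # {'status': '404', 'request_uri': '/blog/post-1'}
```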

Gotcha 2: regex line filters scan every chunk.
A query like {job="nginx"} |~ ".*Googlebot.*" with only the broad job label in the selector forces Loki to decompress and scan every nginx chunk in the range, which is slow and expensive. Always pin the bot (or host/status) label in {} first; reserve |~ for refining within an already-narrow stream set. Prefer the exact-match |= over regex |~ whenever the token is literal; here |= "Googlebot" matches the same lines, because line-filter regexes are unanchored and the surrounding .* adds nothing.

Verification

Confirm a full recipe end-to-end from the command line before trusting a dashboard panel. This counts Googlebot 404s in the last hour using the indexed bot and status labels, so no parser is needed.

logcli --addr=http://localhost:3100 instant-query \
  'sum(count_over_time({job="nginx", bot="Googlebot", status="404"}[1h]))'

Expected Output: a single scalar, e.g. {} 37, matching the height of the 404 slice in your status-distribution panel for the same window. If the CLI scalar and the panel disagree, the panel's time range or $__interval is misaligned, not the query.
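If you would rather script the check than run logcli, the same instant query can be sent to Loki's HTTP API at /loki/api/v1/query. The sketch below only builds the request URL and reads the value out of a response; it makes no network call, the sample payload is hypothetical, and the helper names are invented for illustration.

```python
from urllib.parse import urlencode

LOKI_URL = "http://localhost:3100"  # same address as the logcli example

def instant_query_url(query, base=LOKI_URL):
    """Build the URL for an instant query against Loki's HTTP API."""
    return f"{base}/loki/api/v1/query?{urlencode({'query': query})}"

def first_value(response):
    """Read the first sample value from a vector result, or 0.0 if it is empty."""
    result = response["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

query = 'sum(count_over_time({job="nginx", bot="Googlebot", status="404"}[1h]))'
print(instant_query_url(query))

# Hypothetical response body shaped like a vector result.
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1710224724, "37"]}]},
}
print(first_value(sample))  # 37.0
```

An empty result list means no matching lines in the window, which this sketch reports as 0.0 rather than raising.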

Common Mistakes

  • Parsing before selecting: Writing {job="nginx"} | json | bot="Googlebot" filters on a parsed field instead of the bot label, forcing a full chunk scan. Fix: put bot="Googlebot" inside {} so the index does the filtering first, then parse.
  • Grouping by a high-cardinality field as a label: Some teams "fix" slow sum by (url) queries by promoting url to a Promtail label. That detonates the index. Fix: keep url parsed at query time and accept that topk/sum by (url) scans chunks — that is the correct, index-safe trade.
  • Using rate() for raw counts: rate() returns per-second averages, not totals, so a "1000 crawls/hour" expectation reads as 0.28. Fix: use count_over_time (and wrap in sum) when you want absolute crawl volume, and reserve rate() for frequency.

Frequently Asked Questions

Should I use | json, | logfmt, or | pattern to extract the URL?
Match the parser to your log format. Use | json for JSON access logs, | logfmt for key=value lines, and | pattern (or | regexp for irregular lines) for the default nginx combined text format. The pattern parser is the fastest and most readable choice for fixed-position combined logs; json is cleanest when you control the nginx log format and emit structured JSON.

Why not just make the URL a label so topk is faster?
Because every distinct URL would become its own Loki stream, and a site with hundreds of thousands of crawled URLs would create hundreds of thousands of streams, ballooning the index and OOM-killing the ingester. Parsing url at query time keeps the index tiny; topk/sum by (url) scanning chunks is the intended, sustainable trade-off. This is the same cardinality discipline weighed across stacks in the ELK vs Vector.dev vs CloudWatch comparison.

How do I confirm a request is really Googlebot and not a spoofer in these queries?
LogQL filters by the user-agent token or your bot label, neither of which proves identity — user agents are trivially forged. For trustworthy crawl metrics, verify the client by reverse DNS before labeling, as covered in identifying search engine bots in server logs, then trust the bot label your pipeline assigns.

Part of the Grafana Loki for SEO Log Aggregation series.