Parsing JSON Access Logs with jq

Once your access logs are structured JSON — one object per line — "parsing" stops meaning regex and starts meaning field selection. jq is the right tool for this: it reads newline-delimited JSON (NDJSON) a line at a time, references fields by name, and composes filters into pipelines. This page is a practical cookbook. Each recipe is a runnable one-liner against a JSON access log, with the output it produces, so you can adapt rather than invent.

The recipes cover the questions crawl analysis actually asks: filter by status or user-agent, find the top URLs, count requests by status class, rank crawler IPs, bucket hits by hour, and stream files too large to hold in memory. They assume logs in the schema from structured JSON logging for analysis (status, request_uri, http_user_agent, time_iso8601, and friends); adjust key names to match yours. Throughout, we contrast jq (which addresses fields by name) with awk (which addresses them by position) so you know when to reach for which.

Concept: jq Reads JSON, awk Reads Columns

awk splits a line on whitespace and gives you $1, $2, $9. That is fast and perfect for fixed-position combined logs, but it breaks the instant a field contains a space or the column order changes. jq parses each line as JSON and lets you name the field — .status, .http_user_agent — so order and embedded spaces never matter. The trade-off: jq does real JSON parsing per line, so it is slower than awk on raw throughput. The rule of thumb below holds for most crawl work.

Task | Reach for | Why
Fixed-position combined/common log | awk | No JSON to parse; column indices are stable and fast
JSON / NDJSON access log | jq | Fields by name survive reordering and embedded spaces
Field contains spaces or quotes | jq | awk mis-splits; JSON values are delimited, not positional
Maximum throughput on huge plain-text files | awk | Lower per-line cost than full JSON parsing
Typed numeric comparisons (status >= 500) | jq | Numbers stay numbers; no string-to-int casts

For the positional counterpart to these recipes, see awk and grep commands for log filtering. The two toolchains answer the same questions on different log shapes.
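
The positional failure mode is easy to reproduce. The sketch below uses two made-up log lines (hypothetical sample data, not taken from a real log) that describe the same request, whose path contains a space. It assumes jq and awk are on the PATH.

```shell
# Hypothetical sample lines, for illustration only.
combined='203.0.113.10 - - [19/Jun/2026:08:14:02 +0000] "GET /a b HTTP/1.1" 502 0 "-" "Mozilla/5.0"'
json='{"remote_addr":"203.0.113.10","request_uri":"/a b","status":502,"http_user_agent":"Mozilla/5.0"}'

# awk: the space inside the request path shifts every later column by one,
# so $9 is no longer the status code.
awk_status=$(printf '%s\n' "$combined" | awk '{print $9}')

# jq: the field is addressed by name, so the embedded space is irrelevant.
jq_status=$(printf '%s\n' "$json" | jq -r '.status')

printf 'awk $9:     %s\n' "$awk_status"
printf 'jq .status: %s\n' "$jq_status"
```

On the combined line, awk returns the protocol token instead of the status; on the JSON line, jq returns 502.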

Recipe 1: Select by Status and by User-Agent

The atom of every other recipe is select(), which keeps only objects matching a predicate. Filter to server errors:

jq -c 'select(.status >= 500)' access.json.log | head -n2

Expected Output:

{"time_iso8601":"2026-06-19T08:14:02+00:00","remote_addr":"203.0.113.10","request_uri":"/checkout","status":502,"http_user_agent":"Mozilla/5.0"}
{"time_iso8601":"2026-06-19T08:14:09+00:00","remote_addr":"198.51.100.7","request_uri":"/api/cart","status":500,"http_user_agent":"Mozilla/5.0"}

Because status is a JSON number, >= 500 needs no cast. In awk, by contrast, $9 is text, and the comparison is numeric only because awk treats fields that look like numbers as numbers. Filter by user-agent with a regex via test(), and combine predicates with and:

jq -c 'select((.http_user_agent | test("Googlebot")) and .status == 404)' access.json.log | head -n2

Expected Output:

{"time_iso8601":"2026-06-19T08:20:11+00:00","remote_addr":"66.249.66.1","request_uri":"/old-page","status":404,"http_user_agent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}

Recipe 2: Top URLs with group_by and length

To rank the most-requested URLs, slurp the stream into an array with -s, group by the URI, and sort the groups by size. group_by needs the whole array in memory, so this recipe is for files that fit; the streaming approach is covered in Recipe 6.

jq -rs 'group_by(.request_uri)
        | map({uri: .[0].request_uri, hits: length})
        | sort_by(-.hits)[:5][]
        | "\(.hits)\t\(.uri)"' access.json.log

Expected Output:

18342	/
9120	/blog/
4455	/pricing/
2980	/search?q=shoes
2011	/products/widget

group_by(.request_uri) clusters identical URIs into sub-arrays, length counts each one, and sort_by(-.hits) orders them descending. The awk equivalent uses an associative array (a[$7]++) and is faster on plain text, but cannot address request_uri by name. To zero in on which 404 URLs waste the most crawl budget, the awk-based companion recipe is finding the top 404 URLs with awk.
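
For logs that do not fit in memory, the same ranking can be produced without -s by letting the shell count. This is a minimal sketch against a hypothetical six-record sample written to a temp file; on a real log, replace "$log" with the file path.

```shell
# Hypothetical six-record sample, for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
{"request_uri":"/","status":200}
{"request_uri":"/blog/","status":200}
{"request_uri":"/","status":200}
{"request_uri":"/pricing/","status":200}
{"request_uri":"/blog/","status":200}
{"request_uri":"/","status":304}
EOF

# jq emits one URI per record; sort and uniq -c do the counting,
# so jq never holds more than one object at a time.
top=$(jq -r '.request_uri' "$log" | sort | uniq -c | sort -rn | head -n 2)
printf '%s\n' "$top"
rm -f "$log"
```

The trade-off is that the counting moves to sort, which can spill to temporary files on disk, rather than to a jq array that must fit in RAM.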

Recipe 3: Count by Status Class

Crawl health is often read at the level of status class (2xx, 3xx, 4xx, 5xx) rather than individual codes. Derive the class by integer-dividing by 100, then tally. This streams line by line — no slurp — so it scales:

jq -r '(.status / 100 | floor | tostring) + "xx"' access.json.log \
  | sort | uniq -c | sort -rn

Expected Output:

  220184 2xx
   31022 3xx
    8901 4xx
     642 5xx

A rising 4xx or 5xx share for crawler traffic is direct crawl-budget waste. To interpret which codes within a class matter, see understanding HTTP status codes in server logs. For a pure-jq tally without the shell uniq, jq -rs 'group_by(.status/100|floor) | map("\(.[0].status/100|floor)xx \(length)")[]' does the same, but the streaming form above is friendlier on large files.
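
A pure-jq tally can also stream if the counting is folded into a reduce over inputs instead of group_by. The sketch below assumes the same status field and uses a hypothetical five-record sample; -S is added only so the keys print in a stable order.

```shell
# Hypothetical five-record sample, for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
{"request_uri":"/","status":200}
{"request_uri":"/old-page","status":404}
{"request_uri":"/blog/","status":301}
{"request_uri":"/","status":204}
{"request_uri":"/api/cart","status":500}
EOF

# -n: do not read input implicitly; inputs then feeds records to reduce one by one.
# The accumulator is a small object keyed by status class, so memory stays flat.
classes=$(jq -cnS 'reduce inputs as $r ({};
                     .[($r.status / 100 | floor | tostring) + "xx"] += 1)' "$log")
printf '%s\n' "$classes"
rm -f "$log"
```

The result is a single JSON object of counts per class, which is convenient when the next step is another jq filter rather than a shell pipeline.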

Recipe 4: Top Crawler IPs

To find the busiest crawler addresses, filter to bot user-agents, project the IP, and tally. This is the JSON counterpart to ranking IPs from a combined log, and it streams:

jq -r 'select(.http_user_agent | test("bot|crawl|spider"; "i")) | .remote_addr' access.json.log \
  | sort | uniq -c | sort -rn | head

Expected Output:

  41280 66.249.66.1
  19833 157.55.39.27
   8120 66.249.66.4
   3401 5.255.250.100
   1290 114.119.130.10

The leading IPs should reverse-resolve to their claimed operators; the user-agent alone never proves identity. Ranking IPs this way is the first step in separating verified crawlers from impostors, which the awk and grep filtering guide extends with grep prefilters to cut the dataset before jq parses it.

Recipe 5: Hourly Buckets from the Timestamp

To build a crawl-rate-by-hour profile, truncate the ISO 8601 timestamp to the hour. Since time_iso8601 is an RFC 3339 string like 2026-06-19T08:14:02+00:00, the first 13 characters are exactly the date and hour — a string slice, no date library needed:

jq -r 'select(.http_user_agent | test("Googlebot")) | .time_iso8601[0:13]' access.json.log \
  | sort | uniq -c

Expected Output:

   1820 2026-06-19T06
   2110 2026-06-19T07
   2640 2026-06-19T08
   2480 2026-06-19T09

Safety Note: The slice trick assumes every timestamp shares one timezone offset. If your logs mix offsets (for example, servers in different regions, or a daylight-saving transition mid-file), slicing the local-time string files two different instants under the same hour label. Normalize to UTC at ingestion first. If you have a Unix epoch field instead of an ISO string (called epoch here as an example name), convert with (.epoch | gmtime | strftime("%Y-%m-%dT%H")) so bucketing is always UTC.
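
The epoch variant from the safety note can be sketched as follows. The field name epoch and the three sample records are assumptions for illustration; the values are Unix seconds falling in the 08:00 and 09:00 UTC hours of 2026-06-19.

```shell
# Hypothetical records carrying a Unix epoch field named "epoch" (an assumed key name).
log=$(mktemp)
cat > "$log" <<'EOF'
{"epoch":1781856842,"request_uri":"/"}
{"epoch":1781856849,"request_uri":"/blog/"}
{"epoch":1781860442,"request_uri":"/pricing/"}
EOF

# gmtime converts epoch seconds to broken-down UTC time; strftime formats it
# down to the hour, so every record is bucketed on the same clock.
buckets=$(jq -r '.epoch | gmtime | strftime("%Y-%m-%dT%H")' "$log" | sort | uniq -c)
printf '%s\n' "$buckets"
rm -f "$log"
```

Because the bucket label is computed from the epoch rather than sliced from a local-time string, mixed offsets in the source logs cannot merge or split hours.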

Recipe 6: Streaming Large Files

The slurp-based recipes (-s, group_by) load the whole file into memory and can exhaust RAM on multi-gigabyte logs. Two techniques keep memory flat.

Use jq -c and let the shell aggregate. Any recipe that projects one value per line and pipes to sort | uniq -c already streams — jq holds one object at a time. Prefer that shape over group_by whenever the file is large. Recipes 1, 3, 4, and 5 above are all streaming for this reason.

Use --stream for objects too large to hold individually. When even a single record is huge (deeply nested edge logs), --stream emits [path, value] events instead of whole objects, so jq never materializes the full object. For line-delimited access logs the simpler approach is jq -cn 'inputs', which pulls records one at a time without slurping:

# Count Googlebot 404s across a 12GB NDJSON file without loading it.
# The // "" guard keeps one record with a missing user-agent from aborting the whole reduce.
zcat huge-access.json.log.gz \
  | jq -cn 'reduce inputs as $r (0;
              if (($r.http_user_agent // "") | test("Googlebot")) and $r.status == 404
              then . + 1 else . end)'

Expected Output:

1188

Explanation: -n starts with null input; inputs pulls each line on demand inside the reduce, accumulating a single counter. Memory stays constant regardless of file size, unlike -s which would try to build a 12GB array. For Python-based parsing of files at this scale, see parsing 10GB logs with Python and pandas efficiently.
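
To make the [path, value] events from --stream concrete, here is a small sketch on one record. The nested edge object and its keys are hypothetical, chosen only to show how nesting appears in the path.

```shell
# A single hypothetical record with one nested object.
record='{"status":502,"edge":{"pop":"fra","cache":"MISS"}}'

# --stream turns the object into [path, value] events; the closing events
# that mark the end of an object have length 1, so select(length == 2)
# keeps only the leaf values.
events=$(printf '%s\n' "$record" | jq -c --stream 'select(length == 2)')
printf '%s\n' "$events"
```

Each output line pairs a path array such as ["edge","pop"] with its leaf value, which is what lets jq process a record piece by piece instead of materializing it whole.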

Edge-Case Handling

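Lines that are not valid JSON. A truncated write or a stray combined-format line makes jq abort with a parse error mid-file, and every record after that line goes unread. Add --seq only if your producer wrote RS-delimited JSON; otherwise skip bad lines with a per-line guard such as jq -cR 'fromjson?', which reads each line as raw text and drops the ones that do not parse. To locate the problem, run jq -c . file >/dev/null 2>errors.txt; because jq stops at the first parse error, errors.txt names only the first bad line. Quarantine the failures before trusting any totals. The full repair workflow for broken JSON is parsing malformed JSON logs with Vector VRL.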
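
One way to write the per-line guard, and its inverse for quarantining, is sketched below. The four sample lines are hypothetical: two valid records, one truncated record, and one stray combined-format line.

```shell
# Hypothetical sample: two valid records, one truncated write, one non-JSON line.
log=$(mktemp)
cat > "$log" <<'EOF'
{"request_uri":"/","status":200}
{"request_uri":"/checkout","status":50
203.0.113.10 - - [19/Jun/2026:08:14:02 +0000] "GET / HTTP/1.1" 200 0
{"request_uri":"/blog/","status":404}
EOF

# -R reads each line as a raw string; fromjson? parses it and yields
# nothing (instead of an error) when the line is not valid JSON.
good=$(jq -cR 'fromjson?' "$log")

# The inverse: emit the raw line only when parsing fails, to quarantine it.
bad=$(jq -rR '. as $line | try (fromjson | empty) catch $line' "$log")

printf 'parsed:\n%s\n' "$good"
printf 'quarantined:\n%s\n' "$bad"
rm -f "$log"
```

Any recipe on this page can be placed after the guard, for example jq -cR 'fromjson? | select(.status >= 500)', at the cost of parsing each line through a string first.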

Missing keys. If a key is absent, jq returns null, and null | test("...") raises an error. Guard string operations with the alternative operator: (.http_user_agent // "") | test("Googlebot") substitutes an empty string when the key is missing, so the filter never throws on a sparse record.
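
A minimal sketch of the guard in use, against a hypothetical sample in which one record lacks the user-agent key and another carries an explicit null:

```shell
# Hypothetical sample: one Googlebot hit, one record with no user-agent key,
# one with an explicit null, and one non-bot client.
log=$(mktemp)
cat > "$log" <<'EOF'
{"request_uri":"/a","status":404,"http_user_agent":"Mozilla/5.0 (compatible; Googlebot/2.1)"}
{"request_uri":"/b","status":404}
{"request_uri":"/c","status":404,"http_user_agent":null}
{"request_uri":"/d","status":404,"http_user_agent":"curl/8.5.0"}
EOF

# // substitutes "" when the field is missing or null, so test() always
# receives a string and the sparse records are simply not matched.
hits=$(jq -r 'select((.http_user_agent // "") | test("Googlebot")) | .request_uri' "$log")
printf '%s\n' "$hits"
rm -f "$log"
```

Only /a is printed; the two sparse records pass through the filter quietly instead of producing errors on stderr.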

Verification

After adapting a recipe, confirm jq actually parsed every line rather than stopping at the first malformed one. Compare the physical line count against the parsed record count:

echo "lines: $(wc -l < access.json.log)  parsed: $(jq -c . access.json.log 2>/dev/null | wc -l)"

Expected Output:

lines: 260729  parsed: 260729

Equal counts mean every line was valid JSON and your aggregates cover the whole file. If parsed is lower, jq hit a line it could not parse and stopped reading there, so your totals are undercounts. Quarantine and repair the bad lines before trusting any metric.

Common Mistakes

  • Slurping a file that does not fit in RAM. jq -s and group_by build one in-memory array and OOM on large logs. Reserve them for files you know are small; use streaming inputs with reduce, or pipe jq -c to sort | uniq -c, for anything large.
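  • Calling test() on a possibly-null field. A missing key yields null, and null | test(...) raises an error. In a per-line filter jq reports the error, drops that record, and exits non-zero; inside reduce inputs or a slurped pipeline the same error aborts the whole run. Always guard with (.field // "") before a string operation on a field that may be absent.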
  • Slicing a timestamp string across mixed timezones. The [0:13] hour trick is correct only when every line shares one offset. With mixed offsets or a DST change, normalize to UTC first, or bucket from a Unix epoch field via gmtime.

Frequently Asked Questions

When should I use jq instead of awk for logs?
Use jq whenever the log is JSON: fields are addressed by name, so reordering a column or a value containing spaces never breaks the filter, and numbers stay typed for comparisons like status >= 500. Use awk for fixed-position plain-text logs, where its lower per-line cost makes it faster and column indices are stable.

How do I run jq on a file too big to fit in memory?
Avoid the slurp operator -s and group_by. Instead, write streaming filters that emit one value per line and let the shell aggregate with sort | uniq -c, or use jq -cn 'reduce inputs as $r (...)' to fold the file into an accumulator one record at a time. Both keep memory constant regardless of file size.

Why does my jq filter error out partway through the file?
Almost always a malformed line (a truncated write or a non-JSON line mixed in) or a test()/string operation on a missing key that evaluated to null. Validate with jq -c . file >/dev/null to find the bad line, and guard optional fields with the // "" alternative operator so absent keys do not throw.

Part of the Structured JSON Logging for Analysis series.