ELK Stack Log Ingestion

A robust ELK Stack log ingestion pipeline turns raw web server access logs into queryable crawl intelligence: Googlebot frequency, status-code distribution, crawl-budget waste, and per-URL hit counts, all searchable in Kibana within seconds. This guide walks the full path a log line travels — Filebeat ships it, Logstash parses it with grok, Elasticsearch indexes and ages it under ILM, and Kibana visualizes it — with runnable configuration and expected output at every stage. Teams often prototype with a lightweight Python logparser setup or deploy a Node.js and GoAccess integration for real-time terminal monitoring, but once you need free-text search across every field and centralized retention, a full search index is the right tool.

This page sits inside the broader Log Parsing Workflows & CLI Toolchains collection. You will verify Elasticsearch node health, configure Filebeat to ship nginx logs over TLS, build a Logstash grok pipeline that tags bot traffic and enriches with GeoIP, apply an Index Lifecycle Management policy for cost control, and troubleshoot the failure modes that silently drop crawl data.

Key Implementation Objectives:

  • Deploy Filebeat for secure, backpressure-aware log shipping
  • Design a Logstash grok pipeline that parses combined-format logs and isolates crawlers
  • Apply ILM so daily indices roll over, shrink, and self-expire
  • Extract SEO-specific fields and keep IPs and URLs as keyword types

Prerequisites

This guide targets a current Elastic Stack release (8.x or 9.x) and assumes the following are in place before you begin:

Component | Minimum version | Role
Elasticsearch | 8.x+ | Search index and storage, with dedicated master/data/ingest roles
Logstash | 8.x+ | Centralized grok parsing, enrichment, and routing
Filebeat | 8.x+ (9 disables the deprecated log input by default) | Lightweight edge shipper with the filestream input
Kibana | 8.x+ | Dashboards and the Discover/Lens query surface
nginx | any | Source of access.log in combined format

You need shell access to each web host, a running Elasticsearch cluster reachable on 9200, TLS certificates for Filebeat-to-Logstash transport on 5044, and the ES_PASSWORD for the elastic user exported in your shell. Decide your nginx log format up front — the default combined format maps cleanly onto Logstash's %{COMBINEDAPACHELOG} pattern, but if you can emit structured JSON logging instead, you skip grok entirely and parse with a json filter.

Filebeat Shipping & Elasticsearch Topology

The full data flow is below: Filebeat tails each access log and ships line-by-line to Logstash, where a grok filter parses the raw string into structured fields; Logstash bulk-indexes the documents into Elasticsearch, which an ILM policy rolls over and ages; Kibana queries the resulting indices. The grok-parse stage is the load-bearing transformation — get its pattern wrong and every downstream field is empty.

Figure: ELK Stack log ingestion data flow. nginx access.log (raw lines) → Filebeat (ship over TLS) → Logstash (grok parse, tag bot + GeoIP) → Elasticsearch (daily indices, ILM hot/warm) → Kibana (dashboards). The grok parse stage is marked critical: a mismatch empties every field.

Step 1: Verify node health and roles
Before shipping anything, confirm the cluster is healthy and its status is green. This single call reports the overall status, the node counts, and shard allocation totals for the whole cluster; it does not list per-node roles.

curl -s -X GET "https://elasticsearch:9200/_cluster/health?pretty" \
  -u elastic:${ES_PASSWORD}

Expected Output:

{
  "cluster_name": "seo-logs-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 2
}

A yellow status means replicas are unassigned (acceptable on a single-node test box); red means a primary shard is missing and you must resolve it before ingesting. The _cluster/health endpoint reports the aggregate state across the Elasticsearch nodes, not a single host.
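In a deploy script, the same status check can gate the rest of the rollout. The sketch below is one way to do that with only standard text tools. The helper name health_gate and its return codes are assumptions made for this example, not anything Elasticsearch defines.

```shell
#!/bin/sh
# health_gate (hypothetical helper): reads _cluster/health JSON on stdin.
# Returns 0 for green, 1 for yellow, 2 for red or unparseable input.
health_gate() {
  # Strip whitespace so pretty-printed and compact JSON look the same,
  # then pull out the value of the "status" key.
  status=$(tr -d ' \n' | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
  case "$status" in
    green)  return 0 ;;
    yellow) return 1 ;;
    *)      return 2 ;;
  esac
}

# Offline demo with a canned response. Against a live cluster you would pipe
# the curl call from this step into health_gate instead.
if printf '{ "status": "green", "number_of_nodes": 3 }' | health_gate; then
  echo "cluster healthy, safe to ingest"
fi
```

Treating yellow as a distinct exit code lets a single-node test box proceed while a production script can still refuse anything but green.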

Production Warning: Never expose port 9200 or 5044 to public networks. Restrict access via security groups or firewall rules and enforce mutual TLS so a rogue client cannot inject forged log documents.

Step 2: Deploy the Filebeat configuration
Create /etc/filebeat/filebeat.yml. This targets the nginx access log with the modern filestream input (the legacy log input is deprecated and disabled by default in Filebeat 9), stitches any multiline entries, and routes to Logstash over TLS.

filebeat.inputs:
  - type: filestream
    id: nginx-access
    enabled: true
    paths:
      - /var/log/nginx/access.log
    parsers:
      - multiline:
          type: pattern
          pattern: '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
          negate: true
          match: after

output.logstash:
  hosts: ["logstash:5044"]
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]

Explanation: The filestream input requires a unique id per input — reusing one silently breaks state tracking. The multiline parser treats any line not starting with an IP address as a continuation of the previous entry, so a wrapped log line is never split into two documents.
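To see how the multiline pattern groups lines before deploying it, the same regex can be run through grep. This is a minimal sketch: count_entries is a hypothetical helper, the sample lines are invented, and the count equals the number of events only when the file itself begins with an entry line.

```shell
#!/bin/sh
# Same regex as the multiline parser: a line that begins with an IPv4 address.
ENTRY_START='^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# count_entries (hypothetical helper): number of lines that start a new entry.
count_entries() {
  grep -cE "$ENTRY_START" "$1"
}

sample=$(mktemp)
printf '%s\n' \
  '66.249.66.1 - - [25/Oct/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"' \
  '  wrapped continuation of the previous entry' \
  '203.0.113.7 - - [25/Oct/2024:10:00:01 +0000] "GET /b HTTP/1.1" 404 0 "-" "curl/8.0"' \
  > "$sample"

echo "physical lines: $(wc -l < "$sample" | tr -d ' ')"   # 3
echo "entries:        $(count_entries "$sample")"         # 2
rm -f "$sample"
```

One consequence worth checking on your own logs: a line from an IPv6 client does not start with an IPv4 address, so this pattern would append it to the previous event. If you serve IPv6 traffic, the pattern needs to be widened accordingly.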

Step 3: Validate and restart the agent
Test the configuration syntax before touching a production service, then restart and confirm it is publishing.

sudo filebeat test config -e
sudo systemctl restart filebeat
sudo systemctl status filebeat

Expected Output: test config prints Config OK. After restart, /var/log/filebeat/filebeat shows Successfully published N events lines, and the registry at /var/lib/filebeat/registry advances its offset without corruption.

Production Warning: Filebeat applies backpressure automatically when Logstash queues fill, but if the registry file is deleted while Filebeat runs, it re-reads every log from the top and floods Elasticsearch with duplicates. Stop the service before touching the registry.

Logstash Grok Pipeline & Field Mapping

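This is where raw text becomes structured, queryable data. The pipeline parses the combined log format with grok, tags crawler traffic, normalizes the timestamp, and enriches with GeoIP. With legacy (non-ECS) field naming, the %{COMBINEDAPACHELOG} pattern populates clientip, timestamp, verb, request, response, bytes, referrer, and agent. Logstash 8 and later enable ECS compatibility by default, which makes the same pattern emit ECS names such as [source][address] and [user_agent][original] instead, so this guide assumes ecs_compatibility is set to disabled (per plugin, or globally with pipeline.ecs_compatibility: disabled in logstash.yml). Note that the user agent lands in the agent field, not http_user_agent, a common copy-paste error that makes bot tagging silently never fire.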

Step 1: Implement the parsing pipeline
Save this to /etc/logstash/conf.d/seo-pipeline.conf.

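input {
  beats {
    port => 5044
    # Filebeat connects over TLS, so this listener must speak TLS too.
    # Certificate paths are placeholders; older 8.x releases use ssl => true.
    ssl_enabled => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.key"
  }
}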
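filter {
  # Filebeat attaches its own [agent] metadata object; drop it so the
  # legacy grok capture named "agent" does not collide with it.
  mutate { remove_field => ["agent"] }
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    # Keep legacy field names (clientip, verb, agent) instead of ECS names.
    ecs_compatibility => "disabled"
  }
  # Logstash regex literals take no trailing flags; use an inline (?i:...) group.
  if [agent] =~ /(?i:bot|crawl|spider)/ {
    mutate { add_tag => ["bot_traffic"] }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
  geoip {
    source => "clientip"
    ecs_compatibility => "disabled"
  }
}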
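output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    user => "elastic"
    password => "${ES_PASSWORD}"
    # Write through the ILM rollover alias rather than a date-stamped index
    # name, so the rollover action has a write index to roll. Create the ILM
    # policy and index template (next section) before starting Logstash.
    ilm_enabled => true
    ilm_rollover_alias => "seo-logs"
    ilm_pattern => "{now/d}-000001"
    ilm_policy => "seo_logs_retention"
    manage_template => false
  }
}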

Explanation: The date filter overrides @timestamp with the real request time from the log line. Skip it and Kibana time-series charts plot ingest time, which is wrong during any shipping lag or backfill. The bot_traffic tag lets you slice crawler vs. human traffic without re-parsing the user agent at query time.
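Before trusting the bot_traffic tag in dashboards, it can help to run the same case-insensitive pattern over a raw log with standard tools and compare the totals. The sketch below assumes combined-format lines with no escaped double quotes inside the request or referrer; count_bot_hits is a hypothetical helper and the sample lines are invented.

```shell
#!/bin/sh
# count_bot_hits (hypothetical helper): reads combined-format lines on stdin
# and prints bot/human totals. Splitting on double quotes puts the user agent
# in field 6.
count_bot_hits() {
  awk -F'"' '
    { ua = tolower($6) }
    ua ~ /bot|crawl|spider/ { bot++; next }
    { human++ }
    END { printf "bot=%d human=%d\n", bot, human }
  '
}

printf '%s\n' \
  '66.249.66.1 - - [25/Oct/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' \
  '157.55.39.1 - - [25/Oct/2024:10:00:02 +0000] "GET /b HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"' \
  '203.0.113.7 - - [25/Oct/2024:10:00:03 +0000] "GET /c HTTP/1.1" 200 64 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64) Firefox/131.0"' \
  | count_bot_hits
# prints: bot=2 human=1
```

If the totals from the raw file and from a Kibana filter on the bot_traffic tag diverge widely for the same time window, the tagging conditional is the first place to look.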

Step 2: Validate the pipeline and restart
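Use the built-in validator to catch configuration syntax errors before deployment. It checks that the pipeline file parses and that plugin options are valid; it does not tell you whether a grok pattern actually matches your log lines.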

sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
sudo systemctl restart logstash

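Expected Output: the validator prints Configuration OK. After restart, /var/log/logstash/logstash-plain.log shows Pipeline main started. The beats input speaks the Beats (Lumberjack) protocol, so a raw line piped through nc to port 5044 will not be parsed. Instead, append a known-good line to an access log that Filebeat is tailing, or request a page with a crawler user agent, and confirm a document appears in the day's index with verb, response, and the bot_traffic tag populated.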

Production Warning: A grok pattern that fails to match a line adds a _grokparsefailure tag and passes the message through with no extracted fields. These documents still index and quietly skew every aggregation. Add an if "_grokparsefailure" in [tags] branch that routes failures to a separate index so you can monitor and fix them rather than silently absorbing bad data.

The table below is the field map for a combined-format nginx line feeding crawl analytics, and which Elasticsearch type each field should carry.

Field | Example value | ES type | Why this type
clientip | 66.249.66.1 | ip | Enables CIDR range queries and GeoIP joins
request | /products/widget-42 | keyword | Exact-match aggregations; no full-text analysis
verb | GET, HEAD | keyword | Crawlers favor GET/HEAD; cheap term aggregation
response | 200, 404, 301 | keyword (or short) | Status triage by exact code
agent | full UA string | keyword | Exact bot signatures; avoid text-analyzer overhead
bytes | 5123 | long | Numeric range and sum aggregations
@timestamp | parsed request time | date | The time axis for every crawl-rate chart
geoip.location | {lat, lon} | geo_point | Map visualizations of crawl origin

SEO callout — keyword vs. text. Mapping request and agent as keyword rather than the default text is the single biggest storage and performance win. Text analysis tokenizes every URL and user agent, ballooning the index and making exact-match crawl aggregations slow. Keyword stores the value verbatim, exactly what terms aggregations on top-crawled URLs need. The trade-offs between exact and analyzed fields mirror the field semantics covered in understanding HTTP status codes in server logs.

Index Lifecycle Management & Templates

Daily indices are easy to reason about but accumulate fast. Index Lifecycle Management (ILM) automates rollover, shrink, force-merge, and deletion so storage cost stays bounded without manual intervention.

Step 1: Apply the ILM policy
This policy keeps indices hot for active writes, transitions them to warm seven days after rollover (shrinking to one shard and force-merging for query speed), and deletes them 90 days after rollover.

curl -X PUT "https://elasticsearch:9200/_ilm/policy/seo_logs_retention" \
  -H "Content-Type: application/json" \
  -u elastic:${ES_PASSWORD} \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": { "max_size": "50gb", "max_age": "1d" },
            "set_priority": { "priority": 100 }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "shrink": { "number_of_shards": 1 },
            "forcemerge": { "max_num_segments": 1 }
          }
        },
        "delete": {
          "min_age": "90d",
          "actions": { "delete": {} }
        }
      }
    }
  }'

Expected Output: {"acknowledged":true}. Confirm the phases with curl -s "https://elasticsearch:9200/_ilm/policy/seo_logs_retention?pretty" -u elastic:${ES_PASSWORD}.

The ILM phases map directly onto cost tiers:

Phase | Triggers at | Actions | Goal
Hot | index creation | rollover at 50gb / 1d, high priority | Fast writes and recent-data queries
Warm | 7 days | shrink to 1 shard, force-merge to 1 segment | Cheaper storage, still queryable
Delete | 90 days | delete index | Bound total storage and meet retention policy
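Because the delete phase caps how long data lives, the retention window translates directly into a storage ceiling. The arithmetic below is a rough sketch: peak_storage_gb is a hypothetical helper, the 12 GB/day figure is an invented input, and the result ignores any space recovered by shrink and force-merge in the warm phase.

```shell
#!/bin/sh
# peak_storage_gb (hypothetical helper): rough upper bound on cluster storage.
# Args: daily primary volume in GB, days retained before delete, replica count.
peak_storage_gb() {
  echo $(( $1 * $2 * (1 + $3) ))
}

peak_storage_gb 12 90 1   # 12 GB/day, 90 days, 1 replica -> prints 2160
```

Halving the retention window or dropping replicas on warm indices are the two levers this formula exposes; which one is acceptable depends on your recovery and compliance requirements.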

Step 2: Attach the policy via an index template
A template applies the policy and field mappings to every seo-logs-* index automatically at rollover.

curl -X PUT "https://elasticsearch:9200/_index_template/seo-logs-template" \
  -H "Content-Type: application/json" \
  -u elastic:${ES_PASSWORD} \
  -d '{
    "index_patterns": ["seo-logs-*"],
    "template": {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
        "index.lifecycle.name": "seo_logs_retention",
        "index.lifecycle.rollover_alias": "seo-logs"
      },
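      "mappings": {
        "properties": {
          "clientip": { "type": "ip" },
          "request": { "type": "keyword" },
          "verb": { "type": "keyword" },
          "response": { "type": "keyword" },
          "agent": { "type": "keyword" },
          "bytes": { "type": "long" },
          "geoip": {
            "properties": { "location": { "type": "geo_point" } }
          }
        }
      }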
    }
  }'

Expected Output: {"acknowledged":true}. New indices created after this call inherit the ILM policy and the keyword mappings.

Production Warning: Avoid over-provisioning primary shards. Start with one or two shards per daily index. Each shard carries fixed memory and file-handle overhead on the data nodes, so thousands of tiny shards degrade query performance and complicate ILM transitions far more than a few correctly sized ones. Aim for shards in the 10–50gb range. The retention window here also has to satisfy any legal hold; align it with your log retention policies before deleting anything.
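The shard-size guidance reduces to a ceiling division. The following sketch picks the smallest primary shard count that keeps each shard at or under a target size. The helper name and the sample volumes are assumptions for illustration, and the inputs are whole gigabytes.

```shell
#!/bin/sh
# primary_shards_for (hypothetical helper): smallest shard count that keeps
# each primary at or under a target size. Arguments are whole gigabytes.
primary_shards_for() {
  daily_gb=$1
  target_gb=$2
  # Integer ceiling division, with a floor of one shard.
  n=$(( (daily_gb + target_gb - 1) / target_gb ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

primary_shards_for 2 50     # small site: prints 1
primary_shards_for 120 50   # busy site:  prints 3
```

For most crawl-log workloads the answer is one, which is why starting with one or two shards and revisiting only when daily volume grows is a reasonable default.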

Validation & Troubleshooting

ELK ingestion failures are usually about grok mismatches, circuit breakers, or timestamps. Each named failure mode below has a confirming command and a recovery recipe. For an architectural deep dive into how these components scale, see the companion guide on ELK Stack architecture for SEO log analysis.

Step 1: Monitor pipeline health
Confirm indices are being created and documents are flowing.

curl -s "https://elasticsearch:9200/_cat/indices/seo-logs-*?v&h=index,docs.count,store.size" \
  -u elastic:${ES_PASSWORD}

Expected Output:

index                      docs.count store.size
seo-logs-2024.10.25-000001    1542000      1.2gb

A stalled docs.count means ingestion stopped — work through the failure modes below.
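Detecting a stall means comparing two samples taken some time apart. A minimal sketch of that comparison follows. total_docs and is_stalled are hypothetical helpers, and the canned input stands in for two runs of the _cat/indices call above.

```shell
#!/bin/sh
# total_docs (hypothetical helper): sums the docs.count column of
# _cat/indices output (header row included) read from stdin.
total_docs() {
  awk 'NR > 1 { sum += $2 } END { printf "%d\n", sum }'
}

# is_stalled (hypothetical helper): succeeds when the second sample
# shows no growth over the first.
is_stalled() {
  [ "$2" -le "$1" ]
}

# Two canned samples; in practice, run the curl call twice a minute apart.
before=$(printf 'index docs.count store.size\nseo-logs-2024.10.25-000001 1542000 1.2gb\n' | total_docs)
after=$(printf 'index docs.count store.size\nseo-logs-2024.10.25-000001 1542000 1.2gb\n' | total_docs)

if is_stalled "$before" "$after"; then
  echo "docs.count stuck at $after: ingestion looks stalled"
fi
```

On a quiet site a flat count over one minute can be legitimate, so choose a sampling interval longer than the longest gap you expect between requests.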

Failure Mode 1: Grok parse failures.
Symptom: documents arrive but verb, response, and agent are empty, and the _grokparsefailure tag is present.

curl -s "https://elasticsearch:9200/seo-logs-*/_count" \
  -u elastic:${ES_PASSWORD} \
  -H "Content-Type: application/json" \
  -d '{"query":{"term":{"tags":"_grokparsefailure"}}}'

Recovery: A non-zero count means your log format diverged from %{COMBINEDAPACHELOG} — often a custom log_format directive adding an upstream-time or request-ID field. Build the corrected pattern in Kibana's Grok Debugger against a real line, then redeploy the pipeline.
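To get a real offending line for the Grok Debugger, a rough shape check over the raw log is often enough. The regex below is an approximation written for triage, not a translation of %{COMBINEDAPACHELOG}: it is stricter about trailing fields than grok is. first_offender is a hypothetical helper, and the sample lines are invented.

```shell
#!/bin/sh
# Rough shape of an nginx combined-format line (triage approximation only).
COMBINED='^[^ ]+ [^ ]+ [^ ]+ \[[^]]+\] "[^"]*" [0-9]{3} ([0-9]+|-) "[^"]*" "[^"]*"$'

# first_offender (hypothetical helper): print the first line of a log file
# that does not have the combined shape.
first_offender() {
  grep -vE "$COMBINED" "$1" | head -n 1
}

sample=$(mktemp)
printf '%s\n' \
  '66.249.66.1 - - [25/Oct/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"' \
  '203.0.113.7 - - [25/Oct/2024:10:00:01 +0000] req-7f3a "GET /b HTTP/1.1" 404 0 "-" "curl/8.0"' \
  > "$sample"

first_offender "$sample"   # prints the second line, which has an extra request-ID field
rm -f "$sample"
```

An empty result does not prove grok will match every line, but a non-empty one gives you a concrete sample to iterate on.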

Failure Mode 2: Registry corruption (duplicate or missing docs).
Symptom: a single host either re-ships old lines or stops shipping entirely.

Recovery: Stop Filebeat, delete /var/lib/filebeat/registry, and restart only if you accept a one-time re-read. Otherwise, fix the underlying disk issue and let Filebeat resume from its last offset.

Failure Mode 3: Circuit breakers rejecting bulk requests.
Symptom: Logstash logs 429 Too Many Requests or circuit_breaking_exception.

curl -s "https://elasticsearch:9200/_nodes/stats/breaker?pretty" \
  -u elastic:${ES_PASSWORD} | grep -A3 '"parent"'

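Recovery: Enable Logstash persistent queues (queue.type: persisted) to absorb spikes, and lower Filebeat's bulk_max_size so each batch is smaller. Treat indices.breaker.total.limit as a last resort: it already defaults to 95% of heap when real-memory accounting is on (70% when it is off), so there is little room to raise it, and raising the breaker without adding memory just delays the out-of-memory kill. Adding heap or data nodes is the durable fix.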

Failure Mode 4: Timestamp gaps in Kibana.
Symptom: data appears clustered at ingest time or arrives out of order on the time axis.

Recovery: Confirm the date filter's pattern matches your log exactly (dd/MMM/yyyy:HH:mm:ss Z for nginx default) and that it targets @timestamp. A timezone mismatch here distorts every crawl-rate chart.
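A quick way to check what the date filter will actually see is to list the UTC offsets present in the raw timestamps. This is a sketch under the assumption that timestamps use the bracketed nginx default layout; utc_offsets is a hypothetical helper and the sample lines are invented.

```shell
#!/bin/sh
# utc_offsets (hypothetical helper): list the distinct UTC offsets found in
# bracketed timestamps of the form [dd/MMM/yyyy:HH:mm:ss +hhmm] on stdin.
utc_offsets() {
  sed -n 's/^[^[]*\[[^]]* \([+-][0-9]\{4\}\)\].*/\1/p' | sort -u
}

printf '%s\n' \
  '66.249.66.1 - - [25/Oct/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"' \
  '203.0.113.7 - - [25/Oct/2024:12:00:01 +0200] "GET /b HTTP/1.1" 200 64 "-" "curl/8.0"' \
  '198.51.100.4 - - [2024-10-25T10:00:02+00:00] "GET /c HTTP/1.1" 200 64 "-" "curl/8.0"' \
  | utc_offsets
# prints +0000 and +0200; the ISO 8601 line is skipped because it lacks this layout
```

Lines that produce no offset here are the ones the dd/MMM/yyyy:HH:mm:ss Z pattern will fail to parse, which leaves @timestamp at ingest time for those events.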

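Failure Mode 5: Rejected events piling up in the DLQ.
Symptom: events vanish from the main index because Elasticsearch rejects them with a 400 or 404 response, most often a mapping conflict. Grok failures do not land here; they are indexed with the _grokparsefailure tag (Failure Mode 1). Enable and inspect the dead letter queue.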

# In /etc/logstash/logstash.yml, enable DLQ:
# dead_letter_queue.enable: true
# dead_letter_queue.max_bytes: 1024mb

# Reprocess DLQ events by pointing a separate pipeline at the
# dead_letter_queue input plugin in a dedicated pipeline config —
# not via a --path.data override.

Recovery: Read the DLQ with a second pipeline that uses the dead_letter_queue input, inspect the reason metadata on each event, fix the mapping or grok issue, and re-emit. This recovers data that a mapping conflict would otherwise lose.

Common Mistakes

  • Putting the user agent in the wrong field: Matching [http_user_agent] when %{COMBINEDAPACHELOG} produces [agent] means the bot_traffic tag never fires and crawler analysis is empty. Fix: tag on [agent], or rename the field once with mutate.
  • Leaving request and agent as text: The default text analyzer tokenizes every URL and user agent, bloating the index and slowing exact-match aggregations. Fix: map both as keyword in the index template before the first rollover.
  • Skipping the date filter: Without it, @timestamp is ingest time, so every crawl-rate and time-series chart is wrong during lag or backfill. Fix: add the date filter with the exact nginx layout targeting @timestamp.
  • Over-provisioning primary shards: Too many small shards multiply heap and file-handle overhead on the data nodes and slow queries. Fix: start with one or two shards per daily index and let ILM shrink warm indices to one.
  • Ignoring _grokparsefailure documents: Unparsed lines still index and quietly skew aggregations. Fix: route failures to a separate index and alert on their count rather than absorbing them.

Frequently Asked Questions

How do I handle high-volume log ingestion without overwhelming Elasticsearch?
Implement Logstash persistent queues (queue.type: persisted) to absorb bursts, tune Filebeat output batch sizes (bulk_max_size), and use ILM to roll indices over before they exceed an optimal shard size. If write pressure still spikes the circuit breakers, front the pipeline with a buffer; the Vector.dev pipeline configuration guide covers backpressure-aware buffering you can place ahead of Logstash.

Can the ELK Stack replace traditional SEO log analyzers?
Yes. Paired with custom Kibana dashboards and grok parsers, ELK provides scalability, real-time alerting, and deeper crawl-budget tracking than legacy GUI analyzers. The trade-off is operational overhead: you run and tune Elasticsearch. If you only need crawl-rate and status slicing without free-text search, a lighter index like Grafana Loki log aggregation is cheaper to operate.

ELK, Vector, or CloudWatch — which pipeline should I choose?
It depends on scale, budget, and whether you need full free-text search. ELK is strongest when you query arbitrary fields across billions of documents; managed and index-light alternatives win on cost and operational simplicity. The ELK vs Vector.dev vs CloudWatch for SEO log pipelines comparison weighs each option against concrete crawl-analytics workloads.

What is the best way to filter out internal IP traffic and staging environments?
Use a Logstash conditional, but note the in operator does not understand CIDR ranges natively — use the cidr filter plugin to test clientip against ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"] and drop {} matches, or route them to a separate internal-logs-* index so they never pollute crawl aggregations.
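The reason a plain string or list comparison falls short is that a CIDR block is not always a clean text prefix. The sketch below tests the three private ranges by hand to make that visible. is_rfc1918 is a hypothetical helper for offline checks on IPv4 dotted-quad input, not a replacement for the cidr filter.

```shell
#!/bin/sh
# is_rfc1918 (hypothetical helper): succeeds for addresses inside
# 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. IPv4 dotted-quad input only.
is_rfc1918() {
  case "$1" in
    10.*)                                   return 0 ;;
    192.168.*)                              return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)  return 0 ;;
    *)                                      return 1 ;;
  esac
}

for ip in 10.4.2.1 172.20.0.9 172.32.0.9 66.249.66.1; do
  if is_rfc1918 "$ip"; then echo "$ip internal"; else echo "$ip external"; fi
done
# prints:
#   10.4.2.1 internal
#   172.20.0.9 internal
#   172.32.0.9 external
#   66.249.66.1 external
```

The 172.16.0.0/12 block covers only 172.16 through 172.31, so a naive match on the 172. prefix would misclassify public addresses; range-aware matching avoids that class of mistake.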
