ELK Stack Log Ingestion
A robust ELK Stack log ingestion pipeline turns raw web server access logs into queryable crawl intelligence: Googlebot frequency, status-code distribution, crawl-budget waste, and per-URL hit counts, all searchable in Kibana within seconds. This guide walks the full path a log line travels — Filebeat ships it, Logstash parses it with grok, Elasticsearch indexes and ages it under ILM, and Kibana visualizes it — with runnable configuration and expected output at every stage. Teams often prototype with a lightweight Python logparser setup or deploy a Node.js and GoAccess integration for real-time terminal monitoring, but once you need free-text search across every field and centralized retention, a full search index is the right tool.
This page sits inside the broader Log Parsing Workflows & CLI Toolchains collection. You will verify Elasticsearch node health, configure Filebeat to ship nginx logs over TLS, build a Logstash grok pipeline that tags bot traffic and enriches with GeoIP, apply an Index Lifecycle Management policy for cost control, and troubleshoot the failure modes that silently drop crawl data.
Key Implementation Objectives:
- Deploy Filebeat for secure, backpressure-aware log shipping
- Design a Logstash grok pipeline that parses combined-format logs and isolates crawlers
- Apply ILM so daily indices roll over, shrink, and self-expire
- Extract SEO-specific fields and keep IPs and URLs as keyword types
Prerequisites
This guide targets a current Elastic Stack release (8.x or 9.x) and assumes the following are in place before you begin:
| Component | Minimum version | Role |
|---|---|---|
| Elasticsearch | 8.x+ | Search index and storage, with dedicated master/data/ingest roles |
| Logstash | 8.x+ | Centralized grok parsing, enrichment, and routing |
| Filebeat | 8.x+ (9 removes the log input) | Lightweight edge shipper with the filestream input |
| Kibana | 8.x+ | Dashboards and the Discover/Lens query surface |
| nginx | any | Source of access.log in combined format |
You need shell access to each web host, a running Elasticsearch cluster reachable on 9200, TLS certificates for Filebeat-to-Logstash transport on 5044, and the ES_PASSWORD for the elastic user exported in your shell. Decide your nginx log format up front — the default combined format maps cleanly onto Logstash's %{COMBINEDAPACHELOG} pattern, but if you can emit structured JSON logging instead, you skip grok entirely and parse with a json filter.
Filebeat Shipping & Elasticsearch Topology
The full data flow is below: Filebeat tails each access log and ships line-by-line to Logstash, where a grok filter parses the raw string into structured fields; Logstash bulk-indexes the documents into Elasticsearch, which an ILM policy rolls over and ages; Kibana queries the resulting indices. The grok-parse stage is the load-bearing transformation — get its pattern wrong and every downstream field is empty.
Step 1: Verify node health and roles
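Before shipping anything, confirm the Elasticsearch cluster status is green. This single call reports the aggregate status, the node and data-node counts, and shard allocation counts. It does not list per-node roles, so check those separately with GET _cat/nodes?v&h=name,node.role if your topology uses dedicated master, data, and ingest nodes.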
curl -s -X GET "https://elasticsearch:9200/_cluster/health?pretty" \
-u elastic:${ES_PASSWORD}
Expected Output:
{
  "cluster_name": "seo-logs-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 2
}
A yellow status means replicas are unassigned (acceptable on a single-node test box); red means a primary shard is missing and you must resolve it before ingesting. The _cluster/health endpoint reports the aggregate state across the Elasticsearch nodes, not a single host.
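The status rules above can be sketched as a small gate that a deployment script might run before starting Filebeat. This is a minimal illustration, not part of the Elastic tooling: the function name `ingestion_ready` and the `allow_yellow` flag are assumptions introduced here, and the sample document mirrors the expected output shown above.

```python
import json

def ingestion_ready(health: dict, allow_yellow: bool = False) -> bool:
    """Return True when the cluster health document says it is safe to ingest."""
    status = health.get("status")
    if status == "green":
        return True
    # Yellow means unassigned replicas; tolerate it only when explicitly
    # allowed, for example on a single-node test box.
    if status == "yellow" and allow_yellow:
        return True
    # Red (a missing primary shard) or an unknown status blocks ingestion.
    return False

# Sample body shaped like the _cluster/health response shown above.
sample = json.loads(
    '{"cluster_name": "seo-logs-cluster", "status": "green", '
    '"number_of_nodes": 3, "number_of_data_nodes": 2}'
)
print(ingestion_ready(sample))
```

Defaulting `allow_yellow` to False keeps the production path strict, and a test environment can opt in explicitly.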
Production Warning: Never expose port 9200 or 5044 to public networks. Restrict access via security groups or firewall rules and enforce mutual TLS so a rogue client cannot inject forged log documents.
Step 2: Deploy the Filebeat configuration
Create /etc/filebeat/filebeat.yml. This targets the nginx access log with the modern filestream input (the legacy log input was removed in Filebeat 9), stitches any multiline entries, and routes to Logstash over TLS.
filebeat.inputs:
  - type: filestream
    id: nginx-access
    enabled: true
    paths:
      - /var/log/nginx/access.log
    parsers:
      - multiline:
          type: pattern
          pattern: '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
          negate: true
          match: after

output.logstash:
  hosts: ["logstash:5044"]
  ssl.certificate_authorities: ["/etc/filebeat/certs/ca.crt"]
Explanation: The filestream input requires a unique id per input — reusing one silently breaks state tracking. The multiline parser treats any line not starting with an IP address as a continuation of the previous entry, so a wrapped log line is never split into two documents.
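To see what negate: true with match: after does to a wrapped entry, here is a rough Python emulation of the stitching rule. It is a sketch of the behavior described above, not Filebeat's implementation: the function name `stitch` and the sample lines are made up for illustration, and joining with a newline is an assumption about how the combined event looks.

```python
import re

# Same pattern as the multiline parser: a line that starts with an IPv4 address.
IP_START = re.compile(r'^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')

def stitch(lines):
    """Group raw lines into events: a line that does not start with an IP
    is appended to the previous event instead of opening a new one."""
    events = []
    for line in lines:
        if IP_START.match(line) or not events:
            events.append(line)           # start of a new log entry
        else:
            events[-1] += "\n" + line     # continuation of the previous entry
    return events

raw = [
    '66.249.66.1 - - [25/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 5123 "-" "Googlebot/2.1',
    ' (+http://www.google.com/bot.html)"',
    '203.0.113.9 - - [25/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 404 0 "-" "curl/8.0"',
]
events = stitch(raw)
print(len(events))
```

The three raw lines collapse into two events, so the wrapped user agent stays attached to its request.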
Step 3: Validate and restart the agent
Test the configuration syntax before touching a production service, then restart and confirm it is publishing.
sudo filebeat test config -e
sudo systemctl restart filebeat
sudo systemctl status filebeat
Expected Output: test config prints Config OK. After restart, /var/log/filebeat/filebeat shows Successfully published N events lines, and the registry at /var/lib/filebeat/registry advances its offset without corruption.
Production Warning: Filebeat applies backpressure automatically when Logstash queues fill, but if the registry file is deleted while Filebeat runs, it re-reads every log from the top and floods Elasticsearch with duplicates. Stop the service before touching the registry.
Logstash Grok Pipeline & Field Mapping
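This is where raw text becomes structured, queryable data. The pipeline parses the combined log format with grok, tags crawler traffic, normalizes the timestamp, and enriches with GeoIP. With ECS compatibility disabled, the %{COMBINEDAPACHELOG} pattern populates clientip, timestamp, verb, request, response, bytes, referrer, and agent. Logstash 8.x and later enable ECS compatibility by default, in which case the same pattern emits ECS names such as [source][address], [http][request][method], and [user_agent][original] instead, so set ecs_compatibility => disabled on the filters (or pipeline.ecs_compatibility: disabled in logstash.yml) to keep the field names used throughout this guide. Note the user agent lands in the agent field, not http_user_agent, a common copy-paste error that makes bot tagging silently never fire. Filebeat also attaches its own agent metadata object to every event, so check a test document to confirm agent holds the user-agent string before relying on it.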
Step 1: Implement the parsing pipeline
Save this to /etc/logstash/conf.d/seo-pipeline.conf.
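input {
  # Enable this input's TLS options with your server certificate and key
  # so it matches Filebeat's TLS output.
  beats { port => 5044 }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    # Keep the legacy field names (clientip, verb, agent); Logstash 8+ defaults to ECS names.
    ecs_compatibility => disabled
  }
  # Logstash regex literals take no trailing flags; use inline (?i) for case-insensitivity.
  if [agent] =~ /(?i)(bot|crawl|spider)/ {
    mutate { add_tag => ["bot_traffic"] }
  }
  date {
    # Replace ingest time with the request time from the log line.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
  geoip {
    source => "clientip"
    # In legacy mode the enrichment lands under the default geoip field.
    ecs_compatibility => disabled
  }
}
output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "seo-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }
}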
Explanation: The date filter overrides @timestamp with the real request time from the log line. Skip it and Kibana time-series charts plot ingest time, which is wrong during any shipping lag or backfill. The bot_traffic tag lets you slice crawler vs. human traffic without re-parsing the user agent at query time.
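For readers who want to check the parsing logic outside Logstash, the sketch below approximates the three steps (field extraction, bot tagging, timestamp normalization) in Python. The regular expression is a simplified stand-in for %{COMBINEDAPACHELOG}, not the grok pattern itself, and the function name `parse_line` and the sample line are illustrative assumptions.

```python
import re
from datetime import datetime, timezone

# Simplified stand-in for the combined log format (not the real grok pattern).
COMBINED = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
BOT = re.compile(r'bot|crawl|spider', re.IGNORECASE)

def parse_line(line: str) -> dict:
    """Parse one access-log line into a document, mimicking the pipeline stages."""
    m = COMBINED.match(line)
    if m is None:
        # Same outcome as a grok miss: raw message kept, failure tag added.
        return {"message": line, "tags": ["_grokparsefailure"]}
    doc = m.groupdict()
    doc["tags"] = ["bot_traffic"] if BOT.search(doc["agent"]) else []
    # Equivalent of the date filter; %b assumes an English/C locale for "Oct".
    ts = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    doc["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    return doc

line = ('66.249.66.1 - - [25/Oct/2024:13:55:36 +0200] '
        '"GET /products/widget-42 HTTP/1.1" 200 5123 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
doc = parse_line(line)
print(doc["@timestamp"], doc["tags"])
```

The +0200 offset in the sample is converted to UTC, which is the same normalization the date filter performs before Kibana plots the event.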
Step 2: Validate the pipeline and restart
Use the built-in validator to catch configuration syntax errors before deployment.
sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
sudo systemctl restart logstash
Expected Output: the validator prints Configuration OK. After restart, /var/log/logstash/logstash-plain.log shows Pipeline main started. The beats input speaks the Lumberjack protocol, so a raw line piped through nc localhost 5044 will not be parsed. Instead, append a known-good line to the access log that Filebeat is tailing and confirm a document appears in the day's index with verb, response, and the bot_traffic tag populated.
Production Warning: A grok pattern that fails to match a line adds a _grokparsefailure tag and passes the message through with no extracted fields. These documents still index and quietly skew every aggregation. Add an if "_grokparsefailure" in [tags] branch that routes failures to a separate index so you can monitor and fix them rather than silently absorbing bad data.
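The routing decision described in the warning can be sketched as a function that picks the destination index per event. This only illustrates the branch logic; in the pipeline it would be a conditional around two elasticsearch outputs. The index prefix grok-failures- is a hypothetical name chosen here so that failures fall outside the seo-logs-* pattern.

```python
from datetime import datetime, timezone

def target_index(event: dict) -> str:
    """Choose the destination index for one event based on its tags."""
    day = (
        datetime.fromisoformat(event["@timestamp"])
        .astimezone(timezone.utc)
        .strftime("%Y.%m.%d")
    )
    if "_grokparsefailure" in event.get("tags", []):
        # Hypothetical failure index, kept out of the seo-logs-* pattern
        # so unparsed lines never reach crawl aggregations.
        return f"grok-failures-{day}"
    return f"seo-logs-{day}"

ok = {"@timestamp": "2024-10-25T11:55:36+00:00", "tags": ["bot_traffic"]}
bad = {"@timestamp": "2024-10-25T11:55:36+00:00", "tags": ["_grokparsefailure"]}
print(target_index(ok), target_index(bad))
```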
The table below is the field map for a combined-format nginx line feeding crawl analytics, and which Elasticsearch type each field should carry.
| Field | Example value | ES type | Why this type |
|---|---|---|---|
| clientip | 66.249.66.1 | ip | Enables CIDR range queries and GeoIP joins |
| request | /products/widget-42 | keyword | Exact-match aggregations; no full-text analysis |
| verb | GET, HEAD | keyword | Crawlers favor GET/HEAD; cheap term aggregation |
| response | 200, 404, 301 | keyword (or short) | Status triage by exact code |
| agent | full UA string | keyword | Exact bot signatures; avoid text-analyzer overhead |
| bytes | 5123 | long | Numeric range and sum aggregations |
| @timestamp | parsed request time | date | The time axis for every crawl-rate chart |
| geoip.location | {lat, lon} | geo_point | Map visualizations of crawl origin |
SEO callout — keyword vs. text. Mapping request and agent as keyword rather than the default text is the single biggest storage and performance win. Text analysis tokenizes every URL and user agent, ballooning the index and making exact-match crawl aggregations slow. Keyword stores the value verbatim, exactly what terms aggregations on top-crawled URLs need. The trade-offs between exact and analyzed fields mirror the field semantics covered in understanding HTTP status codes in server logs.
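A toy comparison makes the difference concrete. The `analyze` function below is a crude stand-in for a text analyzer (lowercase, split on non-alphanumerics), not Elasticsearch's actual analysis chain, and the URL list is invented for illustration.

```python
import re
from collections import Counter

urls = ["/products/widget-42", "/products/widget-42", "/products/widget-7"]

# keyword: each value is stored verbatim, so a terms aggregation counts whole URLs.
keyword_terms = Counter(urls)

def analyze(value: str) -> list:
    """Crude stand-in for a text analyzer: lowercase and split on non-alphanumerics."""
    return [tok for tok in re.split(r'[^0-9A-Za-z]+', value.lower()) if tok]

# text: each URL is broken into tokens, so the URL itself is no longer a term.
text_terms = Counter(tok for u in urls for tok in analyze(u))

print(keyword_terms.most_common(1))
print(sorted(text_terms.items()))
```

With keyword, the top-crawled URL is available directly. With tokenized text, the index holds fragments such as products and widget, which cannot answer "which URL was hit most" without extra work.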
Index Lifecycle Management & Templates
Daily indices are easy to reason about but accumulate fast. Index Lifecycle Management (ILM) automates rollover, shrink, force-merge, and deletion so storage cost stays bounded without manual intervention.
Step 1: Apply the ILM policy
This policy keeps indices hot for active writes, transitions to warm after seven days (shrinking to one shard and force-merging for query speed), and deletes after 90 days.
curl -X PUT "https://elasticsearch:9200/_ilm/policy/seo_logs_retention" \
  -H "Content-Type: application/json" \
  -u elastic:${ES_PASSWORD} \
  -d '{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}'
Expected Output: {"acknowledged":true}. Confirm the phases with curl -s "https://elasticsearch:9200/_ilm/policy/seo_logs_retention?pretty" -u elastic:${ES_PASSWORD}.
The ILM phases map directly onto cost tiers:
| Phase | Triggers at | Actions | Goal |
|---|---|---|---|
| Hot | index creation | rollover at 50gb / 1d, high priority | Fast writes and recent-data queries |
| Warm | 7 days | shrink to 1 shard, force-merge to 1 segment | Cheaper storage, still queryable |
| Delete | 90 days | delete index | Bound total storage and meet retention policy |
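The phase thresholds in the table can be expressed as a short lookup, which is handy for estimating how many indices sit in each tier. This is a simplification: the function name `ilm_phase` is an assumption, ages are counted from rollover for rolled-over indices, and ILM evaluates conditions periodically, so real transitions lag the thresholds slightly.

```python
def ilm_phase(days_since_rollover: float) -> str:
    """Map an index age to the phase the seo_logs_retention policy targets."""
    if days_since_rollover >= 90:
        return "delete"   # past the retention window
    if days_since_rollover >= 7:
        return "warm"     # shrunk to one shard and force-merged
    return "hot"          # still receiving writes or recently rolled over

# Count how many of the last 100 daily indices fall in each phase.
counts = {}
for age in range(100):
    counts[ilm_phase(age)] = counts.get(ilm_phase(age), 0) + 1
print(counts)
```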
Step 2: Attach the policy via an index template
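A template applies the policy and field mappings to every new seo-logs-* index automatically when the index is created. The rollover action only runs on an index that is the write index behind the seo-logs alias (or a data stream backing index). Date-stamped indices that the Logstash output creates directly by name will report a rollover error in GET seo-logs-*/_ilm/explain.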
curl -X PUT "https://elasticsearch:9200/_index_template/seo-logs-template" \
  -H "Content-Type: application/json" \
  -u elastic:${ES_PASSWORD} \
  -d '{
  "index_patterns": ["seo-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "index.lifecycle.name": "seo_logs_retention",
      "index.lifecycle.rollover_alias": "seo-logs"
    },
    "mappings": {
      "properties": {
        "clientip": { "type": "ip" },
        "request": { "type": "keyword" },
        "agent": { "type": "keyword" }
      }
    }
  }
}'
Expected Output: {"acknowledged":true}. New indices created after this call inherit the ILM policy and the keyword mappings.
Production Warning: Avoid over-provisioning primary shards. Start with one or two shards per daily index. Each shard carries fixed memory and file-handle overhead on the data nodes, so thousands of tiny shards degrade query performance and complicate ILM transitions far more than a few correctly sized ones. Aim for shards in the 10–50gb range. The retention window here also has to satisfy any legal hold; align it with your log retention policies before deleting anything.
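The sizing guidance reduces to simple arithmetic. The sketch below assumes a target of 30 GB per shard, a midpoint picked here from the 10–50gb range above rather than an Elastic recommendation, and the function name `primary_shards` is illustrative.

```python
import math

def primary_shards(daily_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate primary shards for one daily index from its expected size."""
    # Never go below one shard; round up so no shard exceeds the target.
    return max(1, math.ceil(daily_gb / target_shard_gb))

for volume in (1.2, 30, 75):
    print(volume, "GB/day ->", primary_shards(volume), "primary shard(s)")
```

A site producing about 1.2 GB of logs per day needs a single primary shard. Only at roughly 75 GB per day does the estimate reach three.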
Validation & Troubleshooting
ELK ingestion failures are usually about grok mismatches, circuit breakers, or timestamps. Each named failure mode below has a confirming command and a recovery recipe. For an architectural deep dive into how these components scale, see the companion guide on ELK Stack architecture for SEO log analysis.
Step 1: Monitor pipeline health
Confirm indices are being created and documents are flowing.
curl -s "https://elasticsearch:9200/_cat/indices/seo-logs-*?v&h=index,docs.count,store.size" \
-u elastic:${ES_PASSWORD}
Expected Output:
index               docs.count store.size
seo-logs-2024.10.25 1542000    1.2gb
A stalled docs.count means ingestion stopped — work through the failure modes below.
Failure Mode 1: Grok parse failures.
Symptom: documents arrive but verb, response, and agent are empty, and the _grokparsefailure tag is present.
curl -s "https://elasticsearch:9200/seo-logs-*/_count" \
-u elastic:${ES_PASSWORD} \
-H "Content-Type: application/json" \
-d '{"query":{"term":{"tags":"_grokparsefailure"}}}'
Recovery: A non-zero count means your log format diverged from %{COMBINEDAPACHELOG} — often a custom log_format directive adding an upstream-time or request-ID field. Build the corrected pattern in Kibana's Grok Debugger against a real line, then redeploy the pipeline.
Failure Mode 2: Registry corruption (duplicate or missing docs).
Symptom: a single host either re-ships old lines or stops shipping entirely.
Recovery: Stop Filebeat, delete /var/lib/filebeat/registry, and restart only if you accept a one-time re-read. Otherwise, fix the underlying disk issue and let Filebeat resume from its last offset.
Failure Mode 3: Circuit breakers rejecting bulk requests.
Symptom: Logstash logs 429 Too Many Requests or circuit_breaking_exception.
curl -s "https://elasticsearch:9200/_nodes/stats/breaker?pretty" \
-u elastic:${ES_PASSWORD} | grep -A3 '"parent"'
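Recovery: Enable Logstash persistent queues (queue.type: persisted) to absorb spikes, lower Filebeat's bulk_max_size, and only as a last resort adjust indices.breaker.total.limit in elasticsearch.yml. Check the current value first: the default is 95% of heap when real-memory accounting (indices.breaker.total.use_real_memory) is on and 70% when it is off, so setting 75% is only an increase in the latter case. Raising the breaker without adding memory just delays the out-of-memory kill.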
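Rather than grepping the raw response, the parent breaker's headroom can be computed per node. The sketch below assumes the usual shape of the _nodes/stats/breaker response (nodes, then breakers.parent with estimated_size_in_bytes and limit_size_in_bytes). The sample document, its node ID, and its numbers are invented for illustration.

```python
def parent_breaker_usage(stats: dict) -> dict:
    """Return the fraction of the parent breaker limit in use, keyed by node name."""
    usage = {}
    for node_id, node in stats["nodes"].items():
        parent = node["breakers"]["parent"]
        ratio = parent["estimated_size_in_bytes"] / parent["limit_size_in_bytes"]
        usage[node.get("name", node_id)] = round(ratio, 3)
    return usage

# Illustrative response fragment; real values come from the curl call above.
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 1000,
                    "estimated_size_in_bytes": 900,
                    "tripped": 4,
                }
            },
        }
    }
}
print(parent_breaker_usage(sample_stats))
```

A ratio that stays close to 1.0 on a data node suggests the node needs smaller bulk requests or more heap, not a higher limit.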
Failure Mode 4: Timestamp gaps in Kibana.
Symptom: data appears clustered at ingest time or arrives out of order on the time axis.
Recovery: Confirm the date filter's pattern matches your log exactly (dd/MMM/yyyy:HH:mm:ss Z for nginx default) and that it targets @timestamp. A timezone mismatch here distorts every crawl-rate chart.
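Failure Mode 5: Rejected events piling up in the DLQ.
Symptom: events vanish from the main index under heavy malformed input. The dead letter queue captures events that Elasticsearch rejects, such as mapping conflicts. Grok failures are not sent there, because those events still index with the _grokparsefailure tag. Enable and inspect the dead letter queue.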
# In /etc/logstash/logstash.yml, enable DLQ:
# dead_letter_queue.enable: true
# dead_letter_queue.max_bytes: 1024mb
# Reprocess DLQ events by pointing a separate pipeline at the
# dead_letter_queue input plugin in a dedicated pipeline config —
# not via a --path.data override.
Recovery: Read the DLQ with a second pipeline that uses the dead_letter_queue input, inspect the reason metadata on each event, fix the mapping or grok issue, and re-emit. This recovers data that a mapping conflict would otherwise lose.
Common Mistakes
- Putting the user agent in the wrong field: Matching [http_user_agent] when %{COMBINEDAPACHELOG} produces [agent] means the bot_traffic tag never fires and crawler analysis is empty. Fix: tag on [agent], or rename the field once with mutate.
- Leaving request and agent as text: The default text analyzer tokenizes every URL and user agent, bloating the index and slowing exact-match aggregations. Fix: map both as keyword in the index template before the first rollover.
- Skipping the date filter: Without it, @timestamp is ingest time, so every crawl-rate and time-series chart is wrong during lag or backfill. Fix: add the date filter with the exact nginx layout targeting @timestamp.
- Over-provisioning primary shards: Too many small shards multiply heap and file-handle overhead on the data nodes and slow queries. Fix: start with one or two shards per daily index and let ILM shrink warm indices to one.
- Ignoring _grokparsefailure documents: Unparsed lines still index and quietly skew aggregations. Fix: route failures to a separate index and alert on their count rather than absorbing them.
Frequently Asked Questions
How do I handle high-volume log ingestion without overwhelming Elasticsearch?
Implement Logstash persistent queues (queue.type: persisted) to absorb bursts, tune Filebeat output batch sizes (bulk_max_size), and use ILM to roll indices over before they exceed an optimal shard size. If write pressure still spikes the circuit breakers, front the pipeline with a buffer; the Vector.dev pipeline configuration guide covers backpressure-aware buffering you can place ahead of Logstash.
Can the ELK Stack replace traditional SEO log analyzers?
Yes. Paired with custom Kibana dashboards and grok parsers, ELK provides scalability, real-time alerting, and deeper crawl-budget tracking than legacy GUI analyzers. The trade-off is operational overhead: you run and tune Elasticsearch. If you only need crawl-rate and status slicing without free-text search, a lighter index like Grafana Loki log aggregation is cheaper to operate.
ELK, Vector, or CloudWatch — which pipeline should I choose?
It depends on scale, budget, and whether you need full free-text search. ELK is strongest when you query arbitrary fields across billions of documents; managed and index-light alternatives win on cost and operational simplicity. The ELK vs Vector.dev vs CloudWatch for SEO log pipelines comparison weighs each option against concrete crawl-analytics workloads.
What is the best way to filter out internal IP traffic and staging environments?
Use a Logstash conditional, but note the in operator does not understand CIDR ranges natively — use the cidr filter plugin to test clientip against ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"] and drop {} matches, or route them to a separate internal-logs-* index so they never pollute crawl aggregations.
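The CIDR test that the cidr filter performs can be reproduced with Python's standard ipaddress module, which is useful for checking a list of ranges before deploying it. The function name `is_internal` is an assumption, and the ranges are the ones listed in the answer above.

```python
import ipaddress

# The private ranges named above, parsed once into network objects.
INTERNAL = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_internal(clientip: str) -> bool:
    """Return True when the client IP falls inside any internal range."""
    addr = ipaddress.ip_address(clientip)
    return any(addr in net for net in INTERNAL)

for ip in ("10.1.2.3", "172.31.0.1", "172.32.0.1", "66.249.66.1"):
    print(ip, is_internal(ip))
```

Note that 172.32.0.1 sits just outside 172.16.0.0/12, the kind of boundary a plain string or prefix comparison gets wrong.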
Related Guides
- ELK Stack Architecture for SEO Log Analysis — the data-flow blueprint and node-sizing detail behind this pipeline.
- ELK vs Vector.dev vs CloudWatch for SEO Log Pipelines — choose the right aggregation stack for your scale and budget.
- Grafana Loki for SEO Log Aggregation — the index-light alternative when you do not need full-text search.
- Vector.dev Pipeline Configuration — front Logstash with a buffer for backpressure and advanced transforms.
- Structured JSON Logging for Analysis — emit JSON from nginx to skip grok entirely.
Part of the Log Parsing Workflows & CLI Toolchains series.