Handling High-Volume Log Backpressure in Vector
When a downstream sink such as Elasticsearch or S3 slows down, Vector does not silently absorb the surge. It propagates backpressure up the topology: the sink stops acknowledging, the connecting buffer fills, and the file source stops reading new lines. During a Googlebot crawl spike or a log-rotation backfill, that stall can freeze your entire crawl-telemetry pipeline within seconds. This guide shows how to diagnose backpressure through Vector's internal metrics and fix it with disk buffers, batching, and sink concurrency tuning, as part of the broader Vector.dev Pipeline Configuration workflow.
You will learn to read component_errors_total, buffer event gauges, and component utilization to pinpoint which component is the bottleneck, then apply a disk-backed buffer with an explicit when_full policy so a slow sink never loses crawl events you need for SEO analysis.
Diagnosis: Spotting a Stalled Source in Real Output
The first symptom is counterintuitive: your source stops emitting even though the web server is still writing logs. Run vector top and watch the events-per-second column collapse to zero on the source while the sink shows non-zero errors.
┌──────────────────────────────────────────────────────────────┐
│ Component │ Events/s │ CPU % │ Memory │ Errors │ Buffer │
├──────────────────────────────────────────────────────────────┤
│ web_access │ 0 │ 0.1% │ 44MB │ 0 │ — │
│ parse_logs │ 0 │ 0.0% │ 110MB │ 0 │ — │
│ es_sink │ 180 │ 9.7% │ 96MB │ 412 │ 98% │
└──────────────────────────────────────────────────────────────┘
The source reads 0 events/s, but es_sink is still draining slowly with a near-full buffer and a climbing error count. That is backpressure, not a crash. Confirm it with a single query against the Prometheus metrics endpoint:
curl -s http://localhost:8686/metrics | grep -E 'buffer_events|component_errors_total'
Expected Output:
vector_buffer_events{component_id="es_sink",stage="0"} 49984
vector_buffer_byte_size{component_id="es_sink"} 41284096
vector_component_errors_total{component_id="es_sink",error_type="request_failed"} 412
A vector_buffer_events value pinned near the buffer's max_events (here ~50k) plus a rising request_failed count is the definitive signature: the sink cannot keep up, the buffer is saturated, and the saturation has rippled back to the source.
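To turn that signature into a repeatable check, the scrape can be parsed and compared against the buffer capacity. The following is a minimal sketch, not part of Vector itself: the helper names parse_metrics and buffer_fill_ratio are hypothetical, the sample text is copied from the expected output above, and the 50,000-event capacity is the assumed max_events for this pipeline.

```python
import re

# Sample lines copied from the expected output above.
SAMPLE = '''\
vector_buffer_events{component_id="es_sink",stage="0"} 49984
vector_buffer_byte_size{component_id="es_sink"} 41284096
vector_component_errors_total{component_id="es_sink",error_type="request_failed"} 412
'''

# Matches labelled samples only; comment lines and unlabelled metrics are skipped.
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from Prometheus text."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, raw_labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def buffer_fill_ratio(samples, component_id, max_events):
    """Fraction of the event-count capacity in use for one component, or None."""
    for name, labels, value in samples:
        if name == "vector_buffer_events" and labels.get("component_id") == component_id:
            return value / max_events
    return None


samples = parse_metrics(SAMPLE)
ratio = buffer_fill_ratio(samples, "es_sink", max_events=50_000)  # assumed capacity
print(f"es_sink buffer fill: {ratio:.2%}")
```

In practice the SAMPLE string would be replaced by the body of the curl request shown earlier, and a ratio that stays above roughly 0.95 across several scrapes would be the cue to investigate the sink.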
Concept: Why the Source Stalls
Vector connects components with bounded buffers. By default these are in-memory buffers with when_full = "block". When the sink slows, the buffer fills; once full, "block" pauses the upstream component, which pauses its upstream, all the way to the file source. This is deliberate: blocking is how Vector guarantees at-least-once delivery without unbounded memory growth. The cost is that a slow Elasticsearch cluster can pause ingestion entirely. The fix is not to remove backpressure but to give it somewhere to go (a disk buffer) and to make the sink drain faster (concurrency and batching).
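The blocking behaviour can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not Vector's scheduler: one bounded buffer, a sink that drains a fixed number of events per tick, and a source that may only write into free space. The function name simulate and all rates are hypothetical.

```python
def simulate(ticks, source_rate, sink_rate, capacity):
    """Tick-based model of one bounded buffer with when_full = "block".

    Each tick the sink drains up to sink_rate events, then the source
    writes up to source_rate events, but only into free space.
    Returns how many events the source managed to emit on each tick.
    """
    buffered = 0
    emitted_per_tick = []
    for _ in range(ticks):
        buffered -= min(buffered, sink_rate)   # sink drains first
        free = capacity - buffered
        emitted = min(source_rate, free)       # source blocks when no space is left
        buffered += emitted
        emitted_per_tick.append(emitted)
    return emitted_per_tick


# Hypothetical numbers: the source offers 1000 events per tick, the sink drains 180.
print(simulate(ticks=5, source_rate=1000, sink_rate=180, capacity=2000))
# prints [1000, 1000, 360, 180, 180]
```

Once the buffer fills, the source can emit no faster than the sink drains. A larger capacity delays that moment but does not change the steady state, which is why the fix combines a bigger buffer with a faster sink.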
Step-by-Step Fix
Step 1: Measure component utilization to confirm the bottleneck. Before changing config, prove the sink is the constraint, not the transform. The utilization metric reports how busy each component is on a 0–1 scale.
curl -s http://localhost:8686/metrics | grep vector_utilization
Expected Output:
vector_utilization{component_id="parse_logs"} 0.07
vector_utilization{component_id="es_sink"} 0.99
Explanation: es_sink at 0.99 is saturated; parse_logs at 0.07 is idle. The sink is the bottleneck, so tuning the source or transform would be wasted effort.
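When a topology has many components, the same comparison can be automated. This is a small sketch with a hypothetical helper name (find_bottleneck) and an assumed saturation threshold of 0.9. The readings are the two values from the output above.

```python
def find_bottleneck(utilization, threshold=0.9):
    """Return the busiest component if it meets the threshold, else None."""
    component, value = max(utilization.items(), key=lambda item: item[1])
    return component if value >= threshold else None


readings = {"parse_logs": 0.07, "es_sink": 0.99}
print(find_bottleneck(readings))  # prints es_sink
```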
Step 2: Switch the sink buffer to disk with an explicit policy. A disk buffer absorbs far more backlog than memory, letting the source keep reading during transient sink slowdowns. Set max_size in bytes and choose when_full deliberately.
[sinks.es_sink]
type = "elasticsearch"
inputs = ["parse_logs"]
endpoints = ["https://es.internal:9200"]
bulk.index = "crawl-logs-%Y.%m.%d"
[sinks.es_sink.buffer]
type = "disk"
max_size = 2147483648 # 2 GiB on-disk backlog
when_full = "block" # never drop crawl events
Explanation: max_size is a byte count, not an event count, for disk buffers (minimum 256 MiB). With when_full = "block", a full buffer still pauses the source rather than discarding data, but 2 GiB of disk buys minutes of headroom instead of the seconds an in-memory buffer provides.
Production Warning: when_full = "drop_newest" discards incoming events the instant the buffer fills. On a crawl-analysis pipeline this silently deletes the Googlebot hits you are trying to measure, corrupting crawl-budget history with no error in the sink logs. Only use drop_newest for genuinely disposable, high-cardinality telemetry, never for the access-log stream that feeds SEO reporting. Default to block and size the disk buffer for your worst-case sink outage.
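To size the disk buffer for a given outage, a back-of-the-envelope estimate helps. The sketch below is a rough calculation under stated assumptions: the average event size is taken from the earlier metrics sample (vector_buffer_byte_size divided by vector_buffer_events), the 9,800 events/s ingest rate is an assumed steady rate, and the on-disk size per event is assumed to be similar to the reported byte size. The function name headroom_seconds is hypothetical.

```python
def headroom_seconds(max_size_bytes, events_per_sec, avg_event_bytes):
    """Seconds of a total sink outage the buffer can absorb before it fills."""
    return max_size_bytes / (events_per_sec * avg_event_bytes)


# Average event size estimated from the earlier scrape: about 826 bytes.
avg_event_bytes = 41_284_096 / 49_984

seconds = headroom_seconds(
    max_size_bytes=2 * 1024**3,   # the 2 GiB max_size from Step 2
    events_per_sec=9_800,         # assumed steady ingest rate
    avg_event_bytes=avg_event_bytes,
)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

Under these assumptions 2 GiB covers a little over four minutes of a complete sink outage. Running the same calculation with your own measured rate and the longest outage you need to ride out gives a starting value for max_size.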
Step 3: Raise sink concurrency and tune batching. A blocked buffer often means the sink is under-parallelized. Let Vector adapt concurrency automatically and size batches so each request carries meaningful volume without timing out.
[sinks.es_sink.batch]
max_events = 4000
timeout_secs = 5
[sinks.es_sink.request]
concurrency = "adaptive"
retry_attempts = 10
Explanation: concurrency = "adaptive" lets Vector's AIMD controller probe for the highest request rate Elasticsearch tolerates, backing off automatically on 429 responses. Larger batches reduce per-request overhead; timeout_secs caps how long a partial batch waits before flushing.
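The AIMD idea (additive increase, multiplicative decrease) can be sketched in a few lines. This is an illustration of the general technique only, not Vector's implementation, which also takes measured response times into account. The class name AimdLimit, the maximum of 64, and the decrease ratio of 0.5 are hypothetical values chosen for the example.

```python
class AimdLimit:
    """Additive-increase / multiplicative-decrease concurrency limit."""

    def __init__(self, initial=1, maximum=64, decrease_ratio=0.5):
        self.limit = initial
        self.maximum = maximum
        self.decrease_ratio = decrease_ratio

    def on_success(self):
        # Additive increase: allow one more in-flight request, up to the cap.
        self.limit = min(self.limit + 1, self.maximum)

    def on_backpressure(self):
        # Multiplicative decrease: back off quickly, but never below 1.
        self.limit = max(1, int(self.limit * self.decrease_ratio))


limiter = AimdLimit(initial=1)
for _ in range(9):
    limiter.on_success()      # limit climbs from 1 to 10
limiter.on_backpressure()     # for example a 429 response: 10 drops to 5
print(limiter.limit)          # prints 5
```

The slow climb and fast drop are what produce the sawtooth pattern when the sink flaps between healthy and overloaded.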
Step 4: Enable end-to-end acknowledgements. Acknowledgements tie the file source's read cursor to confirmed sink delivery. The source only advances its checkpoint after the sink durably accepts the data, so a crash mid-backlog replays rather than drops.
# Configure acknowledgements on the sink; the source-level option is deprecated.
[sinks.es_sink.acknowledgements]
enabled = true
Expected Output (after reload, vector top):
│ web_access │ 9,800 │ 2.0% │ 47MB │ 0 │ — │
│ es_sink │ 9,800 │ 7.1% │ 88MB │ 0 │ 12% │
The source now matches the sink at ~9,800 events/s and buffer utilization has dropped from 98% to 12%.
Edge-Case Handling
Disk buffer fills during a multi-hour sink outage. Even 2 GiB eventually saturates if Elasticsearch is down for hours. With when_full = "block" the source stalls and the web server's own log file keeps growing on disk, so no events are lost; once the sink recovers, Vector drains the backlog. Pair this with generous on-disk log retention so the file source can replay rotated files. If you cannot tolerate any source stall, fan out to a second durable sink (for example S3 as a cold archive) so one slow destination never blocks the pipeline.
Adaptive concurrency oscillates under a flapping sink. If Elasticsearch returns intermittent 503s, adaptive concurrency can sawtooth. Confirm with vector_adaptive_concurrency_limit in the metrics; if it never stabilizes, replace adaptive mode with a fixed limit such as concurrency = 8 (a fixed value is a hard limit, not a floor) and raise retry_max_duration_secs so transient failures are retried instead of erroring out.
Verification
Confirm the fix held by checking that buffer events stay low and the error counter has stopped climbing under load.
curl -s http://localhost:8686/metrics | grep -E 'vector_buffer_events|component_errors_total' | grep es_sink
Expected Output:
vector_buffer_events{component_id="es_sink",stage="0"} 6021
vector_component_errors_total{component_id="es_sink",error_type="request_failed"} 412
Buffer occupancy is far below the level seen during the incident (6,021 events versus 49,984), and request_failed is flat (still 412 from the earlier incident, no new failures). A stable, non-climbing error count plus low buffer occupancy confirms backpressure is resolved.
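The same check can be expressed as a comparison between two scrapes. This is a minimal sketch, assuming the counter values have already been extracted from the metrics endpoint. The function name backpressure_resolved and the 10,000-event health threshold are hypothetical, and the 530 in the second call is an invented failing case for contrast.

```python
def backpressure_resolved(prev_errors, curr_errors, buffer_events, max_healthy_events):
    """Healthy when no new errors appeared and buffer occupancy is under the threshold."""
    new_errors = curr_errors - prev_errors
    return new_errors == 0 and buffer_events <= max_healthy_events


# Incident value (412) versus the verification scrape above.
print(backpressure_resolved(412, 412, buffer_events=6_021, max_healthy_events=10_000))   # prints True
# Hypothetical failing case: errors still climbing and the buffer still full.
print(backpressure_resolved(412, 530, buffer_events=49_984, max_healthy_events=10_000))  # prints False
```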
Common Mistakes
- Leaving the default in-memory buffer in production. The default memory buffer holds only 500 events, so any sink hiccup stalls the source almost instantly. Always declare a disk buffer with a sized max_size on sinks that talk to remote services.
- Choosing drop_newest to "stop the stalls." This trades a visible stall for invisible data loss. The pipeline looks healthy in vector top while crawl events vanish. Use block plus a larger disk buffer instead.
- Sizing batches by event count without watching timeouts. A huge max_events with a long timeout_secs lets batches sit unflushed, inflating latency and buffer occupancy. Balance batch size against timeout_secs so events flush promptly under low traffic.
Frequently Asked Questions
How do I know if Vector is dropping events or just stalling?
Check vector_component_errors_total, vector_component_discarded_events_total, and your buffer's when_full setting. With when_full = "block" Vector stalls but does not drop, so a flat discard counter plus a full buffer means a stall. A climbing vector_component_discarded_events_total counter or a drop_newest policy means actual loss.
Should the disk buffer live on the same volume as the logs?
Prefer a separate fast volume. Co-locating the buffer with the web server's log directory risks filling the disk that the source is reading from, which can crash the web server itself. Size max_size to your worst expected sink outage and monitor free space.
Does enabling acknowledgements slow the pipeline down?
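Throughput cost is small. Acknowledgements work with both memory and disk buffers, with one caveat: when a sink uses a disk buffer, Vector treats an event as delivered once it is durably written to that buffer, not only once Elasticsearch has accepted it. Either way, the file source only advances its read checkpoint after that confirmation, so a restart mid-backlog replays unconfirmed events rather than losing them.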
Related Guides
- Parsing Malformed JSON Logs with Vector VRL — keep the transform from erroring under the same load that triggers backpressure.
- ELK Stack Log Ingestion — tune the Elasticsearch side so the sink stops being the bottleneck.
- ELK vs Vector.dev vs CloudWatch for SEO Log Pipelines — compare durability and buffering trade-offs across pipeline architectures.
- Grafana Loki for SEO Log Aggregation — an alternative sink with different backpressure characteristics.
Part of the Vector.dev Pipeline Configuration series.