Handling High-Volume Log Backpressure in Vector

When a downstream sink such as Elasticsearch or S3 slows down, Vector does not silently absorb the surge. It propagates backpressure up the topology: the sink stops acknowledging, the connecting buffer fills, and the file source stops reading new lines. During a Googlebot crawl spike or a log-rotation backfill, that stall can freeze your entire crawl-telemetry pipeline within seconds. This guide shows how to diagnose backpressure through Vector's internal metrics and fix it with disk buffers, batching, and sink concurrency tuning, as part of the broader Vector.dev Pipeline Configuration workflow.

You will learn to read component_errors_total, buffer event gauges, and component utilization to pinpoint which component is the bottleneck, then apply a disk-backed buffer with an explicit when_full policy so a slow sink never loses crawl events you need for SEO analysis.

Diagnosis: Spotting a Stalled Source in Real Output

The first symptom is counterintuitive: your source stops emitting even though the web server is still writing logs. Run vector top (it needs Vector's API enabled) and watch the source's event rate collapse to zero while the sink shows non-zero errors. The table below is an illustrative summary; the columns vector top actually prints vary by version.

┌────────────────┬──────────┬───────┬────────┬────────┬─────────┐
│ Component      │ Events/s │ CPU % │ Memory │ Errors │ Buffer  │
├────────────────┼──────────┼───────┼────────┼────────┼─────────┤
│ web_access     │      0   │ 0.1%  │ 44MB   │   0    │   -     │
│ parse_logs     │      0   │ 0.0%  │ 110MB  │   0    │   -     │
│ es_sink        │    180   │ 9.7%  │ 96MB   │  412   │  98%    │
└────────────────┴──────────┴───────┴────────┴────────┴─────────┘

The source reads 0 events/s, but es_sink is still draining slowly with a near-full buffer and a climbing error count. That is backpressure, not a crash. Confirm it with a single query against the Prometheus metrics endpoint, which a prometheus_exporter sink serves on port 9598 by default (port 8686 belongs to Vector's API, which vector top uses):

curl -s http://localhost:9598/metrics | grep -E 'buffer_events|component_errors_total'

Expected Output:

vector_buffer_events{component_id="es_sink",stage="0"} 49984
vector_buffer_byte_size{component_id="es_sink"} 41284096
vector_component_errors_total{component_id="es_sink",error_type="request_failed"} 412

A vector_buffer_events value pinned near the buffer's max_events (here ~50k) plus a rising request_failed count is the definitive signature: the sink cannot keep up, the buffer is saturated, and the saturation has rippled back to the source.
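The /metrics endpoint queried above exists only when the topology exports Vector's own telemetry. A minimal sketch, assuming the component names vector_metrics and metrics_endpoint (both illustrative, not from this pipeline) and the exporter's default listen address; point the curl commands at whatever address you configure here:

```toml
# Required by `vector top`; the API listens on 127.0.0.1:8686 by default.
[api]
enabled = true

# Collect Vector's internal metrics (names get the "vector_" prefix by default).
[sources.vector_metrics]        # illustrative component name
type = "internal_metrics"

# Serve those metrics in Prometheus text format for curl or a scraper.
[sinks.metrics_endpoint]        # illustrative component name
type = "prometheus_exporter"
inputs = ["vector_metrics"]
address = "0.0.0.0:9598"        # the exporter's default address
```

Keeping the exporter on its own port avoids a clash with the API listener that vector top depends on.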

Concept: Why the Source Stalls

Vector connects components with bounded buffers. By default these are in-memory buffers with when_full = "block". When the sink slows, the buffer fills; once full, "block" pauses the upstream component, which pauses its upstream, all the way to the file source. This is deliberate: blocking is how Vector avoids dropping events without letting memory grow unbounded. The cost is that a slow Elasticsearch cluster can pause ingestion entirely. The fix is not to remove backpressure but to give it somewhere to go (a disk buffer) and to make the sink drain faster (concurrency and batching).
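Spelled out explicitly, the buffer a sink gets when you configure nothing looks roughly like the sketch below. It uses es_sink, the sink name from this guide; the values are the documented defaults, but confirm them against your Vector version:

```toml
# Implicit default buffer, written out for illustration.
[sinks.es_sink.buffer]
type = "memory"       # events are held in RAM only
max_events = 500      # default capacity of a memory buffer
when_full = "block"   # apply backpressure upstream instead of dropping
```

Seeing the defaults written down makes it clear why a small memory buffer saturates within moments of a sink slowdown.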

Step-by-Step Fix

Step 1: Measure component utilization to confirm the bottleneck. Before changing config, prove the sink is the constraint, not the transform. The utilization metric reports how busy each component is on a 0–1 scale.

curl -s http://localhost:9598/metrics | grep vector_utilization

Expected Output:

vector_utilization{component_id="parse_logs"} 0.07
vector_utilization{component_id="es_sink"} 0.99

Explanation: es_sink at 0.99 is saturated; parse_logs at 0.07 is idle. The sink is the bottleneck, so tuning the source or transform would be wasted effort.

Step 2: Switch the sink buffer to disk with an explicit policy. A disk buffer absorbs far more backlog than memory, letting the source keep reading during transient sink slowdowns. Set max_size in bytes and choose when_full deliberately.

[sinks.es_sink]
type = "elasticsearch"
inputs = ["parse_logs"]
endpoints = ["https://es.internal:9200"]
bulk.index = "crawl-logs-%Y.%m.%d"

[sinks.es_sink.buffer]
type = "disk"
max_size = 2147483648   # 2 GiB on-disk backlog
when_full = "block"     # never drop crawl events

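Explanation: max_size is a byte count, not an event count, for disk buffers (minimum roughly 256 MiB). With when_full = "block", a full buffer still pauses the source rather than discarding data, but 2 GiB of disk buys minutes of headroom instead of the seconds an in-memory buffer provides. As a rough estimate, the metrics above imply about 826 bytes per buffered event (41,284,096 bytes / 49,984 events); at roughly 9,800 events/s that is about 8 MB/s, so 2 GiB lasts a little over four minutes, assuming the on-disk encoding is of similar size.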

Production Warning: when_full = "drop_newest" discards incoming events the instant the buffer fills. On a crawl-analysis pipeline this silently deletes the Googlebot hits you are trying to measure, corrupting crawl-budget history with no error in the sink logs. Only use drop_newest for genuinely disposable, high-cardinality telemetry, never for the access-log stream that feeds SEO reporting. Default to block and size the disk buffer for your worst-case sink outage.

Step 3: Raise sink concurrency and tune batching. A blocked buffer often means the sink is under-parallelized. Let Vector adapt concurrency automatically and size batches so each request carries meaningful volume without timing out.

[sinks.es_sink.batch]
max_events = 4000
timeout_secs = 5

[sinks.es_sink.request]
concurrency = "adaptive"
retry_attempts = 10

Explanation: concurrency = "adaptive" lets Vector's AIMD controller probe for the highest number of in-flight requests Elasticsearch tolerates, backing off automatically when response times climb or Elasticsearch returns 429 responses. Larger batches reduce per-request overhead; timeout_secs caps how long a partial batch waits before flushing.

Step 4: Enable end-to-end acknowledgements. Acknowledgements tie the file source's read cursor to confirmed delivery. Enable them on the sink (the source-level acknowledgements option is deprecated in current Vector releases); the file source then advances its checkpoint only after the sink reports the events as delivered, so a crash mid-backlog replays rather than drops.

[sinks.es_sink.acknowledgements]
enabled = true

Expected Output (after reload, vector top):

│ web_access     │  9,800   │ 2.0%  │ 47MB   │   0    │   —     │
│ es_sink        │  9,800   │ 7.1%  │ 88MB   │   0    │  12%    │

The source now matches the sink at ~9,800 events/s and buffer utilization has dropped from 98% to 12%.
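If several sinks need the same guarantee, acknowledgements can also be switched on once at the top level rather than per component. A minimal sketch, assuming you want the setting to apply to every sink that supports it:

```toml
# Top-level (global) option: becomes the default for all sinks.
[acknowledgements]
enabled = true
```

A per-sink acknowledgements table still overrides this default, which is useful when one destination is disposable.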

Edge-Case Handling

Disk buffer fills during a multi-hour sink outage. Even 2 GiB eventually saturates if Elasticsearch is down for hours. With when_full = "block" the source stalls and the web server's own log file keeps growing on disk, so no events are lost; once the sink recovers, Vector drains the backlog. Pair this with generous on-disk log retention so the file source can replay rotated files. If you cannot tolerate any source stall, fan out to a second durable sink (for example S3 as a cold archive) so one slow destination never blocks the pipeline.
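The fan-out to a cold archive could be sketched as follows. The sink name s3_archive, the bucket, the region, and the key prefix are all hypothetical placeholders, and the 1 GiB buffer size is an illustrative choice rather than a recommendation:

```toml
# Second sink reading from the same transform as es_sink.
[sinks.s3_archive]                     # hypothetical sink name
type = "aws_s3"
inputs = ["parse_logs"]
bucket = "example-crawl-log-archive"   # hypothetical bucket
region = "us-east-1"                   # hypothetical region
key_prefix = "crawl-logs/date=%F/"     # one prefix per day
compression = "gzip"
encoding.codec = "json"
framing.method = "newline_delimited"

# Each sink has its own buffer, sized independently of es_sink.
[sinks.s3_archive.buffer]
type = "disk"
max_size = 1073741824   # 1 GiB
when_full = "block"
```

Test the fan-out under a simulated Elasticsearch outage before relying on it, so you know how the two buffers interact in your topology.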

Adaptive concurrency oscillates under a flapping sink. If Elasticsearch returns intermittent 503s, adaptive concurrency can sawtooth. Confirm with vector_adaptive_concurrency_limit in the metrics; if it never stabilizes, pin a fixed limit such as concurrency = 8 and raise retry_max_duration_secs, which caps the backoff between attempts, so the configured retries span a longer window and transient failures are retried rather than surfaced as errors.
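A fixed-concurrency variant of the request settings might look like this sketch. The numbers are illustrative starting points under the assumption of a small Elasticsearch cluster, not tuned values:

```toml
[sinks.es_sink.request]
concurrency = 8                  # fixed in-flight request limit, replaces "adaptive"
retry_attempts = 10              # same retry budget as in Step 3
retry_initial_backoff_secs = 1   # first wait before retrying
retry_max_duration_secs = 60     # upper bound on the wait between retries
```

Once the cluster is stable again, switching back to concurrency = "adaptive" lets Vector resume probing for extra headroom.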

Verification

Confirm the fix held by checking that buffer events stay low and the error counter has stopped climbing under load.

curl -s http://localhost:9598/metrics | grep -E 'vector_buffer_events|component_errors_total' | grep es_sink

Expected Output:

vector_buffer_events{component_id="es_sink",stage="0"} 6021
vector_component_errors_total{component_id="es_sink",error_type="request_failed"} 412

Buffer occupancy is low (about 6,000 queued events, against roughly 50,000 during the incident) and request_failed is flat (still 412 from the earlier incident, no new failures). A stable, non-climbing error count plus low buffer occupancy confirms backpressure is resolved.

Common Mistakes

  • Leaving the default in-memory buffer in production. The default memory buffer holds only 500 events, so any sink hiccup stalls the source almost instantly. Always declare a disk buffer with a sized max_size on sinks that talk to remote services.
  • Choosing drop_newest to "stop the stalls." This trades a visible stall for invisible data loss. The pipeline looks healthy in vector top while crawl events vanish. Use block plus a larger disk buffer instead.
  • Sizing batches by event count without watching timeouts. Huge max_events with a long timeout_secs lets batches sit unflushed, inflating latency and buffer occupancy. Balance batch size against timeout_secs so events flush promptly under low traffic.

Frequently Asked Questions

How do I know if Vector is dropping events or just stalling?
Check vector_component_errors_total, the discard counters, and your buffer's when_full setting. With when_full = "block" Vector stalls but does not drop, so a full buffer with flat discard counters means a stall. A climbing vector_component_discarded_events_total or vector_buffer_discarded_events_total, or a drop_newest policy on a full buffer, means actual loss.

Should the disk buffer live on the same volume as the logs?
Prefer a separate fast volume. Co-locating the buffer with the web server's log directory risks filling the disk that the source is reading from, which can crash the web server itself. Size max_size to your worst expected sink outage and monitor free space.
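Disk buffers are written under Vector's data_dir, so moving them to another volume is a one-line change. A minimal sketch, assuming a dedicated volume mounted at the hypothetical path /mnt/vector-buffer:

```toml
# Global option at the top level of the config file.
# The directory must exist and be writable by the Vector process.
data_dir = "/mnt/vector-buffer/vector"   # hypothetical mount point
```

The file source keeps its checkpoints under data_dir as well, so plan the move during a maintenance window rather than mid-incident.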

Does enabling acknowledgements slow the pipeline down?
Throughput cost is small. Acknowledgements only take effect when both the source and the sink support them, which the file source and the Elasticsearch sink do. The benefit is that the file source only advances its read checkpoint after durable delivery, so a restart mid-backlog replays unconfirmed events rather than losing them.

Part of the Vector.dev Pipeline Configuration series.