Log Retention Policies

Log retention policy is the contract that decides how long every access and error line lives before it is compressed, archived, or destroyed. Set the window too short and you lose the historical crawl record you need to diagnose a Googlebot slowdown three months after it began; set it too long, or store it carelessly, and you breach data-minimization rules under the GDPR while quietly filling a production volume until inodes run out. A retention policy is where SEO diagnostics, storage economics, and privacy law collide on the same set of files. This guide builds a defensible, automated policy from those three constraints at once, and pairs every retention decision with the log rotation strategies that physically enforce it.

The objective is a documented, machine-enforced lifecycle: hot logs on fast disk for active analysis, warm compressed logs for recent audits, cold object storage for everything older, and a hard deletion boundary that satisfies your legal basis. By the end you will have retention windows mapped to crawl-cycle needs, an automated tier transition you can verify, an anonymization step that runs before anything leaves the box, and a troubleshooting runbook for the failures that silently delete or hoard data.

  • Set per-log-type retention windows from crawl frequency and legal basis
  • Automate hot → warm → cold → delete transitions with verifiable steps
  • Anonymize IPs before archival to keep cold storage GDPR-clean
  • Detect retention failures before they exhaust disk or lose SEO history

Why a Retention Policy Is a Lifecycle, Not a Number

People reach for a single number ("keep logs 90 days") and stop, but retention is a multi-phase lifecycle and each phase has a different cost, access pattern, and risk. An access log is read constantly in its first week, occasionally in its first quarter for an SEO audit, and almost never after that except for a forensic or compliance lookup. Storing all three access patterns on the same NVMe volume is wasteful; deleting at the first transition is destructive. The Information Lifecycle Management (ILM) model — hot, warm, cold, delete — exists precisely to match storage class to access pattern over time.

The diagram below shows that lifecycle as a timeline, with the SEO diagnostic windows and the GDPR pressure annotated on the same axis. The SEO need pulls retention longer (capture a full algorithmic update cycle); the privacy principle of storage limitation pushes it shorter (do not keep personal data, which a raw IP is, longer than the stated purpose requires). Your policy is the negotiated settlement between those two arrows, and the right answer is usually "keep, but anonymize early and tier aggressively." For the legal framing of why a client IP counts as personal data, see privacy and GDPR compliance for logs.

[Figure: Log retention lifecycle (ILM) timeline from day 0 to deletion. HOT (SSD, 0-14 days) → WARM (gzip on disk, to 90 days) → COLD (object store, to 180 days) → DELETE (purge + audit log). Above the axis: the SEO diagnostic window (keep crawl history). Below the axis: GDPR storage limitation (anonymize early, shorten the raw window). Anonymize IPs at the hot-to-warm boundary so cold storage holds no raw personal data. Retention is the overlap where SEO value still justifies keeping anonymized records.]
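As a rough illustration of the lifecycle, the sketch below maps a file's age to its tier. The function name tier_for_age is hypothetical, and the 14/90/180-day boundaries are simply the ones shown on the timeline; substitute your own windows.

```shell
#!/bin/bash
# Hypothetical helper: map a log file's age in days to its lifecycle tier.
# Boundaries follow the timeline (14 / 90 / 180 days); adjust to your policy.
tier_for_age() {
    local age_days=$1
    if   (( age_days < 14 ));  then echo "hot"
    elif (( age_days < 90 ));  then echo "warm"
    elif (( age_days < 180 )); then echo "cold"
    else                            echo "delete"
    fi
}

# Walk a few sample ages through the lifecycle
for age in 3 30 120 200; do
    printf '%3d days -> %s\n' "$age" "$(tier_for_age "$age")"
done
```

Keeping the boundaries in one function like this makes it harder for the rotation config, the archive job, and the bucket rule to drift apart silently.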

Prerequisites

Before you encode a retention policy, confirm the operational and legal groundwork is in place:

  • Root or sudo on the log host and permission to edit /etc/logrotate.d/ and deploy cron or systemd units.
  • A documented legal basis for keeping access logs (legitimate interest for security and analytics is common) and a stated retention period in your privacy notice. The policy you build must match what you published.
  • logrotate 3.14+ (logrotate --version) for reliable dateext and maxage behavior.
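  • An object-storage target (S3, GCS, or compatible) with a lifecycle rule capability, plus credentials scoped to what the archive job needs: write, tag, and read back for verification (on S3: s3:PutObject, s3:PutObjectTagging, and s3:GetObject, which head-object requires), with no delete permission.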
  • Knowledge of what each field means before you decide what to keep or mask. Review log field interpretation and decoding so you anonymize the right column and retain the diagnostically useful ones.

Defining Retention Windows for Crawl Analysis

Retention windows are not arbitrary. Derive each one from a concrete diagnostic question and the longest answer-lookback that question needs, then check it against your legal basis. Googlebot crawl behavior moves on cycles measured in weeks to a quarter, and core algorithm updates roll out and settle over a similar horizon, so a 90-day floor on access logs lets you correlate a ranking or crawl-rate shift with the log evidence after the fact. Error logs answer a shorter question — "why did 5xx spike last Tuesday" — and rarely need to outlive the incident plus a review cycle.

Step 1: Map each log type to a question, a window, and a legal note. The table below is the policy's core. Treat the retention column as a maximum for raw, identifiable data and a minimum for the anonymized derivative you keep for trends.

Log type | Retention window | Diagnostic question it answers | Privacy note
Access logs (raw, with IP) | 14–30 days, hot | Live crawl triage, fake-bot checks needing the source IP | Shortest possible; IP is personal data
Access logs (anonymized) | 90–180 days, warm/cold | Crawl-frequency trends, seasonal and update-cycle analysis | Safe to keep longer once IP is masked
Error logs | 30–60 days | 5xx/4xx spike correlation, fault isolation | Purge once the incident is closed and reviewed
Debug / verbose | 3–14 days | Active troubleshooting only | Highest volume; delete immediately after use
Security / audit | 180–365 days | Forensic timeline, access review | Often a separate legal basis; keep segregated

Step 2: Encode the window so it is enforced, not aspirational. A policy in a wiki does nothing; the rotate count and maxage in logrotate are what actually delete files. Express the warm-tier window in the rotation config and let archival handle the cold tier.

# Confirm the policy you THINK is active matches what logrotate will enforce
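# Anchor on the directive name so lines such as "postrotate" do not match
grep -E '^[[:space:]]*(rotate|maxage|daily|weekly)([[:space:]]|$)' /etc/logrotate.d/nginx-custom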

Expected Output:

    daily
    rotate 90
    maxage 90

If rotate and maxage disagree with the table above, the file lies about your policy. Reconcile them before going further. The mechanics of choosing rotate versus size triggers and handling reloads safely belong to log rotation strategies; this page decides the numbers, that page enforces them.
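The reconciliation can be scripted so it runs as a recurring check. This is a minimal sketch under stated assumptions: the function names directive_value and check_retention are hypothetical, the check reads only the first matching directive in a file, and the demo runs against a throwaway sample rather than anything under /etc.

```shell
#!/bin/bash
# Hypothetical check: do `rotate` and `maxage` in a logrotate file match the policy?
# Reads the first matching directive only; multi-stanza files need a stricter parser.
directive_value() {
    # $1 = directive name, $2 = config file; prints the directive's argument or nothing
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

check_retention() {
    local conf=$1 expected=$2 rotate maxage
    rotate=$(directive_value rotate "$conf")
    maxage=$(directive_value maxage "$conf")
    if [[ "$rotate" == "$expected" && "$maxage" == "$expected" ]]; then
        echo "OK: rotate=${rotate} maxage=${maxage}"
    else
        echo "MISMATCH: rotate=${rotate:-unset} maxage=${maxage:-unset} expected=${expected}"
        return 1
    fi
}

# Demo against a throwaway sample, so nothing under /etc is touched
sample=$(mktemp)
printf '/var/log/nginx/*.log {\n    daily\n    rotate 90\n    maxage 30\n}\n' > "$sample"
check_retention "$sample" 90 || echo "policy drift detected"
rm -f "$sample"
```

In practice you would point check_retention at /etc/logrotate.d/nginx-custom and alert on a non-zero exit.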

Production Warning: Never retain raw, IP-bearing logs indefinitely on a primary NVMe or SSD volume. Unchecked growth exhausts inodes and degrades disk I/O, which raises response times and directly shrinks effective crawl budget — and it leaves identifiable personal data sitting far past your stated retention period, which is itself a compliance finding.

Automating the Hot → Warm → Cold → Delete Transition

With windows defined, automate the movement between tiers so no human has to remember to delete anything. Three jobs do the work: rotation compresses hot to warm, an anonymize-and-archive script promotes warm to cold while stripping IPs, and an object-store lifecycle rule handles cold-to-delete. Each must be independently verifiable.

1. Compress hot to warm with rotation. This is the boundary where you also enforce the raw-retention ceiling. Keep the uncompressed hot window short, then compress.

/var/log/nginx/*.log {
    daily
    rotate 90
    maxage 90
    compress
    delaycompress
    missingok
    notifempty
    create 0640 www-data adm
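    dateext
    # Produce the YYYY-MM-DD suffix the archive script looks for
    # (the logrotate default is -YYYYMMDD)
    dateformat -%Y-%m-%d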
    sharedscripts
    postrotate
        # if/fi exits 0 when the pid file is absent, so rotation is not flagged as failed
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 "$(cat /var/run/nginx.pid)"
        fi
    endscript
}

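Explanation: rotate 90 plus maxage 90 caps the on-disk warm tier at roughly 90 days; delaycompress keeps the most recent rotation readable for shipping; the USR1 signal makes Nginx reopen its file handle so it does not keep writing to the rotated inode. Because logrotate deletes whatever falls outside those limits, the archive job in the next step has to collect each file before it ages out: if the two jobs can race on the same day, set rotate and maxage a few days above the archive age.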

Production Warning: postrotate runs a live signal against a running web server. Test the exact kill -USR1 line on staging first. A typo that sends the wrong signal (for example SIGTERM) will stop Nginx and take the site down at 3 a.m. when the timer fires.

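2. Anonymize, then archive, at the warm-to-cold boundary. Strip the IP before the data leaves the host, so cold storage never holds raw personal data. The script anonymizes the oldest rotated file, ships the masked copy to object storage with compliance tags, and only removes the local original after the upload is confirmed. Treat this as the last line of defense rather than the only mask: the policy table caps raw, IP-bearing data at 14–30 days, so if the warm tier still holds raw addresses, mask at the first rotation as well.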

#!/bin/bash
set -euo pipefail

ARCHIVE_DATE=$(date -d "90 days ago" +%Y-%m-%d)
SOURCE="/var/log/nginx/access.log-${ARCHIVE_DATE}.gz"
ANON="/var/log/archive/access-${ARCHIVE_DATE}.anon.gz"
BUCKET="s3://log-archive-bucket/nginx"

[ -f "$SOURCE" ] || { echo "No source for ${ARCHIVE_DATE}"; exit 0; }

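# Mask the client address before anything leaves the box:
# IPv4 keeps the first three octets, IPv6 keeps only the first hextet
mkdir -p "$(dirname "$ANON")"
zcat "$SOURCE" \
    | sed -E -e 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3} /\1.0 /' \
             -e 's/^([0-9A-Fa-f]{1,4}):[0-9A-Fa-f:.]*:[0-9A-Fa-f.]* /\1:: /' \
    | gzip > "$ANON"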

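# put-object accepts tags directly; a single PUT is limited to 5 GB per object.
# GLACIER is Glacier Flexible Retrieval; use GLACIER_IR for Instant Retrieval.
aws s3api put-object --bucket log-archive-bucket \
    --key "nginx/$(basename "$ANON")" --body "$ANON" \
    --storage-class GLACIER \
    --tagging "retention=180d&compliance=gdpr&anonymized=true" >/dev/null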

# Only after a verified upload do we drop the local copies
aws s3api head-object --bucket log-archive-bucket \
    --key "nginx/$(basename "$ANON")" >/dev/null \
    && rm -f "$SOURCE" "$ANON" \
    && echo "$(date -u +%Y-%m-%dT%H:%M:%SZ): archived+purged ${SOURCE}" >> /var/log/archive_audit.log

Expected Output:

2026-06-19T03:21:08Z: archived+purged /var/log/nginx/access.log-2026-03-21.gz

Safety Note: The head-object check is the guardrail that prevents irreversible loss. Never rm the local original on the basis of the upload command's exit code alone: a zero exit tells you the client finished, not that the object you expect is actually in the bucket under that key. Verify the object exists first, then delete. Keep a 24-hour grace period if you can afford the disk.
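Before trusting the anonymization step in production, it helps to exercise the mask offline on sample lines. The sketch below is one possible filter, not the only correct one: the name mask_client_ip is hypothetical, it assumes the client address is the first space-delimited field, and its IPv6 rule is deliberately coarse (it keeps only the first hextet). The sample addresses come from the documentation ranges, not real traffic.

```shell
#!/bin/bash
# Hypothetical filter: mask the client address in the first field of each line.
#   IPv4: zero the last octet          203.0.113.77  -> 203.0.113.0
#   IPv6: keep only the first hextet   2001:db8::1   -> 2001::
# The IPv6 rule is coarse on purpose; a /48-preserving mask needs a real parser.
mask_client_ip() {
    sed -E \
        -e 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\.[0-9]{1,3} /\1.0 /' \
        -e 's/^([0-9A-Fa-f]{1,4}):[0-9A-Fa-f:.]*:[0-9A-Fa-f.]* /\1:: /'
}

# Sample lines (documentation address ranges), not real traffic
mask_client_ip <<'EOF'
203.0.113.77 - - [21/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512
2001:db8:85a3::8a2e:370:7334 - - [21/Mar/2026:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 68
EOF
```

The two sample lines should come back starting with 203.0.113.0 and 2001:: respectively, with the rest of each line, including the colon-bearing timestamp, untouched.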

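3. Let the object store handle cold-to-delete. Do not script deletion of cold data; encode it as a bucket lifecycle rule so the retention boundary is declarative and auditable. Note that S3 counts Expiration days from the object's creation, which here is the upload, not from the log's original date: a file uploaded when it is 90 days old and expired 180 days later lives about 270 days in total. Set Days to 90 if the 180-day boundary in your policy is measured from when the lines were written.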

{
  "Rules": [{
    "ID": "expire-nginx-logs-180d",
    "Filter": { "Prefix": "nginx/" },
    "Status": "Enabled",
    "Expiration": { "Days": 180 }
  }]
}

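Apply the rule and confirm it is active. Note that put-bucket-lifecycle-configuration replaces the bucket's entire lifecycle configuration, so include any existing rules in lifecycle.json.

aws s3api put-bucket-lifecycle-configuration \
  --bucket log-archive-bucket --lifecycle-configuration file://lifecycle.json
aws s3api get-bucket-lifecycle-configuration --bucket log-archive-bucket \
  --query 'Rules[0].Expiration.Days'

Expected Output:

180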

For the deeper archival tier design — storage classes, retrieval SLAs, and integrity checks — see log storage and archival best practices.

Monitoring & Troubleshooting Retention Failures

A retention policy fails silently in both directions: it either deletes too much (you lose crawl history) or too little (disk fills, raw IPs linger past their window). Run these checks as a recurring gate, not a one-off.

Failure mode 1: Rotation stopped, disk filling. The most urgent failure. Confirm inode and space headroom and that rotation actually ran recently.

df -i /var/log; df -h /var/log
grep -h logrotate /var/log/syslog | tail -3

Detection: inode or space use climbing toward 100%, or no recent logrotate syslog entry. Fix: run sudo logrotate -f /etc/logrotate.d/nginx-custom, then find why the timer/cron stopped (a failing postrotate aborts the whole run, so check that signal line first).
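To turn this check into a recurring gate, something like the following sketch could run from cron or a systemd timer. The function names and the 85% threshold are assumptions for illustration; it relies on the POSIX output format of df -P and on GNU df accepting -i alongside it.

```shell
#!/bin/bash
# Hypothetical headroom gate for the log volume.
# parse_use_percent reads `df -P` (or `df -Pi`) output on stdin and prints
# the use% of the first filesystem listed, as a bare number.
parse_use_percent() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

check_headroom() {
    local path=$1 limit=$2 kind flag pct status=0
    for kind in space inodes; do
        flag="-P"
        [[ $kind == inodes ]] && flag="-Pi"
        pct=$(df "$flag" "$path" 2>/dev/null | parse_use_percent)
        if [[ ! $pct =~ ^[0-9]+$ ]]; then
            # Some filesystems report no inode figure; surface that instead of passing silently
            echo "WARN: no ${kind} figure for ${path}"
        elif (( pct >= limit )); then
            echo "ALERT: ${kind} at ${pct}% on ${path} (limit ${limit}%)"
            status=1
        fi
    done
    return "$status"
}

# Example: alert at 85% on the filesystem that holds /var/log
if check_headroom /var/log 85; then echo "no threshold breached"; fi
```

Checking inodes separately matters because a volume full of small rotated files can run out of inodes while df -h still shows free space.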

Failure mode 2: Files deleted too early, crawl history gone. Usually a rotate count far below the policy table, or a stray maxage shorter than intended.

ls -1 /var/log/nginx/access.log-*.gz | wc -l
grep -E '^[[:space:]]*(rotate|maxage)[[:space:]]' /etc/logrotate.d/nginx-custom

Detection: fewer archived files than your warm window should produce. Fix: raise rotate/maxage to match the table and confirm the archive job, not rotation, owns the oldest files. Compare against the SEO-specific guidance in Nginx log retention best practices for SEO.
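The file-count detection can be automated along these lines. This is a sketch with hypothetical function names; it assumes one rotation per day and the access.log-*.gz naming used above, and the demo runs in a scratch directory. With delaycompress the newest rotation is not yet compressed, so expect one file fewer than the window.

```shell
#!/bin/bash
# Hypothetical depth check: does the warm tier hold at least the expected
# number of compressed rotations? Assumes one rotation per day.
warm_depth() {
    # $1 = log directory; prints how many compressed access-log rotations it holds
    find "$1" -maxdepth 1 -type f -name 'access.log-*.gz' | wc -l | tr -d ' '
}

check_warm_depth() {
    local dir=$1 expected=$2 have
    have=$(warm_depth "$dir")
    if (( have < expected )); then
        echo "SHORT: ${have} rotations in ${dir}, policy expects ${expected}"
        return 1
    fi
    echo "OK: ${have} rotations in ${dir}"
}

# Demo in a scratch directory with three fake rotations
scratch=$(mktemp -d)
touch "$scratch"/access.log-2026-06-1{5,6,7}.gz
check_warm_depth "$scratch" 3
rm -rf "$scratch"
```

On a host that has not yet been running for a full window, pass the number of days since rotation was enabled instead of the full policy figure.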

Failure mode 3: Raw IPs reaching cold storage. The anonymize step was skipped or its regex missed IPv6. Audit a sample of what actually landed in the bucket.

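# Count first-field addresses that are still unmasked:
# IPv4 whose last octet is not 0, or IPv6 that still ends in a hextet
aws s3 cp s3://log-archive-bucket/nginx/access-2026-03-21.anon.gz - \
  | zcat | awk '{print $1}' \
  | grep -Ec '^([0-9]+\.){3}[1-9][0-9]*$|:[0-9a-fA-F.]+$'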

Detection: any non-zero count for full IPv4 (ending in a non-.0 octet) or IPv6 means raw addresses leaked. Fix: extend the mask to cover IPv6 and re-process; treat the unmasked object as a reportable retention defect. The masking patterns live in privacy and GDPR compliance for logs.
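The same audit can be packaged as a reusable filter and tried on sample lines before pointing it at a bucket. The name count_raw_ips is hypothetical, and the sketch assumes masked addresses take the forms "a.b.c.0" for IPv4 and a prefix ending in "::" for IPv6; adjust the patterns if your mask produces something else.

```shell
#!/bin/bash
# Hypothetical audit filter: count lines whose first field still looks like a
# raw address. Masked forms (IPv4 ending in .0, IPv6 ending in ::) pass.
count_raw_ips() {
    awk '
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $1 !~ /\.0$/ { leaked++ }  # unmasked IPv4
        $1 ~ /:/ && $1 !~ /::$/                                 { leaked++ }  # unmasked IPv6
        END { print leaked + 0 }
    '
}

# Two masked and two unmasked sample lines (documentation address ranges)
count_raw_ips <<'EOF'
203.0.113.0 - - [21/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512
198.51.100.23 - - [21/Mar/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 90
2001:: - - [21/Mar/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 90
2001:db8::1 - - [21/Mar/2026:10:00:03 +0000] "GET /c HTTP/1.1" 404 0
EOF
# prints 2
```

Because it inspects only the first field, the colons in the timestamp do not trigger false positives. A genuine address whose last octet happens to be 0 is indistinguishable from a masked one, so treat a zero count as necessary rather than sufficient evidence.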

Failure mode 4: Shipper tailing a dead inode after rotation. Rotation succeeded but your log shipper kept the old file handle, so it reads a file nobody writes to anymore and new lines never reach the pipeline.

sudo lsof +D /var/log/nginx/ | grep -i deleted

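Detection: a (deleted) entry held open by a shipper. Fix: ensure the shipper detects rotation and reopens the new file (Filebeat and Vector do this by tracking file identity; with plain tail, use tail -F, which follows the name rather than the descriptor), and confirm the postrotate reload fired.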

Aligning Retention with Compliance & SEO Workflows

The policy only holds if it is wired into your operating rhythm and your paperwork. Two integrations matter most. First, the data-minimization integration: hash or mask IPs at, or before, the hot-to-warm boundary so that everything downstream of the first transition is already anonymized — this is what lets you keep 180 days of trend data without keeping 180 days of personal data. Second, the audit integration: map retention windows to your quarterly SEO audit calendar so the data a planned audit needs is guaranteed to still exist when the audit runs, and document each lifecycle rule so a compliance review can trace the path from collection to deletion.

Workflow integration checklist:

  • [ ] Anonymize IPs at ingestion or the first rotation, before warm storage
  • [ ] Map retention windows to the quarterly SEO audit schedule so audits never hit deleted data
  • [ ] Document every lifecycle rule (config-as-code) for compliance review
  • [ ] Record retrieval SLAs: Glacier Flexible Retrieval 3–5 hours; Glacier Instant Retrieval milliseconds
  • [ ] Keep an append-only deletion audit log proving the policy executed

Storage tier | Class / location | Typical retrieval | Cost profile
Hot | Local SSD/NVMe | Instant | Highest per GB
Warm | Local disk, gzip | Instant (decompress) | Moderate
Cold | Glacier Instant Retrieval | Milliseconds | Low storage, higher request
Archive | Glacier Flexible Retrieval | 3–5 hours | Lowest storage

Production Warning: Before you rely on cold storage for incident response, prove a restore actually works. Run aws s3api restore-object on a sample and time it. Discovering during a live SEO or security incident that retrieval takes five hours, or that the object expired yesterday under a lifecycle rule you forgot, is the worst way to learn how your retention policy actually behaves.
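A restore drill along these lines is one way to time it. This is a sketch under stated assumptions: the function names are hypothetical, the bucket and key in the usage comment are the examples used earlier, the five-minute polling interval is arbitrary, and it applies to objects in Glacier Flexible Retrieval (objects in Glacier Instant Retrieval are read directly and need no restore request). It requires AWS credentials that are allowed to restore and read the object.

```shell
#!/bin/bash
# Hypothetical restore drill. restore_ready interprets the Restore field that
# head-object reports for an archived object: ongoing-request="false" means
# the temporary restored copy is available.
restore_ready() {
    [[ $1 == *'ongoing-request="false"'* ]]
}

time_restore() {
    local bucket=$1 key=$2 start status
    start=$(date +%s)
    # Ask for a one-day temporary copy at the Standard retrieval tier
    aws s3api restore-object --bucket "$bucket" --key "$key" \
        --restore-request '{"Days":1,"GlacierJobParameters":{"Tier":"Standard"}}' || return 1
    while true; do
        status=$(aws s3api head-object --bucket "$bucket" --key "$key" \
            --query Restore --output text) || return 1
        restore_ready "$status" && break
        sleep 300   # poll every five minutes
    done
    echo "restore took $(( $(date +%s) - start )) seconds"
}

# Usage (not run here; needs AWS credentials and a sample object):
#   time_restore log-archive-bucket nginx/access-2026-03-21.anon.gz
```

Record the measured time next to the retrieval SLA in your checklist so the figure you plan around is one you have observed.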

Common Mistakes

  • Treating retention as a single number. "Keep 90 days" ignores that hot, warm, and cold data have different costs and access patterns. Fix: define a per-tier lifecycle with distinct windows for raw versus anonymized data.
  • Retaining raw, IP-bearing logs on the primary volume indefinitely. Causes inode exhaustion and keeps personal data past its stated window. Fix: enforce rotate/maxage ceilings and offload to object storage with anonymization on the way out.
  • Deleting local logs on a successful copy exit code alone. A zero exit code says the client finished, not that the object is present in the bucket under the key you expect, so you risk deleting the only copy. Fix: verify the object exists with head-object before any rm, and keep a grace period.
  • Anonymizing only IPv4. A sed that masks a.b.c.d silently passes every IPv6 address straight to cold storage. Fix: mask IPv6 too and audit a sample of archived objects for leaked addresses.
  • Letting postrotate reload break the whole run. A bad signal line aborts rotation, so disk fills while you think the policy is working. Fix: test the exact signal command on staging and monitor syslog for aborted runs.

Frequently Asked Questions

What is the optimal log retention period for SEO crawl analysis?
Keep 90 to 180 days of access-log data so you can correlate crawl-rate and ranking shifts with full algorithm-update and seasonal cycles. Keep the raw, IP-bearing form for only 14 to 30 days, then retain the anonymized derivative for the longer window so you satisfy storage-limitation rules without losing trend history.

Do server access logs fall under the GDPR, and how does that affect retention?
Yes. A client IP address is personal data, so logs containing it are in scope and the storage-limitation principle applies: you must not keep them longer than your stated purpose requires. The practical answer is to anonymize early (mask the IP at the first rotation) and only keep the raw form for the short window where the source IP is genuinely needed, such as fake-bot verification.

Can I delete logs immediately after archiving them?
Only after verifying the archive landed and is queryable. Confirm the object exists in the bucket with head-object (not just a zero exit code from the copy), keep a 24-hour grace period when disk allows, and write the deletion to an append-only audit log. Premature deletion on an unverified upload is the most common way to lose data irreversibly.

How do retention policies impact crawl budget optimization?
Two ways. Operationally, unchecked log growth exhausts inodes and degrades disk I/O, raising response times and shrinking effective crawl budget. Analytically, a long-enough retention window keeps the historical bot-behavior record you need to diagnose a crawl anomaly weeks after it starts; too short a window and the evidence is already deleted when you go looking.

Part of the Server Log Fundamentals & Compliance series.