Log Retention Policies
Log retention policy is the contract that decides how long every access and error line lives before it is compressed, archived, or destroyed. Set the window too short and you lose the historical crawl record you need to diagnose a Googlebot slowdown three months after it began; set it too long, or store it carelessly, and you breach data-minimization rules under the GDPR while quietly filling a production volume until inodes run out. A retention policy is where SEO diagnostics, storage economics, and privacy law collide on the same set of files. This guide builds a defensible, automated policy from those three constraints at once, and pairs every retention decision with the log rotation strategies that physically enforce it.
The objective is a documented, machine-enforced lifecycle: hot logs on fast disk for active analysis, warm compressed logs for recent audits, cold object storage for everything older, and a hard deletion boundary that satisfies your legal basis. By the end you will have retention windows mapped to crawl-cycle needs, an automated tier transition you can verify, an anonymization step that runs before anything leaves the box, and a troubleshooting runbook for the failures that silently delete or hoard data.
- Set per-log-type retention windows from crawl frequency and legal basis
- Automate hot → warm → cold → delete transitions with verifiable steps
- Anonymize IPs before archival to keep cold storage GDPR-clean
- Detect retention failures before they exhaust disk or lose SEO history
Why a Retention Policy Is a Lifecycle, Not a Number
People reach for a single number ("keep logs 90 days") and stop, but retention is a multi-phase lifecycle and each phase has a different cost, access pattern, and risk. An access log is read constantly in its first week, occasionally in its first quarter for an SEO audit, and almost never after that except for a forensic or compliance lookup. Storing all three access patterns on the same NVMe volume is wasteful; deleting at the first transition is destructive. The Information Lifecycle Management (ILM) model — hot, warm, cold, delete — exists precisely to match storage class to access pattern over time.
The diagram below shows that lifecycle as a timeline, with the SEO diagnostic windows and the GDPR pressure annotated on the same axis. The SEO need pulls retention longer (capture a full algorithmic update cycle); the privacy principle of storage limitation pushes it shorter (do not keep personal data, which a raw IP is, longer than the stated purpose requires). Your policy is the negotiated settlement between those two arrows, and the right answer is usually "keep, but anonymize early and tier aggressively." For the legal framing of why a client IP counts as personal data, see privacy and GDPR compliance for logs.
Prerequisites
Before you encode a retention policy, confirm the operational and legal groundwork is in place:
- Root or sudo on the log host and permission to edit /etc/logrotate.d/ and deploy cron or systemd units.
- A documented legal basis for keeping access logs (legitimate interest for security and analytics is common) and a stated retention period in your privacy notice. The policy you build must match what you published.
- logrotate 3.14+ (logrotate --version) for reliable dateext and maxage behavior.
- An object-storage target (S3, GCS, or compatible) with a lifecycle rule capability, plus credentials scoped to write-and-tag only.
- Knowledge of what each field means before you decide what to keep or mask. Review log field interpretation and decoding so you anonymize the right column and retain the diagnostically useful ones.
Defining Retention Windows for Crawl Analysis
Retention windows are not arbitrary. Derive each one from a concrete diagnostic question and the longest answer-lookback that question needs, then check it against your legal basis. Googlebot crawl behavior moves on cycles measured in weeks to a quarter, and core algorithm updates roll out and settle over a similar horizon, so a 90-day floor on access logs lets you correlate a ranking or crawl-rate shift with the log evidence after the fact. Error logs answer a shorter question — "why did 5xx spike last Tuesday" — and rarely need to outlive the incident plus a review cycle.
Step 1: Map each log type to a question, a window, and a legal note. The table below is the policy's core. Treat the retention column as a maximum for raw, identifiable data and a minimum for the anonymized derivative you keep for trends.
| Log type | Retention window | Diagnostic question it answers | Privacy note |
|---|---|---|---|
| Access logs (raw, with IP) | 14–30 days hot | Live crawl triage, fake-bot checks needing the source IP | Shortest possible; IP is personal data |
| Access logs (anonymized) | 90–180 days warm/cold | Crawl-frequency trends, seasonal and update-cycle analysis | Safe to keep longer once IP is masked |
| Error logs | 30–60 days | 5xx/4xx spike correlation, fault isolation | Purge once the incident is closed and reviewed |
| Debug / verbose | 3–14 days | Active troubleshooting only | Highest volume; delete immediately after use |
| Security / audit | 180–365 days | Forensic timeline, access review | Often a separate legal basis; keep segregated |
Step 2: Encode the window so it is enforced, not aspirational. A policy in a wiki does nothing; the rotate count and maxage in logrotate are what actually delete files. Express the warm-tier window in the rotation config and let archival handle the cold tier.
# Confirm the policy you THINK is active matches what logrotate will enforce
grep -E 'rotate|maxage|daily|weekly' /etc/logrotate.d/nginx-custom
Expected Output:
daily
rotate 90
maxage 90
If rotate and maxage disagree with the table above, the file lies about your policy. Reconcile them before going further. The mechanics of choosing rotate versus size triggers and handling reloads safely belong to log rotation strategies; this page decides the numbers, that page enforces them.
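The reconciliation step can be turned into a repeatable gate. The sketch below is one possible way to do that, not a logrotate feature: `check_retention` is a hypothetical helper name, and it assumes the config file holds a single stanza, because it reads only the first `rotate` and `maxage` directives it finds.

```shell
#!/bin/bash
# Sketch: fail loudly when a logrotate file drifts from the documented window.
# check_retention is a hypothetical helper, not part of logrotate.
check_retention() {
  local conf="$1" expected="$2" rotate maxage
  # Read the first "rotate N" and "maxage N" directives (single-stanza assumption)
  rotate=$(awk '$1 == "rotate" { print $2; exit }' "$conf")
  maxage=$(awk '$1 == "maxage" { print $2; exit }' "$conf")
  if [ "$rotate" = "$expected" ] && [ "$maxage" = "$expected" ]; then
    echo "OK: rotate=${rotate} maxage=${maxage}"
    return 0
  fi
  echo "DRIFT: rotate=${rotate:-unset} maxage=${maxage:-unset} expected=${expected}"
  return 1
}
```

A call such as `check_retention /etc/logrotate.d/nginx-custom 90` returns non-zero on drift, so it can sit in a cron job or CI check that alerts when the file and the policy table diverge.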
Production Warning: Never retain raw, IP-bearing logs indefinitely on a primary NVMe or SSD volume. Unchecked growth exhausts inodes and degrades disk I/O, which raises response times and directly shrinks effective crawl budget — and it leaves identifiable personal data sitting far past your stated retention period, which is itself a compliance finding.
Automating the Hot → Warm → Cold → Delete Transition
With windows defined, automate the movement between tiers so no human has to remember to delete anything. Three jobs do the work: rotation compresses hot to warm, an anonymize-and-archive script promotes warm to cold while stripping IPs, and an object-store lifecycle rule handles cold-to-delete. Each must be independently verifiable.
1. Compress hot to warm with rotation. This is the boundary where you also enforce the raw-retention ceiling. Keep the uncompressed hot window short, then compress.
/var/log/nginx/*.log {
daily
rotate 90
maxage 90
compress
delaycompress
missingok
notifempty
create 0640 www-data adm
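dateext
# Produce access.log-YYYY-MM-DD names, which the archive script expects (the default suffix is -YYYYMMDD)
dateformat -%Y-%m-%d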
sharedscripts
postrotate
[ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
endscript
}
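Explanation: rotate 90 plus maxage 90 caps the on-disk warm tier at roughly 90 days; delaycompress keeps the most recent rotation uncompressed and readable for shipping; the USR1 signal makes Nginx reopen its log files so it does not keep writing to the rotated inode. Because logrotate prunes whatever falls outside those limits, schedule the archive job in step 2 to run before the daily rotation, or add a few days of margin to both values. Otherwise the file the archive job is looking for may already be gone.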
Production Warning: postrotate runs a live signal against a running web server. Test the exact kill -USR1 line on staging first. A typo that sends the wrong signal (for example SIGTERM) will stop Nginx and take the site down at 3 a.m. when the timer fires.
2. Anonymize, then archive, at the warm-to-cold boundary. Strip the IP before the data leaves the host, so cold storage never holds raw personal data. The script anonymizes the oldest rotated file, ships the masked copy to object storage with compliance tags, and only removes the local original after the upload is confirmed.
#!/bin/bash
set -euo pipefail
# The date suffix must match logrotate's dateext naming; this assumes dateformat -%Y-%m-%d
ARCHIVE_DATE=$(date -d "90 days ago" +%Y-%m-%d)
SOURCE="/var/log/nginx/access.log-${ARCHIVE_DATE}.gz"
ANON="/var/log/archive/access-${ARCHIVE_DATE}.anon.gz"
BUCKET_NAME="log-archive-bucket"
KEY="nginx/$(basename "$ANON")"
[ -f "$SOURCE" ] || { echo "No source for ${ARCHIVE_DATE}"; exit 0; }
mkdir -p "$(dirname "$ANON")"
# Mask the final IPv4 octet before anything leaves the box (IPv6 is not handled by this pattern)
zcat "$SOURCE" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+/\1.0/' | gzip > "$ANON"
aws s3 cp "$ANON" "s3://${BUCKET_NAME}/${KEY}" --storage-class GLACIER
# aws s3 cp has no tagging option, so apply the compliance tags in a second call
aws s3api put-object-tagging --bucket "$BUCKET_NAME" --key "$KEY" \
--tagging 'TagSet=[{Key=retention,Value=180d},{Key=compliance,Value=gdpr},{Key=anonymized,Value=true}]'
# Only after a verified upload do we drop the local copies
aws s3api head-object --bucket "$BUCKET_NAME" --key "$KEY" >/dev/null \
&& rm -f "$SOURCE" "$ANON" \
&& echo "$(date -u +%Y-%m-%dT%H:%M:%SZ): archived+purged ${SOURCE}" >> /var/log/archive_audit.log
Expected Output:
2026-06-19T03:21:08Z: archived+purged /var/log/nginx/access.log-2026-03-21.gz
Safety Note: The head-object check is the guardrail that prevents irreversible loss. Never rm the local original on the basis of a successful cp exit code alone — a partial multipart upload can exit zero. Verify the object exists in the bucket first, then delete. Keep a 24-hour grace period if you can afford the disk.
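The 24-hour grace period mentioned in the Safety Note could be implemented by parking verified files in a holding directory instead of removing them immediately. This is a minimal sketch under stated assumptions: `hold_for_grace` and `purge_expired` are hypothetical helper names, the holding directory is whatever path you choose, and GNU `find` is assumed.

```shell
#!/bin/bash
# Sketch: replace an immediate rm with a timed hold. Helper names are hypothetical.
hold_for_grace() {
  local file="$1" holding="$2"
  mkdir -p "$holding"
  mv -- "$file" "$holding/"
  # Reset mtime so the grace clock starts at hand-off, not at rotation time
  touch -- "$holding/$(basename "$file")"
}

purge_expired() {
  local holding="$1" grace_minutes="${2:-1440}"
  # -mmin +N matches files last modified more than N minutes ago
  find "$holding" -maxdepth 1 -type f -mmin +"$grace_minutes" -print -delete
}
```

The `touch` matters: a rotated file carries a modification time from weeks ago, so without it the first purge pass would delete the file at once. Running `purge_expired` from a daily timer keeps the hold bounded and prints each path it removes.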
3. Let the object store handle cold-to-delete. Do not script deletion of cold data; encode it as a bucket lifecycle rule so the retention boundary is declarative and auditable.
{
"Rules": [{
"ID": "expire-nginx-logs-180d",
"Filter": { "Prefix": "nginx/" },
"Status": "Enabled",
"Expiration": { "Days": 180 }
}]
}
Apply the rule, then confirm it is active. The Expected Output of the final query is the expiration in days:
aws s3api put-bucket-lifecycle-configuration \
--bucket log-archive-bucket --lifecycle-configuration file://lifecycle.json
aws s3api get-bucket-lifecycle-configuration --bucket log-archive-bucket \
--query 'Rules[0].Expiration.Days'
180
For the deeper archival tier design — storage classes, retrieval SLAs, and integrity checks — see log storage and archival best practices.
Monitoring & Troubleshooting Retention Failures
A retention policy fails silently in both directions: it either deletes too much (you lose crawl history) or too little (disk fills, raw IPs linger past their window). Run these checks as a recurring gate, not a one-off.
Failure mode 1: Rotation stopped, disk filling. The most urgent failure. Confirm inode and space headroom and that rotation actually ran recently.
df -i /var/log; df -h /var/log
grep -h logrotate /var/log/syslog | tail -3
Detection: inode or space use climbing toward 100%, or no recent logrotate syslog entry. Fix: run sudo logrotate -f /etc/logrotate.d/nginx-custom, then find why the timer/cron stopped (a failing postrotate aborts the whole run, so check that signal line first).
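To run the headroom check as a recurring gate rather than by eye, the `df` output can be parsed and compared against a threshold. The following is a sketch only: the helper names are hypothetical, the 85% default is an arbitrary example value, and it relies on the POSIX output format of `df -P` (with `-i` for inodes), where the fifth column is the usage percentage.

```shell
#!/bin/bash
# Sketch: alert when space or inode usage crosses a limit. Helper names are hypothetical.
parse_use_pct() {
  # Reads `df -P` or `df -Pi` output on stdin; prints column 5 of the data row without "%"
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

is_over() {
  # is_over VALUE LIMIT: succeeds only when VALUE is a whole number >= LIMIT
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # some filesystems report "-" for inode usage
  esac
  [ "$1" -ge "$2" ]
}

check_headroom() {
  local path="$1" limit="${2:-85}" space inodes
  space=$(df -P "$path" | parse_use_pct)
  inodes=$(df -Pi "$path" | parse_use_pct)
  echo "space=${space}% inodes=${inodes}%"
  if is_over "$space" "$limit" || is_over "$inodes" "$limit"; then
    echo "ALERT: ${path} is at or above ${limit}%"
    return 1
  fi
}
```

`check_headroom /var/log 85` exits non-zero when either figure reaches the limit, which is enough for a cron wrapper or monitoring agent to raise an alert before rotation failures fill the volume.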
Failure mode 2: Files deleted too early, crawl history gone. Usually a rotate count far below the policy table, or a stray maxage shorter than intended.
ls -1 /var/log/nginx/access.log-*.gz | wc -l
grep -E 'rotate|maxage' /etc/logrotate.d/nginx-custom
Detection: fewer archived files than your warm window should produce. Fix: raise rotate/maxage to match the table and confirm the archive job, not rotation, owns the oldest files. Compare against the SEO-specific guidance in Nginx log retention best practices for SEO.
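A raw file count tells you something is missing but not which days. A per-day check is one way to narrow it down. This sketch assumes rotated files are named access.log-YYYY-MM-DD (optionally with .gz), as in the archive script, and that GNU `date` is available; `missing_days` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list the dates in the warm window that have no rotated file.
# Assumes access.log-YYYY-MM-DD[.gz] naming and GNU date; the helper name is hypothetical.
missing_days() {
  local dir="$1" days="$2" i day
  for ((i = 1; i <= days; i++)); do
    day=$(date -d "${i} days ago" +%Y-%m-%d)
    # Accept the compressed file or the uncompressed one left by delaycompress
    [ -e "${dir}/access.log-${day}.gz" ] || [ -e "${dir}/access.log-${day}" ] || echo "$day"
  done
}
```

`missing_days /var/log/nginx 90` prints one date per gap and nothing when the window is complete, so an empty result is the pass condition.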
Failure mode 3: Raw IPs reaching cold storage. The anonymize step was skipped or its regex missed IPv6. Audit a sample of what actually landed in the bucket.
aws s3 cp s3://log-archive-bucket/nginx/access-2026-03-21.anon.gz - \
| zcat | awk '{print $1}' \
| grep -Ec '^([0-9]+\.){3}[0-9]*[1-9][0-9]*$|:[0-9a-fA-F.]+$'
Detection: any non-zero count for full IPv4 (ending in a non-.0 octet) or IPv6 means raw addresses leaked. Fix: extend the mask to cover IPv6 and re-process; treat the unmasked object as a reportable retention defect. The masking patterns live in privacy and GDPR compliance for logs.
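As an illustration of extending the mask to IPv6, the filter below rewrites only the leading client-address field of each line. It is a sketch, not a compliance-approved scheme: `mask_client_ip` is a hypothetical helper name, and keeping at most the first three IPv6 groups (a /48) is an assumption you should check against your own legal guidance. When an address uses `::` compression inside those first three groups, the filter stops at the compression and keeps less, which over-masks but never leaks more.

```shell
#!/bin/bash
# Sketch: mask the leading client address for both IPv4 and IPv6.
# mask_client_ip is a hypothetical helper; the /48 IPv6 cut-off is an assumption.
mask_client_ip() {
  awk '
    {
      # Isolate the first field without re-splitting (and re-spacing) the line
      if (!match($0, /^[^ \t]+/)) { print; next }
      ip = substr($0, 1, RLENGTH)
      rest = substr($0, RLENGTH + 1)
      if (ip ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
        sub(/[0-9]+$/, "0", ip)               # zero the final IPv4 octet
      } else if (ip ~ /^[0-9a-fA-F:.]+$/ && index(ip, ":") > 0) {
        n = split(ip, group, ":")
        keep = ""
        for (i = 1; i <= n && i <= 3; i++) {
          if (group[i] == "") break           # stop at "::" instead of guessing
          keep = keep group[i] ":"
        }
        ip = (keep == "") ? "::" : (keep ":")
      }
      print ip rest
    }'
}
```

In the archive pipeline this would take the place of the IPv4-only `sed`, for example `zcat "$SOURCE" | mask_client_ip | gzip > "$ANON"`. Lines whose first field is not an address pass through unchanged.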
Failure mode 4: Shipper tailing a dead inode after rotation. Rotation succeeded but your log shipper kept the old file handle, so warm data never reaches the pipeline.
sudo lsof +D /var/log/nginx/ | grep -i deleted
Detection: a (deleted) entry held open by a shipper. Fix: ensure the shipper follows by inode (Filebeat, Vector) or use tail -F, and confirm the postrotate reload fired.
Aligning Retention with Compliance & SEO Workflows
The policy only holds if it is wired into your operating rhythm and your paperwork. Two integrations matter most. First, the data-minimization integration: hash or mask IPs at, or before, the hot-to-warm boundary so that everything downstream of the first transition is already anonymized — this is what lets you keep 180 days of trend data without keeping 180 days of personal data. Second, the audit integration: map retention windows to your quarterly SEO audit calendar so the data a planned audit needs is guaranteed to still exist when the audit runs, and document each lifecycle rule so a compliance review can trace the path from collection to deletion.
Workflow integration checklist:
- [ ] Anonymize IPs at ingestion or the first rotation, before warm storage
- [ ] Map retention windows to the quarterly SEO audit schedule so audits never hit deleted data
- [ ] Document every lifecycle rule (config-as-code) for compliance review
- [ ] Record retrieval SLAs: Glacier Flexible Retrieval 3–5 hours; Glacier Instant Retrieval milliseconds
- [ ] Keep an append-only deletion audit log proving the policy executed
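For the last checklist item, one possible shape for a deletion audit record is a timestamp, path, size, and checksum written before the file is removed. This is a sketch under assumptions: `audit_delete` is a hypothetical helper name, the record layout is illustrative, and `sha256sum` from GNU coreutils is assumed to be installed.

```shell
#!/bin/bash
# Sketch: record what was deleted before deleting it. audit_delete is a hypothetical helper.
audit_delete() {
  local file="$1" audit_log="${2:-/var/log/archive_audit.log}" sum size
  [ -f "$file" ] || { echo "audit_delete: no such file: $file" >&2; return 1; }
  sum=$(sha256sum -- "$file" | awk '{ print $1 }')
  size=$(wc -c < "$file" | tr -d ' ')
  # Write the audit record first; remove the file only if the write succeeded
  printf '%s deleted file=%s bytes=%s sha256=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$file" "$size" "$sum" >> "$audit_log" \
    && rm -f -- "$file"
}
```

The script itself only appends. Making the log tamper-resistant is a separate step, for instance setting the append-only attribute with `chattr +a` on filesystems that support it, or shipping the records to a write-once store.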
| Storage tier | Class / location | Typical retrieval | Cost profile |
|---|---|---|---|
| Hot | Local SSD/NVMe | Instant | Highest per GB |
| Warm | Local disk, gzip | Instant (decompress) | Moderate |
| Cold | Glacier Instant Retrieval | Milliseconds | Low storage, higher request |
| Archive | Glacier Flexible Retrieval | 3–5 hours | Lowest storage |
Production Warning: Before you rely on cold storage for incident response, prove a restore actually works. Run aws s3api restore-object on a sample and time it. Discovering during a live SEO or security incident that retrieval takes five hours, or that the object expired yesterday under a lifecycle rule you forgot, is the worst time to learn your retention policy.
Common Mistakes
- Treating retention as a single number. "Keep 90 days" ignores that hot, warm, and cold data have different costs and access patterns. Fix: define a per-tier lifecycle with distinct windows for raw versus anonymized data.
- Retaining raw, IP-bearing logs on the primary volume indefinitely. Causes inode exhaustion and keeps personal data past its stated window. Fix: enforce rotate/maxage ceilings and offload to object storage with anonymization on the way out.
- Deleting local logs on a successful copy exit code alone. A partial upload can exit zero, so you delete the only copy. Fix: verify the object exists with head-object before any rm, and keep a grace period.
- Anonymizing only IPv4. A sed that masks a.b.c.d silently passes every IPv6 address straight to cold storage. Fix: mask IPv6 too and audit a sample of archived objects for leaked addresses.
- Letting postrotate reload break the whole run. A bad signal line aborts rotation, so disk fills while you think the policy is working. Fix: test the exact signal command on staging and monitor syslog for aborted runs.
Frequently Asked Questions
What is the optimal log retention period for SEO crawl analysis?
Keep 90 to 180 days of access-log data so you can correlate crawl-rate and ranking shifts with full algorithm-update and seasonal cycles. Keep the raw, IP-bearing form for only 14 to 30 days, then retain the anonymized derivative for the longer window so you satisfy storage-limitation rules without losing trend history.
Do server access logs fall under the GDPR, and how does that affect retention?
Yes. A client IP address is personal data, so logs containing it are in scope and the storage-limitation principle applies: you must not keep them longer than your stated purpose requires. The practical answer is to anonymize early (mask the IP at the first rotation) and only keep the raw form for the short window where the source IP is genuinely needed, such as fake-bot verification.
Can I delete logs immediately after archiving them?
Only after verifying the archive landed and is queryable. Confirm the object exists in the bucket with head-object (not just a zero exit code from the copy), keep a 24-hour grace period when disk allows, and write the deletion to an append-only audit log. Premature deletion on an unverified upload is the most common way to lose data irreversibly.
How do retention policies impact crawl budget optimization?
Two ways. Operationally, unchecked log growth exhausts inodes and degrades disk I/O, raising response times and shrinking effective crawl budget. Analytically, a long-enough retention window keeps the historical bot-behavior record you need to diagnose a crawl anomaly weeks after it starts; too short a window and the evidence is already deleted when you go looking.
Related Guides
- Log Rotation Strategies — the rotation mechanics that physically enforce the windows defined here.
- Nginx Log Retention Best Practices for SEO — the Nginx-specific tuning of this policy for crawl analysis.
- Log Storage & Archival Best Practices — designing the cold tier, storage classes, and retrieval SLAs.
- Privacy & GDPR Compliance for Logs — why an IP is personal data and how to anonymize before archival.
- Log Field Interpretation & Decoding — knowing which field to mask and which to keep for diagnostics.
Part of the Server Log Fundamentals & Compliance series.