Nginx Log Retention Best Practices for SEO
Accurate crawl budget optimization relies on uninterrupted server log analysis, and that depends entirely on how long — and how cleanly — you keep your Nginx access logs. When retention is misconfigured, SEO teams hit three failure modes at once: data gaps where bot requests vanish before a parser reads them, disk exhaustion that silently halts ingestion, and skewed crawl metrics caused by premature truncation. This guide gives you exact Nginx and logrotate configuration, anonymization at write-time, and validation commands to maintain compliant, SEO-ready archives without dropping a single Googlebot hit.
The goal is a retention window long enough to survive batched parser cycles, short enough to satisfy privacy law, and structured so historical crawl-trend analysis stays intact. Align this with your broader log retention policies so crawler behavior remains fully traceable, and lean on solid log rotation strategies so the rotation that enforces retention never drops requests in the act.
Diagnosis: Confirming a Retention-Induced Data Gap
The symptom that sends most teams here is an SEO dashboard reporting artificially low Googlebot crawl rates despite verified live traffic. Before you touch any config, confirm the gap is real and caused by retention rather than by a parser bug. Count Googlebot hits per rotated file and look for a cliff:
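for f in /var/log/nginx/access.log*; do
    # zgrep reads plain and gzip files alike. A grep fallback would print a
    # second count, because zgrep -c exits non-zero whenever the count is 0.
    printf '%s\t%s\n' "$f" "$(zgrep -c -i googlebot "$f" 2>/dev/null)"
done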
Expected Output:
/var/log/nginx/access.log 14820
/var/log/nginx/access.log.1 13977
/var/log/nginx/access.log.2.gz 14102
/var/log/nginx/access.log.3.gz 0
/var/log/nginx/access.log.4.gz 0
The two trailing 0 files are the tell: rotation is producing files, but Googlebot data stops three days back. Either logs are being deleted before the parser reaches them, or compression mangled the archive. A healthy retention setup shows a roughly flat bot count across every retained file.
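The cliff check can also be scripted so it runs unattended. The following is a minimal sketch, assuming the same access.log* naming as above; the function name list_empty_generations is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print every retained log file that contains no
# Googlebot user-agent hits. zgrep reads plain and gzip files alike.
list_empty_generations() {
    log_dir=$1
    for f in "$log_dir"/access.log*; do
        [ -f "$f" ] || continue              # glob matched nothing
        # zgrep -c exits non-zero on a zero count, so ignore its exit
        # status and test the printed number instead
        hits=$(zgrep -c -i googlebot "$f" 2>/dev/null) || true
        if [ "${hits:-0}" -eq 0 ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage:
#   list_empty_generations /var/log/nginx
```

An empty result means every retained file still carries bot traffic; any path printed marks where the cliff begins.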
Check disk and inode pressure too, since a full filesystem is the other common cause of a silent ingestion halt:
df -h /var/log && df -i /var/log
Expected Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 49G 1.1G 98% /var/log
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 3.2M 3.1M 98K 97% /var/log
At 98% blocks and 97% inodes, Nginx will soon fail to open new log files. Both numbers must come down before retention can be trusted.
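To catch this pressure before it halts logging, the same df output can feed a threshold check. This is a sketch only: flag_pressure is a hypothetical helper, and the 90 percent limit in the usage lines is an assumed value to tune against your own alerting.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` or `df -Pi` output on stdin and print
# every mount point whose usage is at or above the given percentage.
flag_pressure() {
    limit=$1
    awk -v limit="$limit" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)                # "98%" -> "98"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }'
}

# Usage (90 is an assumed threshold):
#   df -P  /var/log | flag_pressure 90      # block usage
#   df -Pi /var/log | flag_pressure 90      # inode usage
```

The -P flag requests the POSIX single-line layout, which keeps the percentage in the fifth column for both the block and the inode report.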
Concept: Why Retention and Parser Cadence Collide
Retention is not just "how many days to keep." It is a contract between three clocks that rarely tick together: Nginx writes continuously, logrotate rotates on a fixed schedule or size trigger, and your log analyzer ingests in batches that often lag hours behind real time. If rotate or maxage deletes a file before the batch parser has read it, the crawl data in that file is gone forever — there is no replaying an access log.
The fix is a tiered window. Keep raw, uncompressed logs long enough to outlast the slowest parser cycle (commonly 14 days), then compress and hold for historical trend analysis (commonly 90 days), then delete. Delayed compression keeps the most recent file readable for real-time tooling while everything older is squeezed for storage. Privacy law adds a fourth constraint: raw IPs are personal data, so you anonymize at write-time rather than retaining identifiable data you would later have to purge. This is the same tiering logic behind log storage and archival best practices, applied specifically to keeping crawl telemetry intact.
Step-by-Step: A Retention-Safe Nginx Configuration
Step 1: Anonymize IPs at write-time.
Stripping the host portion before the line is written means you never retain personal data, so retention length is bounded only by SEO need, not by privacy risk. Add a map and a custom log format.
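# /etc/nginx/conf.d/anonymize_ip.conf
map $remote_addr $anonymized_ip {
    # IPv4: keep the first three octets, zero the last
    ~(?P<ip>\d+\.\d+\.\d+)\.\d+ $ip.0;
    # IPv6: zero the trailing group. The pattern must be quoted because
    # it contains braces, which Nginx would otherwise read as block syntax.
    "~(?P<ip>[0-9a-fA-F:]+):[0-9a-fA-F]{1,4}$" $ip:0;
    default 0.0.0.0;
}
log_format combined_anon '$anonymized_ip - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log combined_anon;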
Validate the config before reloading.
sudo nginx -t
Expected Output:
nginx: configuration file /etc/nginx/nginx.conf test is successful
This zeroes the last IPv4 octet and the trailing IPv6 group while preserving path, status, and user-agent — everything crawl analysis needs. Tie the privacy rationale back to your GDPR compliance controls so retention length is defensible.
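Once the new format is live, it is worth confirming that no full address still reaches disk. A minimal sketch, assuming the anonymized address is the first field of each line (as in combined_anon); count_unmasked is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count log lines whose first field is not a masked
# address (IPv4 ending in .0, IPv6 ending in :0, or the 0.0.0.0 fallback).
count_unmasked() {
    awk '$1 !~ /(\.0|:0)$/ { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_unmasked /var/log/nginx/access.log
```

Lines written before the format change will still be counted, so run it against a file that began after the reload. A genuine client address that happens to end in .0 is indistinguishable from a masked one, which makes this a sanity check rather than proof.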
Step 2: Define the tiered logrotate policy.
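This is the core retention rule: daily rotation, compression from the second rotated generation onward, and a hard 90-day ceiling. Note that delaycompress postpones compression by only one rotation cycle, so this stanza on its own keeps a single rotated file raw; holding a full 14 days uncompressed requires a separate compression step outside logrotate.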
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 90
    compress
    delaycompress
    notifempty
    maxage 90
    dateext
    create 0640 www-data adm
    sharedscripts
    postrotate
        # Ask Nginx to reopen its logs; exit cleanly if it is not running
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 "$(cat /var/run/nginx.pid)"
        fi
    endscript
}
Explanation: rotate 90 keeps 90 generations, maxage 90 deletes anything older than 90 days even if the count is not reached, and delaycompress leaves the most recent rotated file uncompressed for real-time parsers. The postrotate block sends SIGUSR1 so Nginx reopens its log files gracefully. dateext names archives by date (access.log-YYYYMMDD by default) instead of .1, .2, and so on, so a parser can map files to days unambiguously.
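The Concept section calls for roughly 14 days of raw logs, while delaycompress holds back compression for only one rotation cycle. One way to get the longer raw tier is a small scheduled job. The sketch below assumes that compress and delaycompress have been removed from the stanza above so this job owns both compression and expiry, and that archives use dateext names; apply_tiers is a hypothetical function, not a logrotate feature.

```shell
#!/bin/sh
# Hypothetical tiering job: gzip rotated logs older than raw_days, then
# delete compressed archives older than max_days. Assumes dateext naming
# (access.log-YYYYMMDD) and that logrotate itself no longer compresses.
apply_tiers() {
    log_dir=$1 raw_days=$2 max_days=$3
    # Raw -> compressed. gzip keeps the original mtime on the .gz file.
    find "$log_dir" -maxdepth 1 -type f -name 'access.log-*' ! -name '*.gz' \
        -mtime +"$raw_days" -exec gzip {} +
    # Compressed -> deleted. Age is still measured from the last write.
    find "$log_dir" -maxdepth 1 -type f -name 'access.log-*.gz' \
        -mtime +"$max_days" -exec rm -f {} +
}

# Usage, for example from a daily cron entry:
#   apply_tiers /var/log/nginx 14 90
```

The live access.log is never touched because the name pattern requires the date suffix. Note that find's -mtime +N matches files more than N full days old, so the raw tier runs slightly longer than N days rather than shorter.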
Step 3: Dry-run the rotation before trusting it.
Never deploy a retention change without simulating it first.
sudo logrotate -d /etc/logrotate.d/nginx
Expected Output:
rotating pattern: /var/log/nginx/*.log after 1 days (90 rotations)
considering log /var/log/nginx/access.log
log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 90
... (no errors)
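Production Warning: logrotate -d is debug/dry-run and changes nothing, but the very next step (-f) forces a real rotation and fires postrotate. Run the forced rotation only in a maintenance window, and confirm nginx -t passed first. SIGUSR1 only reopens log files and does not load new configuration, so the Step 1 log format takes effect only after a separate reload, and that reload should never be attempted against a config that fails the test.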
Step 4: Force one real rotation and confirm Nginx reopened cleanly.
sudo logrotate -f /etc/logrotate.d/nginx
sudo tail -n 2 /var/log/nginx/error.log
Expected Output:
2026/06/19 03:00:01 [notice] 8123#8123: signal 10 (SIGUSR1) received, reopening logs
2026/06/19 03:00:01 [notice] 8123#8123: reopened "/var/log/nginx/access.log"
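These reopen notices show that Nginx received the signal and now writes to the fresh file rather than the just-rotated one. They are logged at notice level, so they appear only when error_log is set to notice or a more verbose level; at the default error level their absence proves nothing. In that case, confirm directly that the new access.log is growing while the rotated file is not. If neither check passes, the postrotate signal did not land and new writes are going to the rotated file, which is the exact cause of post-rotation bot gaps.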
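File state offers a check that does not depend on what the error log records. The following is a heuristic sketch, not an Nginx facility: writes_reach_live_log is a hypothetical helper, and it assumes the site has received at least one request since the rotation.

```shell
#!/bin/sh
# Hypothetical heuristic: after a rotation, the live log should be non-empty
# and at least as recently modified as the newest rotated file. If Nginx
# missed the reopen signal, it keeps appending to the rotated file instead.
writes_reach_live_log() {
    live=$1 rotated=$2
    [ -s "$live" ] && [ ! "$rotated" -nt "$live" ]
}

# Usage (the rotated file name depends on your naming scheme):
#   writes_reach_live_log /var/log/nginx/access.log /var/log/nginx/access.log.1 \
#       || echo "Nginx may still be writing to the rotated file"
```

On a quiet site the live log can legitimately stay empty for a while, so a failure here is a prompt to look closer rather than a verdict.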
Edge Cases
Parser still mid-ingestion when rotation fires. Batched analyzers can lag the rotation clock. Gate deletion on a parser-completion flag rather than time alone. Have your ingestion job touch a marker, and refuse to archive files newer than the marker:
# parser writes this on successful ingestion of a day's file
touch /var/log/nginx/.ingested-$(date +%F)
# archival job skips any access.log not yet marked ingested
[ -f "/var/log/nginx/.ingested-$(date +%F -d yesterday)" ] || exit 0
This makes retention parser-aware: a slow batch run delays cleanup instead of losing data.
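The same gate can be applied per file rather than per run. This sketch makes two assumptions that differ slightly from the snippet above: archives use dateext names (access.log-YYYYMMDD), and the parser names each marker after the date in the file it ingested rather than the date of the run. list_archivable is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical gate: print only the rotated logs whose date has an ingestion
# marker. Assumes names like access.log-YYYYMMDD[.gz] and markers named
# .ingested-YYYY-MM-DD in the same directory, written by the parser.
list_archivable() {
    log_dir=$1
    for f in "$log_dir"/access.log-*; do
        [ -f "$f" ] || continue
        # Pull YYYYMMDD out of the file name and reformat it as YYYY-MM-DD
        day=$(printf '%s\n' "$f" |
            sed -n 's/.*access\.log-\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\).*/\1-\2-\3/p')
        [ -n "$day" ] || continue            # name carries no date
        if [ -f "$log_dir/.ingested-$day" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: hand only confirmed files to the archival or cleanup step
#   list_archivable /var/log/nginx
```

Files without a marker are simply skipped, so they stay on disk until a later run finds the parser has caught up.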
Compressed archives unreadable by the analyzer. If zgrep on a .gz returns binary garbage or zero hits where the raw file had thousands, the archive was corrupted mid-write, usually because compression ran while Nginx still held the file descriptor. delaycompress avoids this by never compressing the file Nginx might still be writing. Confirm by piping zcat on a suspect archive (for example access.log-20260618.gz under dateext) to head and checking that it returns readable log lines.
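Checking archives one at a time does not scale to a 90-file window. A minimal sketch that sweeps every archive with gzip's built-in integrity test; list_corrupt_archives is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every .gz archive that fails gzip's integrity test.
list_corrupt_archives() {
    log_dir=$1
    for f in "$log_dir"/access.log*.gz; do
        [ -f "$f" ] || continue              # glob matched nothing
        # gzip -t decompresses without writing output and exits non-zero
        # on a damaged or truncated stream
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Usage:
#   list_corrupt_archives /var/log/nginx
```

An empty result means every archive decompresses cleanly. It says nothing about whether the contents are complete, so pair it with the bot-count check.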
Verification
Prove the full retention window is intact and parseable end to end with one command that counts Googlebot hits across every retained generation, decompressing as needed, and lists the lowest counts first so any empty file surfaces at the top:
zgrep -c -i googlebot /var/log/nginx/access.log* 2>/dev/null | sort -t: -k2 -n | head -n 4
Expected Output:
/var/log/nginx/access.log-20260619:13977
/var/log/nginx/access.log-20260324.gz:13980
/var/log/nginx/access.log-20260618.gz:14102
/var/log/nginx/access.log:14820
Even the lowest-count generations report a healthy, comparable bot count, with no zero-count files, so retention is keeping crawl data continuous from today back through the full window.
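Healthy counts do not rule out a day whose file is missing altogether. The sketch below walks the calendar and reports dates with no archive. It assumes dateext names (access.log-YYYYMMDD, optionally .gz) and GNU date for the day arithmetic; list_missing_days is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print each date between first and last (YYYYMMDD,
# inclusive) that has neither a raw nor a gzipped archive.
list_missing_days() {
    log_dir=$1 first=$2 last=$3
    day=$first
    while [ "$day" -le "$last" ]; do
        if [ ! -f "$log_dir/access.log-$day" ] &&
           [ ! -f "$log_dir/access.log-$day.gz" ]; then
            printf '%s\n' "$day"
        fi
        # GNU date arithmetic; UTC avoids daylight-saving edge cases
        day=$(TZ=UTC date -d "$day +1 day" +%Y%m%d)
    done
}

# Usage:
#   list_missing_days /var/log/nginx 20260324 20260619
```

With notifempty in the stanza, a day with no traffic at all produces no archive, so a reported date is a lead to investigate rather than proof of loss.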
Common Mistakes
- Deleting raw logs before the parser finishes. Setting rotate/maxage purely by storage budget ignores parser lag and silently erases crawl telemetry. Gate cleanup on an ingestion-completion marker (see Edge Cases) so a slow batch delays deletion instead of destroying data.
- Retaining unanonymized IPs to extend the window. Keeping raw IPs for 90 days to "have more data" turns a retention policy into a compliance liability. Anonymize at write-time (Step 1); the crawl-relevant fields survive and the privacy clock never starts.
- Omitting delaycompress. Compressing the just-rotated file forces real-time parsers to stream .gz immediately and risks corrupting a file Nginx may still hold open. Always keep the newest rotated generation uncompressed.
Frequently Asked Questions
How long should Nginx logs be retained for accurate SEO analysis?
Keep raw, uncompressed logs for about 14 days so the slowest batched parser cycle completes, then hold compressed archives to roughly 90 days for historical crawl-trend analysis. Tie the upper bound to your documented retention policy and to privacy limits — anonymizing IPs at write-time lets you keep the structural crawl data longer without retaining personal data.
Does log compression affect crawl budget parser accuracy?
No, provided your parser supports gzip decompression and you use delaycompress. Delayed compression keeps the most recent rotated file readable for real-time ingestion while older files are squeezed for storage. Corruption only appears when a file is compressed while Nginx still holds its descriptor, which delaycompress prevents.
Can I automate log archiving without breaking Nginx or losing bot data?
Yes. Use a postrotate block that sends SIGUSR1 to the Nginx master so it reopens log files without dropping connections, and gate archival on a parser-completion marker so files are never deleted before ingestion confirms it read them. The reopen notice in the Nginx error log, visible when error_log is set to notice or a more verbose level, is your evidence the handoff worked.
Related Guides
- Configuring logrotate for High-Traffic Sites — the rotation mechanics that enforce this retention window at scale.
- Log Storage & Archival Best Practices — tiered cold-storage patterns for the compressed end of the window.
- Privacy & GDPR Compliance for Logs — why write-time anonymization bounds how long you may retain logs.
Part of the Log Retention Policies series.